Inconsistency between gene_grouped_counts_linear.tsv and read_assignments.tsv.gz?

qsonehara commented 1 week ago

Hi, thank you for developing this tool.

I would like to ask about the interpretation of the read assignments file. I'm trying IsoQuant version 3.6.1 on single-cell data with the --no_model_construction option. The gene-expression output file (OUT.gene_grouped_counts_linear.tsv) had several cells with a low read count. For example, picking up a cell with a barcode CAGCAGCGTTGGGACA:

grep -w CAGCAGCGTTGGGACA OUT.transcript_grouped_counts_linear.tsv

reveals only one line with a count of 1.00: ENST00000549920.6 CAGCAGCGTTGGGACA 1.00

Meanwhile, when I looked into the read assignment file (OUT.read_assignments.tsv.gz), this cell had a total of 696 assignments, including 251 unique ones:

zgrep -w "CB=CAGCAGCGTTGGGACA" OUT.read_assignments.tsv.gz | cut -f6 | sort | uniq -c

54 ambiguous 52 inconsistent 94 inconsistent_ambiguous 177 inconsistent_non_intronic 13 noninformative 251 unique 55 unique_minor_difference

Representative lines:

#read_id        chr     strand  isoform_id      gene_id assignment_type assignment_events       exons   additional_info
molecule/2117770        chr1    +       ENST00000374550.8       ENSG00000142676.14      unique  fsm,tes_match:8,correct_polya_site_right:23696423       23691806-23691829,23692609-23692759,23693807-23693913,23694660-23694791,23695798-23695908,23696344-23696425     gene_assignment=unique; PolyA=True; Canonical=True; Classification=full_splice_match; CB=CAGCAGCGTTGGGACA;
molecule/2117771        chr1    +       ENST00000374550.8       ENSG00000142676.14      unique  fsm,tes_match:8,correct_polya_site_right:23696423       23691806-23691829,23692609-23692759,23693807-23693913,23694660-23694791,23695798-23695908,23696344-23696425     gene_assignment=unique; PolyA=True; Canonical=True; Classification=full_splice_match; CB=CAGCAGCGTTGGGACA;
molecule/2117773        chr1    +       ENST00000374550.8       ENSG00000142676.14      unique  fsm,tes_match_precise:1,correct_polya_site_right:23696416       23691806-23691829,23692609-23692759,23693807-23693913,23694660-23694791,23695798-23695908,23696344-23696418      gene_assignment=unique; PolyA=True; Canonical=True; Classification=full_splice_match; CB=CAGCAGCGTTGGGACA;
molecule/2117774        chr1    +       ENST00000374550.8       ENSG00000142676.14      unique  fsm,tes_match_precise:1,correct_polya_site_right:23696416       23691806-23691829,23692609-23692759,23693807-23693913,23694660-23694791,23695798-23695908,23696344-23696418      gene_assignment=unique; PolyA=True; Canonical=True; Classification=full_splice_match; CB=CAGCAGCGTTGGGACA;
molecule/2117778        chr1    +       ENST00000458455.2       ENSG00000142676.14      unique  fsm,correct_polya_site_right:23696414   23692608-23692759,23693807-23693913,23694660-23694791,23695798-23695908,23696344-23696412       gene_assignment=unique; PolyA=True; Canonical=True; Classification=full_splice_match; CB=CAGCAGCGTTGGGACA;
molecule/2117341        chr1    +       ENST00000373812.8       ENSG00000198492.16      unique  fsm,correct_polya_site_right:28769774   28736992-28737147,28737658-28737682,28738259-28738338,28742403-28743986,28768929-28769773       gene_assignment=unique; PolyA=True; Canonical=True; Classification=full_splice_match; CB=CAGCAGCGTTGGGACA;
molecule/2117960        chr1    -       ENST00000531243.2       ENSG00000090621.15      unique  mono_exonic,tss_match_precise:2 39576661-39576789       gene_assignment=unique; PolyA=False; Canonical=Unspliced; Classification=incomplete_splice_match; CB=CAGCAGCGTTGGGACA;
molecule/2117690        chr1    +       ENST00000304979.8       ENSG00000171960.12      unique  fsm,tes_match_precise:0,correct_polya_site_right:42676757       42658424-42658512,42658844-42658908,42659228-42659251,42659522-42659566,42660862-42660904,42664863-42664955,42665980-42666067,42666547-42666587,42667351-42667440,42676584-42676758      gene_assignment=unique; PolyA=True; Canonical=True; Classification=full_splice_match; CB=CAGCAGCGTTGGGACA;
molecule/2117928        chr1    +       ENST00000436427.1       ENSG00000065978.19      unique  mono_exonic,tes_match_precise:0,correct_polya_site_right:42702349       42702036-42702349       gene_assignment=unique; PolyA=True; Canonical=Unspliced; Classification=incomplete_splice_match; CB=CAGCAGCGTTGGGACA;
molecule/2117801        chr1    +       ENST00000311672.10      ENSG00000173660.12      unique  fsm,tss_match_precise:0,tes_match_precise:-3,correct_polya_site_right:46316774  46303698-46303820,46309101-46309127,46310155-46310316,46316552-46316773 gene_assignment=unique; PolyA=True; Canonical=True; Classification=full_splice_match; CB=CAGCAGCGTTGGGACA;

How do I interpret this result?

Thanks!

qsonehara commented 6 days ago

After some inspections, I noticed that I get different data in the linear-format output files every time I rerun IsoQuant. It seems that the second column (group_id) is shuffled.

I was wondering if

for i, g in enumerate(read_groups):

at line 250 of long_read_counter.py might be

for i, g in enumerate(self.ordered_groups):

https://github.com/ablab/IsoQuant/blob/51d39e34d3715da4b82e01de33597ebea300d029/src/long_read_counter.py#L243C1-L252C41

That seems to affect the correspondence between read group strings and numeric ids and eventually the linear output formatting at line 393.

https://github.com/ablab/IsoQuant/blob/51d39e34d3715da4b82e01de33597ebea300d029/src/long_read_counter.py#L388C1-L394C35

Thanks!

andrewprzh commented 2 days ago

Dear @qsonehara

Apologies for delays - I've been away for a while.

Thanks a lot for the investigation, yes, you are absolutely right, that causes inconsistency between self.ordered_groups and self.group_numeric_ids.

Thankfully this bug was introduced rather recently and only affects linear format. But anyway, apologies for not noticing this and huge respect for finding it. I make a bug-fix release ASAP.

Best Andrey

ablab / IsoQuant

Inconsistency between gene_grouped_counts_linear.tsv and read_assignments.tsv.gz? #258