read assignments file interpretation

sparthib commented 3 months ago

Hi @andrewprzh , I am going through the read assignments file and looks like there are instances with multiple entries of same read ID, each entry has a different isoform listed althought it's the same gene id. How do I go about interpreting this? I'd assumed each read ID would end up being assigned a unique isoform. Thanks!

read_id                                   transcript_id            gene_id
298  368bdc48-89e2-43d6-9321-8a60419ca652 ENST00000477740.5  ENSG00000238009.6
299  368bdc48-89e2-43d6-9321-8a60419ca652 ENST00000610542.1  ENSG00000238009.6
2521 e7ca5b35-94d1-4608-aff7-771de4fa1fe2 ENST00000429505.6 ENSG00000237491.10
2522 e7ca5b35-94d1-4608-aff7-771de4fa1fe2 ENST00000651411.1 ENSG00000237491.10
2523 e7ca5b35-94d1-4608-aff7-771de4fa1fe2 ENST00000655765.1 ENSG00000237491.10
2524 e7ca5b35-94d1-4608-aff7-771de4fa1fe2 ENST00000656571.1 ENSG00000237491.10

andrewprzh commented 3 months ago

Dear @sparthib

If a read is truncated (which is quite typical for ONT data), it may be impossible to unambiguously assign the isoform, especially for complex genes with multiple isoforms. Such reads are reported as ambiguous and may be printed more than once.

For examples: Isoforms: [ ]----[ ]----[ ]----------[ ] [ ]----[ ]----[ ]----[ ] [ ]----[ ]----[ ]------[ ]----------[ ] Read: [ ]----[ ]----[ ]

The read matches to all 3 isoforms equally well, so IsoQuant cannot tell which one exactly is it.

Best Andrey

jamestwebber commented 2 months ago

I have a related question to this: how should I interpret differences between the read_assignments file and the transcript_model_reads file? Should I trust one over the other?

When I try to match up the read names here, I see all sorts of combinations

reads that read_assignments assign to a known transcript, but transcript_model_reads doesn't assign to anything (I believe this is how I should interpret the * entry in that file).
reads that are "noninformative" in the former file but assigned to some novel NNIC in the latter
reads assigned ambiguously at first but then to a known transcript
etc

I suppose transcript_model_reads is downstream of read_assignments, and so I should prefer that file. But in that case I might not know what kind of match it is (e.g. ISM, FSM), because there isn't a link back to the read assignment that was used. Is there something I'm missing here?

I'm using these results to interpret how different sequencing protocols affect our isoform results, so I want to make sure I'm using the best interpretation I can.

andrewprzh commented 2 months ago

@jamestwebber

Since IsoQuant performs 2 independent steps (reference-based assignment / quantification and transcript discovery), there might be some discrepancies between the results (e.g. read assignments, isoform counts).

reads that read_assignments assign to a known transcript, but transcript_model_reads doesn't assign to anything (I believe this is how I should interpret the * entry in that file).

Yes * is unassigned read. This may happen if transcript discovery algorithm decides that known isoform has not enough support to be reported (transcript_models.gtf contain only transcripts that deemed to be expressed). Thus, since this isoform is not reported, respective read is marked as unassigned.

reads that are "noninformative" in the former file but assigned to some novel NNIC in the latter

Non-informative generally means that read has small overlap with known exons. Thus, if the NNIC isoform is located in intergenic or intronic region, its supporting reads may be classified as non-informative during the first step.

reads assigned ambiguously at first but then to a known transcript

Similarly to the first case, if a read assigned to both isoform A and B during the first step, but only one of them is reported in transcript_models.gtf, this read now has a unique assignment.

Overall, if you are looking for read statistics I'd say read_assignments.tsv is more informative and reliable since it does not depend on transcript discovery thresholds and simply matches reads to known transcripts. Of course, it does not include any novel isoforms, but can be used to counts stats for FSM/ISM, TSS/TES matches ect. transcript_model_reads is mostly a supporting information and does not derive from read_assignments, reads are generally re-assigned from scratch during the second step. So to sum up, I'd use read_assignments.tsv to count reads stats and used files from the second step only in case I'm interested in novel isoforms.

I also think that there is more consistency between 2 steps in the latest version and there will be some more updates soon.

Best Andrey

andrewprzh commented 2 months ago

I'll close this issue for now, please, re-open if needed.

jamestwebber commented 1 month ago

A follow-up question: is there any reason a read wouldn't be in transcript_model_reads.tsv.gz? I expected every read from my input to be present in that file, but some are missing.

I'm wondering if there's some filtering step, but I see the read in a different run on the same file, so that seems unlikely to me. Unless it's filtering out certain transcript assignments or something?

andrewprzh commented 1 month ago

@jamestwebber

There are some default MAPQ filters applied even before read assignment. So in the latest version I'd expect all reads from read_assignments.tsv to be in transcript_model_reads.tsv.gz as well, but some reads from the initial BAM file can be discarded. Also, unmapped reads and supplementary alignments are ignored.

ablab / IsoQuant

read assignments file interpretation #204