Closed sparthib closed 2 months ago
Dear @sparthib
If a read is truncated (which is quite typical for ONT data), it may be impossible to unambiguously assign the isoform, especially for complex genes with multiple isoforms. Such reads are reported as ambiguous and may be printed more than once.
For examples: Isoforms: [ ]----[ ]----[ ]----------[ ] [ ]----[ ]----[ ]----[ ] [ ]----[ ]----[ ]------[ ]----------[ ] Read: [ ]----[ ]----[ ]
The read matches to all 3 isoforms equally well, so IsoQuant cannot tell which one exactly is it.
Best Andrey
I have a related question to this: how should I interpret differences between the read_assignments
file and the transcript_model_reads
file? Should I trust one over the other?
When I try to match up the read names here, I see all sorts of combinations
read_assignments
assign to a known transcript, but transcript_model_reads
doesn't assign to anything (I believe this is how I should interpret the *
entry in that file).I suppose transcript_model_reads
is downstream of read_assignments
, and so I should prefer that file. But in that case I might not know what kind of match it is (e.g. ISM, FSM), because there isn't a link back to the read assignment that was used. Is there something I'm missing here?
I'm using these results to interpret how different sequencing protocols affect our isoform results, so I want to make sure I'm using the best interpretation I can.
@jamestwebber
Since IsoQuant performs 2 independent steps (reference-based assignment / quantification and transcript discovery), there might be some discrepancies between the results (e.g. read assignments, isoform counts).
reads that
read_assignments
assign to a known transcript, buttranscript_model_reads
doesn't assign to anything (I believe this is how I should interpret the * entry in that file).
Yes *
is unassigned read. This may happen if transcript discovery algorithm decides that known isoform has not enough support to be reported (transcript_models.gtf
contain only transcripts that deemed to be expressed). Thus, since this isoform is not reported, respective read is marked as unassigned.
reads that are "noninformative" in the former file but assigned to some novel NNIC in the latter
Non-informative generally means that read has small overlap with known exons. Thus, if the NNIC isoform is located in intergenic or intronic region, its supporting reads may be classified as non-informative during the first step.
reads assigned ambiguously at first but then to a known transcript
Similarly to the first case, if a read assigned to both isoform A and B during the first step, but only one of them is reported in transcript_models.gtf
, this read now has a unique assignment.
Overall, if you are looking for read statistics I'd say read_assignments.tsv
is more informative and reliable since it does not depend on transcript discovery thresholds and simply matches reads to known transcripts. Of course, it does not include any novel isoforms, but can be used to counts stats for FSM/ISM, TSS/TES matches ect. transcript_model_reads
is mostly a supporting information and does not derive from read_assignments
, reads are generally re-assigned from scratch during the second step. So to sum up, I'd use read_assignments.tsv
to count reads stats and used files from the second step only in case I'm interested in novel isoforms.
I also think that there is more consistency between 2 steps in the latest version and there will be some more updates soon.
Best Andrey
I'll close this issue for now, please, re-open if needed.
A follow-up question: is there any reason a read wouldn't be in transcript_model_reads.tsv.gz
? I expected every read from my input to be present in that file, but some are missing.
I'm wondering if there's some filtering step, but I see the read in a different run on the same file, so that seems unlikely to me. Unless it's filtering out certain transcript assignments or something?
@jamestwebber
There are some default MAPQ filters applied even before read assignment. So in the latest version I'd expect all reads from read_assignments.tsv
to be in transcript_model_reads.tsv.gz
as well, but some reads from the initial BAM file can be discarded. Also, unmapped reads and supplementary alignments are ignored.
Hi @andrewprzh , I am going through the read assignments file and looks like there are instances with multiple entries of same read ID, each entry has a different isoform listed althought it's the same gene id. How do I go about interpreting this? I'd assumed each read ID would end up being assigned a unique isoform. Thanks!