Open omarelgarwany opened 2 months ago
Dear @omarelgarwany
In fact, IsoQuant uses all reads for de novo transcript reconstruction (except maybe ones with low mapping quality). However, transcript reconstruction implements various filtering strategies. For example, to report a novel transcript, IsoQuant requires a certain number of full-length reads to support it (exact number depends on data type and read coverage in this particular genomic region). There are also some other filtering methods.
Overall, IsoQuant implements a rather conservative strategy, which allows to maintain high precision, in some case in cost of recall. If you are interested in particular splice junction (rather than novel full-length isoforms in general), you can try setting different strategies, e.g. --model_construction_strategy sensitive_pacbio
or even --model_construction_strategy all
. The later one may yield a lot of false positives thought.
Best Andrey
Thanks very much for answering the question.
I read a little more about the intron graph construction and that a novel path along this graph (i.e. transcript) requires at least 3 Pacbio full-length reads to support it (I assume inconsistent reads are not counted in the default --model_construction_strategy
).
So if I understand correctly, a novel transcript that has only 1 unique read and, say, 10 inconsistent reads supporting it will be discarded entirely even if a given splice junction is featured in all 11 reads. That makes total sense in terms of transcript discovery because we can't even be sure what that transcript looks like. But I guess may lead to missing some real splice junctions because the reads featuring it are not consistent with regards to the rest of the transcript.
Does that sound correct?
Thanks again.
In fact, during the isoform construction all reads are treated in the same way (consistent and inconsistent). Moreover, novel isoforms stem from inconsistent reads, since reads from unknown isoforms are not consistent with the reference annotation (and therefore marked as inconsistent during read assignment). Thus, even 3 inconsistent reads is enough. They just have to be full length, i.e. cover a potential novel transcript from TSS to polyA site.
However, if you have 10 non-full-length reads and only 2 full length, your novel transcript won't be reported, even though some junctions may be cover 12 times. So in these terms even high supported novel splice junctions may not make it to the resulting annotation.
Hope that clarifies the process, but don't hesitate to ask more questions.
Hi, @andrewprzh
Sorry to join the loop. Is there any "--min_count" setting for model construction? Like, FLAIR use min count 3 as default. I think it would be better to show the preset (--model_construction_strategy sensitive_pacbio
vs --model_construction_strategy sensitive_ont
) meaning? Since the ONT data evolved rapidly with basecalling development, when I do rough counting with bedtools coverage
, I could find more 3000 genes compared to IsoQuant default with ONT-cDNA R10 data.
Hi
I'm trying to replicate some splice junctions in my long-read Kinnex data. Out of a total of 115 splice junctions, I managed to replicate about 100 from quantification obtained via the isoseq/SQANTI pipeline after removing artefacts (e.g. monoexonic, TS artefact, intr-priming). When I replicated them in the GFF produced by Isoquant, I only replicated the existence of 38. Since I was sure most of these do exist in my long-read data, I wanted to check the ones I couldn't replicate in the GFF in the read_assignment file.
As expected, many of them were present in reads classified as inconsistent for all kinds of different reasons, including exon skipping, major exon elongation, incomplete intron retention etc. My question is why are these not used for novel transcript discovery. What determines if a read is used for novel transcript discovery or not? I have included an example belowl, but please let me know if you need to see more examples.
This is an example of a splice junction that I could find in the isoseq processed data, in the read assignments file, but not in the GFF (chr2:102362718-102375907:+):
read_assignments.tsv.gz:
GFF: Strangely the gene didn't exist in the GFF, but there are other examples of genes with transcripts in the GFF, with some splice junctions in the read_assignments file but not in the GFF.
Command: