Understanding how inconsistent reads are reused for de novo discovery

omarelgarwany commented 2 months ago

Hi

I'm trying to replicate some splice junctions in my long-read Kinnex data. Out of a total of 115 splice junctions, I managed to replicate about 100 from quantification obtained via the isoseq/SQANTI pipeline after removing artefacts (e.g. monoexonic, TS artefact, intr-priming). When I replicated them in the GFF produced by Isoquant, I only replicated the existence of 38. Since I was sure most of these do exist in my long-read data, I wanted to check the ones I couldn't replicate in the GFF in the read_assignment file.

As expected, many of them were present in reads classified as inconsistent for all kinds of different reasons, including exon skipping, major exon elongation, incomplete intron retention etc. My question is why are these not used for novel transcript discovery. What determines if a read is used for novel transcript discovery or not? I have included an example belowl, but please let me know if you need to see more examples.

This is an example of a splice junction that I could find in the isoseq processed data, in the read assignments file, but not in the GFF (chr2:102362718-102375907:+):

read_assignments.tsv.gz:


molecule/35283075       chr2    +       ENST00000409599.5       ENSG00000115604.12      inconsistent    exon_skipping_novel:102362719-102375906,major_exon_elongation_5:165     102355764-102356138,102356283-102356400,102362633-102362718,102375907-102376063,102381620-102381682,102384878-102384976 gene_assignment=inconsistent; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;
molecule/60791975       chr2    +       ENST00000409599.5       ENSG00000115604.12      inconsistent    exon_skipping_novel:102362719-102375906,major_exon_elongation_5:131     102355798-102356138,102356283-102356400,102362633-102362718,102375907-102376063,102381620-102381682,102384878-102384974 gene_assignment=inconsistent; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;
molecule/68810696       chr2    +       ENST00000233957.7       ENSG00000115604.12      inconsistent    exon_skipping_novel:102362719-102375906 102356269-102356400,102362633-102362718,102375907-102376063,102381620-102381682,102384878-102384973     gene_assignment=inconsistent; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;
molecule/365814 chr2    +       ENST00000233957.7       ENSG00000115604.12      inconsistent    exon_skipping_novel:102362719-102375906 102356278-102356400,102362633-102362718,102375907-102376063,102381620-102381682,102384878-102384976     gene_assignment=inconsistent; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;
molecule/47999392       chr2    +       ENST00000233957.7       ENSG00000115604.12      inconsistent_ambiguous  exon_skipping_novel:102362719-102375906 102356281-102356400,102362633-102362718,102375907-102376063,102381620-102381682,102384878-102384973     gene_assignment=inconsistent; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;
molecule/47999392       chr2    +       ENST00000409599.5       ENSG00000115604.12      inconsistent_ambiguous exon_skipping_novel:102362719-102375906 102356281-102356400,102362633-102362718,102375907-102376063,102381620-102381682,102384878-102384973     gene_assignment=inconsistent; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;
molecule/3981349        chr2    +       ENST00000233957.7       ENSG00000115604.12      inconsistent_ambiguous  exon_skipping_novel:102362719-102375906,incomplete_intron_retention_3:102376064-102381619       102356309-102356400,102362633-102362718,102375907-102377177     gene_assignment=inconsistent_ambiguous; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;
molecule/3981349        chr2    +       ENST00000409599.5       ENSG00000115604.12      inconsistent_ambiguous  exon_skipping_novel:102362719-102375906,incomplete_intron_retention_3:102376064-102381619       102356309-102356400,102362633-102362718,102375907-102377177     gene_assignment=inconsistent_ambiguous; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;
molecule/80230854       chr2    +       ENST00000233957.7       ENSG00000115604.12      inconsistent_ambiguous  exon_skipping_novel:102362719-102375906 102356309-102356400,102362633-102362718,102375907-102376063,102381620-102381682,102384878-102384998,102386861-102386895 gene_assignment=inconsistent; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;
molecule/80230854       chr2    +       ENST00000409599.5       ENSG00000115604.12      inconsistent_ambiguous  exon_skipping_novel:102362719-102375906 102356309-102356400,102362633-102362718,102375907-102376063,102381620-102381682,102384878-102384998,102386861-102386895 gene_assignment=inconsistent; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;
molecule/65835838       chr2    +       ENST00000233957.7       ENSG00000115604.12      inconsistent_ambiguous  exon_skipping_novel:102362719-102375906,incomplete_intron_retention_3:102387001-102390055       102356309-102356400,102362633-102362718,102375907-102376063,102381620-102381682,102384878-102384998,102386861-102387861 gene_assignment=inconsistent; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;
molecule/65835838       chr2    +       ENST00000409599.5       ENSG00000115604.12      inconsistent_ambiguous  exon_skipping_novel:102362719-102375906,incomplete_intron_retention_3:102387001-102390055       102356309-102356400,102362633-102362718,102375907-102376063,102381620-102381682,102384878-102384998,102386861-102387861 gene_assignment=inconsistent; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;
molecule/24814732       chr2    +       ENST00000233957.7       ENSG00000115604.12      inconsistent_ambiguous  exon_skipping_novel:102362719-102375906 102356309-102356400,102362633-102362718,102375907-102376063,102381620-102381682,102384878-102384998,102386861-102387000,102390056-102390217,102394469-102394627,102396531-102397198     gene_assignment=inconsistent_ambiguous; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;
molecule/24814732       chr2    +       ENST00000409599.5       ENSG00000115604.12      inconsistent_ambiguous  exon_skipping_novel:102362719-102375906 102356309-102356400,102362633-102362718,102375907-102376063,102381620-102381682,102384878-102384998,102386861-102387000,102390056-102390217,102394469-102394627,102396531-102397198     gene_assignment=inconsistent_ambiguous; PolyA=False; Canonical=True; Classification=novel_not_in_catalog;

GFF: Strangely the gene didn't exist in the GFF, but there are other examples of genes with transcripts in the GFF, with some splice junctions in the read_assignments file but not in the GFF.

Command:

${isoquant_bin} --reference ${genome_f} --genedb ${gtf_f} --complete_genedb --sqanti_output --yaml ${yaml_f} --data_type pacbio_ccs -o ${output_dir} --count_exons --check_canonical  --read_group tag:CB -t ${CPUS} --counts_format linear

andrewprzh commented 2 months ago

Dear @omarelgarwany

In fact, IsoQuant uses all reads for de novo transcript reconstruction (except maybe ones with low mapping quality). However, transcript reconstruction implements various filtering strategies. For example, to report a novel transcript, IsoQuant requires a certain number of full-length reads to support it (exact number depends on data type and read coverage in this particular genomic region). There are also some other filtering methods.

Overall, IsoQuant implements a rather conservative strategy, which allows to maintain high precision, in some case in cost of recall. If you are interested in particular splice junction (rather than novel full-length isoforms in general), you can try setting different strategies, e.g. --model_construction_strategy sensitive_pacbio or even --model_construction_strategy all. The later one may yield a lot of false positives thought.

Best Andrey

omarelgarwany commented 2 months ago

Thanks very much for answering the question.

I read a little more about the intron graph construction and that a novel path along this graph (i.e. transcript) requires at least 3 Pacbio full-length reads to support it (I assume inconsistent reads are not counted in the default --model_construction_strategy).

So if I understand correctly, a novel transcript that has only 1 unique read and, say, 10 inconsistent reads supporting it will be discarded entirely even if a given splice junction is featured in all 11 reads. That makes total sense in terms of transcript discovery because we can't even be sure what that transcript looks like. But I guess may lead to missing some real splice junctions because the reads featuring it are not consistent with regards to the rest of the transcript.

Does that sound correct?

Thanks again.

andrewprzh commented 2 months ago

In fact, during the isoform construction all reads are treated in the same way (consistent and inconsistent). Moreover, novel isoforms stem from inconsistent reads, since reads from unknown isoforms are not consistent with the reference annotation (and therefore marked as inconsistent during read assignment). Thus, even 3 inconsistent reads is enough. They just have to be full length, i.e. cover a potential novel transcript from TSS to polyA site.

However, if you have 10 non-full-length reads and only 2 full length, your novel transcript won't be reported, even though some junctions may be cover 12 times. So in these terms even high supported novel splice junctions may not make it to the resulting annotation.

Hope that clarifies the process, but don't hesitate to ask more questions.

baozg commented 3 days ago

Hi, @andrewprzh

Sorry to join the loop. Is there any "--min_count" setting for model construction? Like, FLAIR use min count 3 as default. I think it would be better to show the preset (--model_construction_strategy sensitive_pacbio vs --model_construction_strategy sensitive_ont) meaning? Since the ONT data evolved rapidly with basecalling development, when I do rough counting with bedtools coverage, I could find more 3000 genes compared to IsoQuant default with ONT-cDNA R10 data.

ablab / IsoQuant

Understanding how inconsistent reads are reused for de novo discovery #243