BUG: Isoquant is missing 1 and 2-exon genes from PacBio RNA-seq data

sarahcalvo commented 12 months ago

I'm working with Brian Haas to use Isoquant to reconstruct genes in an amoeba species. This species has genes tightly packed together with almost no intergenic space, and genes typically have many introns.

Isoquant is completely missing several obvious 1- and 2-exon genes, whether run with default params or with any of the following parameters: --report_novel_unspliced "true"; --model_construction_strategy "sensitive_pacbio"; --fl_data

I created a tiny BAM file with a single region with 8 genes (~900 reads total), of which Isoquant only calls transcripts for 6 (using any of the above flags).

Here are some example files in directory https://personal.broadinstitute.org/scalvo/for_isoquant_debugging/

Isoquant.example_missing_genes.pptx : screen shot showing 2 missing genes
T1.r1.mini.bam : bam file with ~900 reads, that should have 8 transcripts
out_mini_default : directory with output of isoquant using default parameters (run on this mini.bam file)

Any suggestions?

andrewprzh commented 11 months ago

Dear @sarahcalvo

Based on my experience, mono-exonic and single-intron alignments are incorrect significantly more often than alignments with 3 or more exons. Thus, IsoQuant performs additional checks for these alignments; for example, it filters them out based on mapping quality or the presence of a polyA tail. I presume some of these filters may affect your results.

Thank you for the data. My schedule is tight at the moment, but I hope to get my hands on it ASAP.

Best Andrey

sarahcalvo commented 11 months ago

Dear @andrewprzh, I know this is a super busy time of year and I've been trying various work-arounds. But I just wanted to let you know that when you have a chance to look into this (in the new year) -- I'm super eager to follow-up! Best, and happy holidays -- Sarah

andrewprzh commented 11 months ago

Dear @sarahcalvo

The answer turned out to be simpler than I thought. IsoQuant is being extra careful about 1- and 2-exon alignments as they can be false positives (typically for ONT data). Thus, IsoQuant requires them to have a polyA tail in order to be used for transcript construction. Your reads have no polyA tails, and thus 1- and 2-exon transcripts are entirely missed.
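To make the rule described above concrete, here is a minimal sketch of such a filter. It is not IsoQuant's actual code: the `Alignment` class, the function name, and the `max_exons_needing_polya` parameter are all hypothetical and only illustrate why reads without polyA tails yield no 1- or 2-exon transcripts.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical stand-in for a read alignment (not an IsoQuant type)."""
    exon_count: int
    has_polya: bool


def usable_for_model_construction(aln: Alignment,
                                  max_exons_needing_polya: int = 2) -> bool:
    """Illustrative rule: short exon chains must be supported by a polyA tail."""
    if aln.exon_count <= max_exons_needing_polya:
        return aln.has_polya
    # Alignments with 3 or more exons are accepted without polyA evidence.
    return True


reads = [
    Alignment(exon_count=1, has_polya=False),
    Alignment(exon_count=2, has_polya=False),
    Alignment(exon_count=5, has_polya=False),
    Alignment(exon_count=2, has_polya=True),
]
kept = [r for r in reads if usable_for_model_construction(r)]
```

With polyA-trimmed input every read has `has_polya=False`, so under this assumed rule only the multi-exon alignment would survive.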

At least for 2-exon transcripts that can be easily fixed with an option. For monoexonic it might take some time as polyA position is essential for the detection.

Anyway, I'll see what can be done and hopefully will improve that in the next release.

Best Andrey

sarahcalvo commented 11 months ago

Thanks so much Andrey! This raises some very interesting biological hypotheses too that we will look into, unless it's an artifact from the MAS-ISO-seq processing pipeline. I'll look into both options and let you know!

Sarah


andrewprzh commented 11 months ago

@sarahcalvo did you do some read cleaning/trimming before using IsoQuant? Because polyA is detected in exactly 0 reads.
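A quick way to sanity-check this kind of observation is to look at raw read sequences for terminal A runs (or leading T runs for reverse-strand reads). The sketch below is a naive check, not IsoQuant's polyA detection; the function names and the `min_run=10` threshold are assumptions chosen for illustration, and the sequences would have to be extracted from the BAM file separately.

```python
def trailing_run(seq: str, base: str) -> int:
    """Length of the uninterrupted run of `base` at the end of `seq`."""
    n = 0
    for ch in reversed(seq):
        if ch != base:
            break
        n += 1
    return n


def leading_run(seq: str, base: str) -> int:
    """Length of the uninterrupted run of `base` at the start of `seq`."""
    n = 0
    for ch in seq:
        if ch != base:
            break
        n += 1
    return n


def has_polya_tail(seq: str, min_run: int = 10) -> bool:
    """True if the read ends in >= min_run A's or starts with >= min_run T's."""
    seq = seq.upper()
    return trailing_run(seq, "A") >= min_run or leading_run(seq, "T") >= min_run


def polya_fraction(seqs, min_run: int = 10) -> float:
    """Fraction of reads with a detectable tail (0.0 for empty input)."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    return sum(has_polya_tail(s, min_run) for s in seqs) / len(seqs)
```

A fraction of 0.0 across hundreds of reads would suggest the tails were trimmed upstream rather than being absent biologically.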

Best Andrey

sarahcalvo commented 11 months ago

MAS-ISO-seq is a new experimental method where cDNA transcripts are concatenated together then sequenced with PacBio. They have developed a software pipeline that processes the long concatenated consensus reads into the transcript reads. I hadn't realized none of the reads had polyA, but my guess is the MAS-ISO-seq pipeline trims the polyA as part of its processing. I'll ask Brian Haas to confirm.

Sarah


sarahcalvo commented 11 months ago

Yes, just confirmed that polyA is trimmed from the ends of the reads as part of the standard/official MAS-ISO-seq processing pipeline, and so shouldn't show up in any of the reads that get aligned to the genome.

Sarah


andrewprzh commented 11 months ago

@sarahcalvo

I know that the IsoSeq pipeline provides CCS headers that contain information about polyA tails detected in reads. I think I might implement support for them at some point if that would be useful.

Meanwhile, I improved the reporting of novel mono-intronic transcripts and added new options that let the user tune polyA usage. This will come out in the next release.

Monoexonic transcripts are still in question as polyA positions are essential for clustering reads together.
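As an illustration of why polyA positions matter here, the sketch below groups reads by their 3' end coordinate: without introns there is no splice structure to compare, so the end position is one of the few signals left for deciding which reads belong together. This is a generic sort-and-group sketch, not IsoQuant's clustering; `cluster_positions` and the `max_gap=10` tolerance are hypothetical.

```python
def cluster_positions(positions, max_gap: int = 10):
    """Group coordinates so that neighbours within a cluster differ by <= max_gap."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            # Close enough to the previous coordinate: same putative polyA site.
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


# Hypothetical 3' end coordinates of five mono-exonic reads.
ends = [1000, 1003, 998, 2500, 2504]
sites = cluster_positions(ends)
```

Here the five reads fall into two groups, which under this assumption would correspond to two candidate transcript ends.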

Best Andrey

sarahcalvo commented 11 months ago

Thanks so much Andrey! I look forward to the next release.

Yes, it would definitely be helpful to have a version that supports IsoSeq (with info from polyA tails in the CCS headers). The technology seems to be working great.

Sarah


--

Sarah Calvo, Ph.D. Sr. Computational Biologist Broad Institute of MIT/Harvard @.*** 617-714-7687

andrewprzh commented 10 months ago

Dear @sarahcalvo

Would it be possible for you to share just a few lines from any CCS header files, if you happen to have any?

Best Andrey

sarahcalvo commented 10 months ago

Hi Andrey,

Yes! Sorry for the delay. This file has the info for the region in the mini bam I sent you before: https://personal.broadinstitute.org/scalvo/for_isoquant_debugging/T1.r1.mini.read.refine.report.csv

Here are a few lines:

id,strand,fivelen,threelen,polyAlen,insertlen,primer
m84043_231012_201426_s1/160961840/ccs/2694_3482,+,9,7,33,788,asa04_5p--3p
m84043_231012_201426_s1/126490519/ccs/67_836,+,7,9,43,769,asa04_5p--3p
m84043_231012_201426_s1/60360594/ccs/4737_5506,+,22,51,48,769,asa04_5p--3p
m84043_231012_201426_s1/229184738/ccs/5806_6560,+,5,9,32,754,asa04_5p--3p
m84043_231012_201426_s1/187438050/ccs/1980_2399,+,1,9,58,419,asa04_5p--3p
m84043_231012_201426_s1/200545278/ccs/7144_7898,+,9,7,54,754,asa04_5p--3p
m84043_231012_201426_s1/146020224/ccs/35_789,+,9,7,26,754,asa04_5p--3p
m84043_231012_201426_s1/146085784/ccs/3011_3430,+,9,6,59,419,asa04_5p--3p
m84043_231012_201426_s1/163125161/ccs/2731_3045,+,7,9,35,314,asa04_5p--3p
m84043_231012_201426_s1/153490061/ccs/1383_2137,+,7,9,32,754,asa04_5p--3p
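For anyone wanting to use this report directly, a minimal sketch of reading the polyA length per read with Python's standard `csv` module is shown here. The column names come from the header line quoted above and the sample rows are copied from this comment; the function name is hypothetical, and whether these `id` values match the read names in the BAM file is an assumption that would need checking.

```python
import csv
import io

# First three rows of the refine report quoted in this comment.
SAMPLE = """id,strand,fivelen,threelen,polyAlen,insertlen,primer
m84043_231012_201426_s1/160961840/ccs/2694_3482,+,9,7,33,788,asa04_5p--3p
m84043_231012_201426_s1/126490519/ccs/67_836,+,7,9,43,769,asa04_5p--3p
m84043_231012_201426_s1/60360594/ccs/4737_5506,+,22,51,48,769,asa04_5p--3p
"""


def load_polya_lengths(handle):
    """Map read id -> polyA length (int) from a refine-report CSV handle."""
    reader = csv.DictReader(handle)
    return {row["id"]: int(row["polyAlen"]) for row in reader}


polya_by_read = load_polya_lengths(io.StringIO(SAMPLE))
reads_with_tail = [rid for rid, n in polya_by_read.items() if n > 0]
```

For the real file, the same function could be called on `open("T1.r1.mini.read.refine.report.csv", newline="")`. In these sample rows every read has a non-zero `polyAlen`, consistent with the tails having been present before trimming.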

Sarah


andrewprzh commented 6 months ago

I'll close this issue for now, as the original problem should now be solved in IsoQuant 3.4.

Implementing CCS headers is on the roadmap for the next release.

ablab / IsoQuant

BUG: Isoquant is missing 1 and 2-exon genes from PacBio RNA-seq data #128
