BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA

ITD detection? #316

Open itslittman opened 6 months ago

itslittman commented 6 months ago

I have a sample with FLT3-ITD and another with a different ITD, and FLAIR doesn't seem to recognize the mutant transcripts as distinct isoforms. I used trust_ends and included the "inconsistent" reads from the FLAIR Correct output. I don't expect FLAIR to recognize the tandem duplications themselves, but I'm surprised it doesn't see the large retained intron and report an isoform unique to the sample.

Jeltje commented 6 months ago

Do you have a bed file with aligned reads so we can have a look?

itslittman commented 6 months ago

@Jeltje I can try to make a BED file with selected ITD-containing reads. I tried running flair collapse on the uncorrected BED file to see if it would pick up the isoform that way, but it gave an error. Also, does the new RNA004 chemistry I'm now using make any difference to which settings I should use?

Jeltje commented 6 months ago

A minimal example input would be great!

I will ask about the chemistry, I'm guessing we haven't tried it ourselves yet.

itslittman commented 6 months ago

@Jeltje I exported FLT3 reads for an ITD sample and a non-ITD sample. FLAIR finds 2 annotated and 40 novel isoforms. Not only are most ITD reads assigned to a normal RefSeq transcript, but some non-ITD reads are also assigned to novel isoforms named after ITD reads. I understand this tool can't analyze the ITD insertions, but it shouldn't be completely blind to the retained intron.

itslittman commented 6 months ago

upload.zip

@Jeltje I included BAMs, uncorrected BEDs, and FASTQs. Only FLT3 reads are included.

Noah

Edit: As for the new kit, I’d imagine it would just affect how splice junctions are corrected? RNA004 with the SUP model is far more accurate than RNA002.

Jeltje commented 6 months ago

Thanks! I ran FLAIR and had a look, but I'm not sure I fully understand the issue. Read-to-genome aligners generally require a start-to-end alignment of the read. As you're aware, they have no way to place an ITD on the reference, so they just skip that part. Converted to BED, an ITD read therefore looks like a regular read, which is why most of them collapse into the known transcript.
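To make the point above concrete, here is a minimal sketch of how an inserted segment disappears when an alignment is reduced to BED-style blocks. It assumes the ITD is reported as an `I` (insertion) operation in the CIGAR string; the coordinates and CIGAR strings are hypothetical, and `cigar_to_blocks` is an illustrative helper, not FLAIR's own conversion code.

```python
import re

# CIGAR operations that advance along the reference (per the SAM specification).
REF_CONSUMING = {"M", "D", "N", "=", "X"}


def cigar_to_blocks(start, cigar):
    """Return reference-space (start, end) blocks; N (intron) operations split blocks."""
    blocks = []
    pos = start
    block_start = start
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op == "N":
            # Close the current exon block and jump over the intron.
            blocks.append((block_start, pos))
            pos += length
            block_start = pos
        elif op in REF_CONSUMING:
            pos += length
        # I, S, H and P consume no reference bases, so they leave no trace here.
    blocks.append((block_start, pos))
    return blocks


# Hypothetical reads: identical except for a 45-base insertion in the first exon.
wild_type = cigar_to_blocks(1000, "100M500N80M")
with_itd = cigar_to_blocks(1000, "60M45I40M500N80M")

print(wild_type)
print(with_itd)
print("identical blocks:", wild_type == with_itd)
```

Under these assumptions both reads produce the same block list, so anything downstream that only sees BED coordinates cannot tell them apart by the insertion alone.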

As for non-ITD reads collapsing into ITD (intron retention) transcripts, I'm not sure how to tell! --keep_intermediate gives me a count of reads per annotation but not which reads merge into which annotations. I'm a developer so I don't use the program as intended and I don't know much about the helper scripts. Can you enlighten me?

This might be easier to discuss with screen sharing, feel free to connect with me at jeltje@soe.ucsc.edu. I really want to get to the bottom of this.

itslittman commented 6 months ago

@Jeltje I don’t expect FLAIR to be able to deal with the actual tandem duplications - that would require mapping insertions back onto the read or training some sort of model. But since that’s beyond my abilities, I’m trying to indirectly look for ITD reads. If you drag each of my BAMs into IGV and go to the FLT3 gene, you’ll see the ITD reads have a retained intron which is not retained in any control reads. This is what happens with ITDs — you not only get insertions, but also a retained intron.

Shouldn’t FLAIR recognize the ITD transcripts as a unique isoform — not because of the insertions, but solely due to the retained intron? I expected to see an isoform with a retained intron that has 0 reads in control and many reads in the disease sample. FLAIR seems to not see the retained intron at all. Is it possible that the insertions are causing FLAIR to deem the splice site untrustworthy and disregard the retained intron? I can email you tomorrow and we can look closer.
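As a rough illustration of the check described here, the sketch below classifies reads by whether a single aligned block covers a given intron. The intron coordinates, read names, and block lists are all hypothetical, and `retains_intron` and `summarize` are illustrative helpers rather than part of FLAIR.

```python
def retains_intron(blocks, intron):
    """True if one aligned block covers the whole intron (no splice gap there)."""
    intron_start, intron_end = intron
    return any(b_start <= intron_start and b_end >= intron_end
               for b_start, b_end in blocks)


def summarize(reads, intron):
    """Return (reads spanning the intron, subset of those that retain it)."""
    spanning = [name for name, blocks in reads.items()
                if blocks[0][0] <= intron[0] and blocks[-1][1] >= intron[1]]
    retained = [name for name in spanning if retains_intron(reads[name], intron)]
    return spanning, retained


# Hypothetical intron and reads, each read given as sorted (start, end) blocks.
INTRON = (1100, 1600)
reads = {
    "wt_read": [(1000, 1100), (1600, 1680)],   # spliced across the intron
    "itd_read": [(1000, 1680)],                # one block through the intron
    "truncated_read": [(1620, 1680)],          # never reaches the intron
}

spanning, retained = summarize(reads, INTRON)
print("spanning:", spanning)
print("retained:", retained)
```

Running the same tally separately on the control and disease BED files would give a per-sample retention count that does not depend on how reads were assigned to isoforms.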

Jeltje commented 6 months ago

Odd, I do see retained introns when I run flair collapse on the ITD reads you sent.

[image: hgt_genome_a3dc_e28c20]

Are you running the latest release (2.0)?

itslittman commented 6 months ago

@Jeltje Ahh okay. My BED files look the same: I see the retained intron unique to the ITD sample (I'm on FLAIR 1.7, but I get the same BED result you do). Why would FLAIR Quantify not pick this up? That's the main issue. I can send you my transcript counts matrix.

EDIT: The Quantify counts matrix shows that in the ITD sample, only 120/649 reads correspond to isoforms that have 0 reads in WT control. The rest of the reads correspond to isoforms shared between the two samples (according to Quantify). This seems to be at odds with the bed file from the Collapse step, which correctly shows the wealth of isoforms unique to the ITD sample.
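A tally like the one in this edit can be sketched as follows. The column layout (an `ids` column followed by one count column per sample) is an assumption about the counts matrix, and the four rows are made-up values chosen only so that the totals match the 120/649 figure quoted above; `unique_to_sample` is an illustrative helper.

```python
import csv
import io

# Hypothetical counts matrix: isoform id, WT count, ITD count (tab-separated).
toy_counts = """ids\tWT\tITD
isoA\t60\t500
isoB\t2\t29
isoC\t0\t90
isoD\t0\t30
"""


def unique_to_sample(tsv_text, control, case):
    """Return (case reads in isoforms with zero control reads, total case reads)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    total = 0.0
    unique = 0.0
    for row in reader:
        case_n = float(row[case])
        control_n = float(row[control])
        total += case_n
        if control_n == 0:
            unique += case_n
    return unique, total


unique, total = unique_to_sample(toy_counts, control="WT", case="ITD")
print(f"{unique:.0f} of {total:.0f} ITD reads are in isoforms absent from WT "
      f"({unique / total:.1%})")
```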

itslittman commented 5 months ago

@Jeltje any updates on this?

Jeltje commented 5 months ago

I'm not sure I understand the problem. When I run

flair_quantify.py -r tmp_manifest.txt -i flair/collapse.isoforms.fa --isoform_bed flair/collapse.isoforms.bed -o test

and sum the counts with

awk '{ sum += $3 } END { print sum}' test.counts.tsv

I get 69 (of 79) WT reads mapped to isoforms and 666 (of 821) ITD reads. Most of the WT reads map to a known isoform (ENST00000241453.12_ENSG00000122025.15), with only one or two reads mapping to new isoforms (which is below FLAIR's cutoff for finding new isoforms). The majority of ITD reads also map to that same isoform because that isoform is still present in that sample. In the ITD sample, however, the new isoforms all have >=3 reads, as expected.

itslittman commented 5 months ago

@Jeltje The result you describe isn't representative of the sample and is the same result I get. I manually counted 355 FLT3-ITD reads in my sample (which means FLAIR Quantify missed 66% of the ITD reads). I think it probably mischaracterized more than that too because some of the minor novel isoforms are also represented in WT. The ITD isoforms should unequivocally have 0 reads in WT.

If you drag the sample into IGV, you'll see that around 50% of intron 14-spanning reads contain ITDs, which makes sense for a sample consisting of 90-100% hemizygous blasts.

I'm of course not counting the ~25% of reads that are truncated and don't span intron 14 (or are short 3'UTR variants). Isn't FLAIR already filtering those out, though? featureCounts shows 1310 FLT3 reads for this sample. FLAIR finds 821 reads, which is roughly what I'd expect for minimally truncated reads. Something seems wrong here. I should be able to use this tool to see differentially retained introns, but I cannot.
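For reference, the percentages in this comment can be checked with a few lines. This assumes the 66% figure compares the 355 manually counted ITD reads with the 120 reads that Quantify assigned to isoforms absent from WT; the thread does not state that derivation explicitly.

```python
# Figures quoted in the thread.
itd_reads_counted_manually = 355   # manual count of FLT3-ITD reads
reads_in_itd_only_isoforms = 120   # reads in isoforms with 0 WT reads (counts matrix)
flair_reads = 821                  # ITD-sample reads seen by FLAIR
featurecounts_reads = 1310         # FLT3 reads reported by featureCounts

# Assumed derivation of the "66% missed" figure (not stated in the thread).
missed_fraction = (itd_reads_counted_manually - reads_in_itd_only_isoforms) / itd_reads_counted_manually

# Share of featureCounts reads that FLAIR kept.
kept_fraction = flair_reads / featurecounts_reads

print(f"ITD reads not in ITD-only isoforms: {missed_fraction:.1%}")
print(f"featureCounts reads kept by FLAIR:  {kept_fraction:.1%}")
print(f"featureCounts reads dropped:        {1 - kept_fraction:.1%}")
```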