Truncated, non-full length reads

vicasze commented 2 years ago

Hi,

I have two direct RNA ONT libraries, and my question is about reads that are not full length. Because the library protocol starts with a dT oligo and many reads are not long enough to cover the whole transcript, I see an accumulation of reads at the 3' end of RNAs. I am wondering how these truncated reads are treated.

After running flair collapse, I get many transcripts derived from these non-full-length reads, and I believe not all of them are new isoforms. I have tried several flair collapse parameters to remove them (--stringent, --filter nosubset, and -n with best_only and longest), but a lot of "incomplete new isoforms" still appear (see the FN1 example below).

Why are these isoforms being kept after collapse? Could you maybe help me with the right parameters to remove these truncated isoforms and get a confident set of isoforms?

[Image: igv_snapshot]

These are the commands I used after alignment:

    python flair.py correct \
        -q mapping/$file.bed12 \
        -g $FASTA \
        -f $GTF \
        -t 10 \
        -j H0001_SJ_filt.tab \
        -o flair/$file.sj

    python flair.py collapse \
        -q flair/all_corrected.sj.merged.bed \
        -g $FASTA \
        -r $control,$condition \
        -f $GTF \
        -o flair/collapse

Thank you very much

Jeltje commented 2 years ago

The Gene track looks as if this particular gene has quite a few existing isoforms. Flair collapse should only give you truncated isoforms if they cannot be matched with one of the longer ones.

Would you be able to share this particular region of the collapse output? I'd like to have a closer look at it. You can email me at jeltje@soe.ucsc.edu

Jeltje commented 2 years ago

Flair used to do poor error checking of input files, something we fixed in the last commit but haven't put in a release yet. This means that -r $control,$condition wouldn't throw an error, but it would skip those files altogether and still give you output.

Can you run flair collapse with space-separated inputs (-r $control $condition) and let us know what happens?
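To illustrate why the separator can matter, here is a minimal sketch of how a Python argument parser treats the two forms. It assumes the reads option is declared as a whitespace-separated list (argparse `nargs="+"`); that declaration and the file names are assumptions made for illustration, not taken from the flair source.

```python
import argparse

# Hypothetical stand-in parser; only a -r/--reads option is modeled here.
parser = argparse.ArgumentParser()
parser.add_argument("-r", "--reads", nargs="+")

# Comma-separated: the shell passes a single token, so the parser sees one "file".
comma = parser.parse_args(["-r", "control.fq,condition.fq"]).reads

# Space-separated: the shell passes two tokens, so the parser sees two files.
space = parser.parse_args(["-r", "control.fq", "condition.fq"]).reads

print(comma)
print(space)
```

With the comma form the parser receives one string that names no existing file, which is the kind of input that weak file checking could silently skip.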

vicasze commented 2 years ago

I get the same output when running it with a space.

Jeltje commented 2 years ago

The long explanation is below, but TL;DR: try release 1.6.2 and use flair.py collapse --filter nosubset. If this fixes your problem, please close this ticket; otherwise let me know.

Flair collapse calls a subprogram called filter_collapsed_isoforms.py, which is run after collapsing and before the final realignment step. An isoform that is a subset of another isoform is filtered out unless the shorter isoform is more abundant than the average expression of its superset isoforms, in which case it is kept. When there are a lot of isoforms, many will have low expression, which drives down the average. That means that relatively more short isoforms pass this filter. The change in flair 1.6.2 is that short isoforms are now only kept if they are more abundant than the most expressed superset.

In my test case this solves the problem.
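The difference between the two rules can be sketched in a few lines of Python. This is only an illustration of the logic as described in this comment, not the code in filter_collapsed_isoforms.py; the function name, the `rule` parameter, and the read counts are all made up.

```python
from statistics import mean

def keep_subset_isoform(subset_count, superset_counts, rule="max"):
    """Decide whether a shorter (subset) isoform survives filtering.

    rule="mean": older behaviour, compare against the average superset count.
    rule="max":  behaviour described for 1.6.2, compare against the top superset.
    """
    if not superset_counts:
        return True  # not a subset of anything, so there is nothing to compare
    if rule == "mean":
        threshold = mean(superset_counts)
    else:
        threshold = max(superset_counts)
    return subset_count > threshold

# Made-up counts: one well-expressed full-length isoform and several rare ones.
supersets = [120, 3, 2, 1, 1, 1]
truncated = 30

kept_by_mean = keep_subset_isoform(truncated, supersets, rule="mean")
kept_by_max = keep_subset_isoform(truncated, supersets, rule="max")
print(kept_by_mean, kept_by_max)
```

In this toy case the rare supersets pull the average down to about 21, so the truncated isoform with 30 reads passes the average rule but fails against the most expressed superset (120).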

Jeltje commented 2 years ago

Additional info: If you do know the correct start positions of your transcripts, it's best to run flair collapse with --promoters, which takes a BED file of start positions. This will remove any reads that are not full length before the collapse step, ensuring the inputs are 5' complete.
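As a rough picture of what such a promoter filter does, the sketch below keeps only reads whose 5' end falls inside a known start region. It is not flair's implementation; the tuple layouts, function names, and coordinates are hypothetical, and intervals are treated as BED-style (0-based, half-open).

```python
def five_prime_position(start, end, strand):
    """Return the 0-based position of the read's 5'-most base."""
    # On the minus strand the 5' end is the last base of the half-open interval.
    return start if strand == "+" else end - 1

def starts_in_promoter(read, promoters):
    """True if the read's 5' end lies inside any promoter interval."""
    chrom, start, end, strand = read
    pos = five_prime_position(start, end, strand)
    return any(c == chrom and s <= pos < e for c, s, e in promoters)

# Hypothetical promoter region and reads: (chrom, start, end[, strand]).
promoters = [("chr2", 1000, 1200)]
reads = [
    ("chr2", 1050, 9000, "+"),  # begins in the promoter: 5' complete
    ("chr2", 5000, 9000, "+"),  # begins mid-transcript: truncated
    ("chr2", 200, 1100, "-"),   # minus strand, 5' end at 1099: 5' complete
]

full_length = [r for r in reads if starts_in_promoter(r, promoters)]
print(full_length)
```

Only the first and third reads survive here; the 3'-biased read that starts mid-transcript is dropped before it could seed a truncated isoform.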

vicasze commented 2 years ago

This solves the issue. Thank you very much for the support and explanation!

BrooksLabUCSC / flair

Truncated, non-full length reads #210