Not full length transcript

yycc9897 commented 6 months ago

Copy and paste the exact command you tried to run

flair align --reads $ld1,$ld2,$ld3,$ld4,$sol1,$sol2,$sol3,$sol4 --genome $ref --output $output_dir --threads 10
flair correct --query $input_dir/01align.bed --shortread $shortread --genome $ref_fa --gtf $gtf --output $output_dir --threads 8
flair collapse -g $ref --gtf $gtf -q $all_bed --reads $fa --stringent --check_splice --filter nosubset  -o $output_dir/merged_collapse --threads 8

How did you install Flair?

bioconda FLAIR v2.0.0

What happened?

[Image: foxo1]

After running the flair collapse step, I checked the BED file in IGV and found many truncated, non-full-length transcripts. What causes this? I tried --stringent and --filter nosubset, but the problem persists. Before running flair, I identified full-length transcripts using pychopper.

What else do we need to know?

Splice sites were extracted from short-read data using STAR.

Jeltje commented 6 months ago

Flair tries to find different isoforms of a gene, so it tends to have trouble with single-exon transcripts that do not overlap exons of known genes. One way to deal with this is to use --annotation_reliant, which restricts Flair to genes present in the input gtf. Another is to raise the minimum number of supporting reads with --support. You have a lot of input files, which suggests there are many reads; the default is 3 reads per isoform, so try increasing that to 10. Lastly, you could simply filter out all single-exon genes. Flair isn't really meant to find novel single-exon genes; these transcripts are reported only to avoid losing information.
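For the last option, a minimal sketch of filtering single-exon entries out of the collapse output. It assumes the isoforms file is in standard BED12 format, where column 10 (blockCount) holds the number of exons. The function name and the file names in the usage comment are hypothetical, so adjust them to your own output prefix.

```shell
# Keep only BED12 rows with more than one block (exon).
# Assumes tab-separated BED12 input; column 10 is blockCount.
filter_multiexon() {
    awk -F'\t' '$10 > 1' "$1"
}

# Usage (hypothetical file names):
#   filter_multiexon merged_collapse.isoforms.bed > merged_collapse.multiexon.bed
```

This only post-filters the BED file; it does not change which isoforms Flair calls, and matching entries in the GTF or FASTA outputs would need to be filtered separately.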

If this doesn't work, please comment again. Otherwise please close this ticket. Thanks for using Flair!

Jeltje commented 6 months ago

Adding: the reported transcripts that overlap the leftmost exon of your multi-exon transcripts are likely on the opposite strand.
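One way to check this is to list the single-exon entries with their strand and compare against the strand of the multi-exon isoforms at the same locus. This is a rough sketch that again assumes BED12 input, where column 6 is the strand and column 10 is blockCount. The function name and the file name in the usage comment are hypothetical.

```shell
# Print chrom, start, end, name and strand for every single-exon BED12 row.
# Assumes tab-separated BED12 input; column 6 is strand, column 10 is blockCount.
single_exon_strands() {
    awk -F'\t' 'BEGIN { OFS = "\t" } $10 == 1 { print $1, $2, $3, $4, $6 }' "$1"
}

# Usage (hypothetical file name):
#   single_exon_strands merged_collapse.isoforms.bed
```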

BrooksLabUCSC / flair

Not full length transcript #315