BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA

Unexpected entries in gtf file after flair collapse #305

Closed juanfraitu1 closed 7 months ago

juanfraitu1 commented 8 months ago

I have read the paper (https://doi.org/10.1038/s41467-020-15171-6)
and the manual (https://flair.readthedocs.io/en/latest/) and I still have a question about

I am using flair to analyze transcripts from a high-throughput study of the human brain cortex. The generated fastq files are unusually big (~120 GB). The reads had already been mapped, producing a .bam file, which was converted to .bed with the provided script.

Afterwards, this bed file was passed through flair correct, which left me with a ~5 GB corrected bed file. According to the program's directions, this file would be too big for flair collapse, so I split it into 24 files, one per chromosome (these were much more manageable, and none was larger than a couple of hundred MB). The extra parameters (stringent, check splice, etc.) were necessary to replicate a previous analysis from another group.

The command I used for flair collapse was:

flair collapse -g -q --reads --output chr3 --gtf --annotation_reliant generate --check_splice --stringent

In the end, a gtf file is generated for every chromosome. Here is my issue:

These are the first 3 entries of the chr3 output, for example:

chr1 FLAIR transcript 629062 629433 . + . gene_id "ENSG00000225972"; transcript_id "ENST00000416931";
chr1 FLAIR exon 629062 629433 . + . gene_id "ENSG00000225972"; transcript_id "ENST00000416931"; exon_number "0";
chr1 FLAIR transcript 8786211 8786913 . - . gene_id "ENSG00000224315"; transcript_id "ENST00000428803";

And these are the first entries for another example chromosome (chrX); the same entries are found in the output for all other chromosomes as well:

chr1 FLAIR transcript 629062 629433 . + . gene_id "ENSG00000225972"; transcript_id "ENST00000416931";
chr1 FLAIR exon 629062 629433 . + . gene_id "ENSG00000225972"; transcript_id "ENST00000416931"; exon_number "0";
chr1 FLAIR transcript 8786211 8786913 . - . gene_id "ENSG00000224315"; transcript_id "ENST00000428803";

I am puzzled by this. My command was very explicit in indicating that I only wanted isoforms for one chromosome, so why does each output contain entries from all the other chromosomes, and why are they the same in every case?

Thanks in advance!

Jeltje commented 7 months ago

This may be clear already, but just to be sure: --output is just the base name for your output files; it doesn't mean 'output only chr3'.

The problem is --annotation_reliant generate. This asks Flair to create all transcripts in the input gtf, whether or not they're supported by your reads. So if you want per-chromosome outputs, you'll also need to split the gtf input file. Alternatively, you could create an input fasta with all annotated transcripts for your chromosome of interest and run it with --annotation_reliant chr3.transcripts.fa, but that's probably more cumbersome than splitting the gtf.
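For anyone who wants to script the gtf split, here is a minimal Python sketch. It is not part of flair; the function name split_gtf_by_chromosome and the output layout (one <chrom>.gtf file per sequence name in a directory of your choosing) are assumptions made for illustration. It only relies on the gtf convention that the first tab-separated column is the chromosome name, and it skips comment lines.

```python
import os


def split_gtf_by_chromosome(gtf_path, out_dir):
    """Split a GTF into one file per chromosome (hypothetical helper, not a flair API).

    Returns a dict mapping chromosome name -> path of the file written for it.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = {}            # chromosomes seen so far -> output path
    current_chrom = None  # chromosome of the currently open output file
    out = None
    try:
        with open(gtf_path) as gtf:
            for line in gtf:
                # Skip header/comment lines and blank lines.
                if line.startswith("#") or not line.strip():
                    continue
                # Column 1 of a GTF record is the sequence (chromosome) name.
                chrom = line.split("\t", 1)[0]
                if chrom != current_chrom:
                    if out is not None:
                        out.close()
                    # Truncate on first sight of a chromosome, append if it
                    # reappears later (i.e. the input is not sorted).
                    mode = "a" if chrom in paths else "w"
                    paths.setdefault(chrom, os.path.join(out_dir, f"{chrom}.gtf"))
                    out = open(paths[chrom], mode)
                    current_chrom = chrom
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths


# Example (file names are placeholders):
# paths = split_gtf_by_chromosome("annotation.gtf", "gtf_by_chrom")
# paths["chr3"] is then the annotation to use for the chr3 run.
```

Only one output file is open at a time, so annotations with many scaffolds should not run into open-file limits. Each per-chromosome run would then get the matching per-chromosome gtf in place of the full annotation.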

I agree that this is not ideal behavior. We're working on parallelizing flair collapse.

If this does not answer your question please reopen this ticket.