BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA

Exon with negative length #356

Open IceFreez3r opened 2 months ago

IceFreez3r commented 2 months ago

Copy and paste the exact command you tried to run

flair collapse --query results/flair/correct/lung_all_corrected.bed --genome resources/reference.fa --reads /project/hfa_work/ENCODE/data/reads/ENCFF552NVU.fastq.gz /project/hfa_work/ENCODE/data/reads/ENCFF934MBW.fastq.gz /project/hfa_work/ENCODE/data/reads/ENCFF341BSQ.fastq.gz /project/hfa_work/ENCODE/data/reads/ENCFF250IWT.fastq.gz --gtf resources/annotation.gtf --threads {threads} --output results/flair/collapse/lung

where reference.fa and annotation.gtf are both from GENCODE release v46. The data files are publicly available on ENCODE (ENCODE cart).

How did you install Flair?

Bioconda with Snakemake 8.16; the environment contains only FLAIR:

name: flair
channels:
  - bioconda
  - conda-forge
dependencies:
  - flair

What happened?

The output GTF file caused an error when I tried to index it with tabix after sorting and compressing it. It turns out the GTF contains exons with length 0 and -1:

chr17   FLAIR   transcript  82442644    82449291    .   -   .   gene_id "ENSG00000178927.19"; transcript_id "m54284U_200415_060704/157550641/ccs";
chr17   FLAIR   exon    82442644    82444124    .   -   .   gene_id "ENSG00000178927.19"; transcript_id "m54284U_200415_060704/157550641/ccs"; exon_number "0";
chr17   FLAIR   exon    82444447    82444591    .   -   .   gene_id "ENSG00000178927.19"; transcript_id "m54284U_200415_060704/157550641/ccs"; exon_number "1";
chr17   FLAIR   exon    82445864    82445863    .   -   .   gene_id "ENSG00000178927.19"; transcript_id "m54284U_200415_060704/157550641/ccs"; exon_number "2";
chr17   FLAIR   exon    82449293    82449291    .   -   .   gene_id "ENSG00000178927.19"; transcript_id "m54284U_200415_060704/157550641/ccs"; exon_number "3";

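The start coordinates of the last two exons are larger than their end coordinates. Since GTF coordinates are 1-based and inclusive, exon 2 has length 0 and exon 3 has length -1.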
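For anyone who wants to check their own output, here is a minimal sketch of how such records can be spotted. It is not part of FLAIR; the function name `find_inverted_features` is my own, and it assumes a standard 9-column GTF with 1-based inclusive coordinates, so that length = end - start + 1.

```python
from typing import Iterable, List, Tuple


def find_inverted_features(
    lines: Iterable[str],
) -> List[Tuple[int, str, int, int, int]]:
    """Return (line_number, feature, start, end, length) for every GTF record
    whose length is below 1, i.e. whose start is larger than its end.

    Hypothetical helper, not part of FLAIR. Assumes 1-based inclusive
    coordinates, so length = end - start + 1.
    """
    bad = []
    for line_number, line in enumerate(lines, start=1):
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Split on any whitespace, but keep the attribute column (9th) intact.
        fields = line.rstrip("\n").split(None, 8)
        if len(fields) < 5:
            continue
        try:
            start, end = int(fields[3]), int(fields[4])
        except ValueError:
            # Coordinates are not integers; not a record this check can judge.
            continue
        length = end - start + 1
        if length < 1:
            bad.append((line_number, fields[2], start, end, length))
    return bad
```

Passing an open file handle of the collapsed GTF to `find_inverted_features` should return the offending records together with their line numbers, which makes it easy to filter or inspect them before sorting and indexing.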

The follow-up commands that revealed the error (Snakemake syntax, but it should be intuitive to understand):

Sorting and compression

(grep -v "^#" {input} | sort -k1,1 -k4,4n | bgzip -c > {output}) > {log} 2>&1

Tabix

tabix -p gff {input} > {log} 2>&1

What else do we need to know?

I ran the same analysis on reads from five other tissues and had no issues there.

IceFreez3r commented 1 week ago

After fixing the exons to be at least length 1, I found 0-length exons in 4 of the other tools.