BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA
Other
205 stars 71 forks source link

Flair align & correct producing empty results & no error #350

Open jrkirk61 opened 1 month ago

jrkirk61 commented 1 month ago

Copy and paste the exact command you tried to run

First I tried running minimap2, followed by flair correct:

module load minimap/2.28-IGB-gcc-8.2.0
module load SAMtools/1.17-IGB-gcc-8.2.0

minimap2 -ax splice -uf -k14 \
  --junc-bed data/reference/Amel_HAv3.1_virus_allBiotypes_GeneID.bed \
  data/reference/GCF_003254395.2_Amel_HAv3.1_virus_genomic.fna \
  results/concatenated_fastq/2024-05-06/hb4_all.fastq | \
  samtools sort -O bam - > results/minimap/2024-07-12/hb4_all_sorted.bam

samtools index results/minimap/2024-07-12/hb4_all_sorted.bam

module purge
module load flair/2.0.0

bam2Bed12 -i results/minimap/2024-07-12/hb4_all_sorted.bam > \
results/flair/2024-07-12/hb4_all_sorted.bed 

flair correct -q results/flair/2024-07-12/hb4_all_sorted.bed \
-g data/reference/GCF_003254395.2_Amel_HAv3.1_virus_genomic.fna \
-f data/reference/Amel_HAv3.1_virus_allBiotypes_GeneID.gtf \
-o results/flair/2024-07-12/hb4

Then I tried flair align instead of minimap2 directly:

flair align --nvrna -g data/reference/GCF_003254395.2_Amel_HAv3.1_virus_genomic.fna \
-r results/concatenated_fastq/2024-05-06/hb4_all.fastq \
--junction_bed data/reference/Amel_HAv3.1_virus_allBiotypes_GeneID.bed \
--output results/minimap/2024-07-12/hb4_all_sorted_flairtest

How did you install Flair?

  1. bioconda (e.g. conda create -n flair -c conda-forge -c bioconda flair)

What happened?

I attempted the first set of code on 12 different libraries using a job script, so all the code is identical. It ran as expected for 11 of the libraries. For this particular library (hb4) minimap2 produced 200k+ alignments as expected, but then flair correct quit after seconds with an empty output file and no error messages:

Step 1/5: Splitting junctions from GTF by chromosome: 100%|█████████████████████| 28373/28373 [00:00<00:00, 138899.25it/s]
Step 3/5: Preparing annotated junctions to use for correction: 100%|█████████████████████| 68/68 [00:00<00:00, 461.24it/s]
Step 4/5: Preparing reads for correction: 0it [00:00, ?it/s]n:  31%|██████▍              | 21/68 [00:00<00:00, 195.63it/s]
Step 5/5: Correcting Splice Sites: 0it [00:00, ?it/s] ?it/s]
Step 5/5: Correcting Splice Sites: 0it [00:00, ?it/s]

This is rather bizarre to me considering that it ran okay for all other 11 libraries with some output. I tried this multiple times to ensure it wasn't just a coincidence, and every time it produced an empty output file.

When I ran flair align first instead of minimap2, it finished and output a nearly empty BAM file (2774 bytes) that I can't read with samtools. The BED file is empty and there are no errors (see log output below). Now this really doesn't make sense. Why would minimap2 run fine, and flair align not run?

[M::mm_idx_gen::4.549*1.44] collected minimizers
[M::mm_idx_gen::7.715*1.64] sorted minimizers
[M::main::7.715*1.64] loaded/built the index for 184 target sequence(s)
[M::mm_mapopt_update::8.233*1.60] mid_occ = 317
[M::mm_idx_stat] kmer size: 14; skip: 5; is_hpc: 0; #seq: 184
[M::mm_idx_stat::8.581*1.57] distinct minimizers: 26295533 (58.35% are singletons); average occurrences: 2.893; average spacing: 2.962; total length: 225304937
[M::worker_pipeline::22.729*1.77] mapped 207848 sequences
[M::main] Version: 2.28-r1209
[M::main] CMD: minimap2 -ax splice --secondary=no --junc-bed data/reference/Amel_HAv3.1_virus_allBiotypes_GeneID.bed -uf -k14 -t 4 data/reference/GCF_003254395.2_Amel_HAv3.1_virus_genomic.fna results/concatenated_fastq/2024-05-06/hb4_all.fastq
[M::main] Real time: 22.747 sec; CPU: 40.345 sec; Peak RSS: 2.923 GB

What else do we need to know?

What possible reasons could there be for the results I'm seeing? Obviously there must be some difference in this library compared to the others, but it can't be a detrimental difference because minimap2 was able to run an alignment without any problem. But flair align, which utilizes minimap, came back with nothing. And flair correct returned nothing when I provided it minimap's bam file.

jrkirk61 commented 1 month ago

Update: It turns out that all 200k reads did not align to the genome, which was a very unexpected result. It seems that minimap2 saves everything including unaligned reads so it's hard to tell this based on the size of the BAM file alone. On the other hand flair align outputs nothing which is even more confusing. It would be helpful to have a summary at the end of flair align about how many reads were uniquely mapped, multi-mapped, unaligned, etc.