Issue with flair collapse subprocess subset_unassigned_reads.py

mpizzagalli777 commented 1 month ago

Copy and paste the exact command you tried to run

conda activate pychopper_env

pychopper -k PCS111 -t 20 $lr_dir/${sample}.fastq $lr_dir/${sample}_py.fastq

conda deactivate

# Activate flair environment and run flair align and correct
conda activate flair_basic_conda_env

flair align -g $genome/GRCh38.p14.genome.fa \
        -r $lr_dir/${sample}_py.fastq \
        --output $lr_dir/${sample}_py_flair --threads 20

flair correct -q $lr_dir/${sample}_py_flair.bed \
        -g $genome/GRCh38.p14.genome.fa \
        --gtf $genome/gencode.v44.annotation.gtf \
        --output $lr_dir/${sample}_py_flair --threads 20

flair align -g $genome/GRCh38.p14.genome.fa \
    -r $lr_dir/GB2_dirmRNAseq_basecall_2.fastq \
    --nvrna --output $lr_dir/GB2_dirmRNAseq_basecall_2_py_flair --threads 20

flair correct -q $lr_dir/GB2_dirmRNAseq_basecall_2_py_flair.bed \
    -g $genome/GRCh38.p14.genome.fa \
    --gtf $genome/gencode.v44.annotation.gtf \
    --nvrna --output $lr_dir/GB2_dirmRNAseq_basecall_2_py_flair --threads 20

conda deactivate

cat $lr_dir/GB2_p22_cDNA_basecall_py_flair_all_corrected.bed $lr_dir/GB2_dirmRNAseq_basecall_2_py_flair_all_corrected.bed > $lr_dir/GB2_combined_corrected.bed

echo "Starting flair collapse of the genome"

conda activate flair_basic_conda_env

flair collapse -g $genome/GRCh38.p14.genome.fa \
    --gtf $genome/gencode.v44.annotation.gtf \
    -q $lr_dir/GB2_combined_corrected.bed \
    -r $lr_dir/GB2_dirmRNAseq_basecall_2_py.fastq,$lr_dir/GB2_p22_cDNA_basecall_py.fastq \
    --stringent --check_splice --generate_map --annotation_reliant generate --threads 20 \
    --output $lr_dir/GB2_combined_corrected_collapse

conda deactivate

How did you install Flair? conda env create -f misc/flair_basic_conda_env.yaml

What happened?

Starting flair collapse of the genome
405548 names do not match any names in fastq file(s)e.g. 34ccd2ec-c709-470b-a060-f397cfbadff0 in bed but not in fastq
Writing temporary files to /tmp/tmp8ltiq19n/    
Making transcript fasta using annotated gtf and genome sequence
Aligning reads to reference transcripts
Counting supporting reads for annotated transcripts
Setting up unassigned reads for flair-collapse novel isoform detection
Traceback (most recent call last):
  File "/users/mpizzaga/.conda/envs/flair_basic_conda_env/bin/flair", line 8, in <module>
    sys.exit(main())
  File "/users/mpizzaga/.conda/envs/flair_basic_conda_env/lib/python3.6/site-packages/flair/flair.py", line 1035, in main
    status = collapse()
  File "/users/mpizzaga/.conda/envs/flair_basic_conda_env/lib/python3.6/site-packages/flair/flair.py", line 561, in collapse
    precollapse, str(min_reads), args.o+'unassigned.bed']+args.r, stdout=open(subset_reads, 'w'))
  File "/users/mpizzaga/.conda/envs/flair_basic_conda_env/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/users/mpizzaga/.conda/envs/flair_basic_conda_env/bin/python', '/oscar/home/mpizzaga/.conda/envs/flair_basic_conda_env/lib/python3.6/site-packages/flair/subset_unassigned_reads.py', '/users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_combined_corrected.annotated_transcripts.isoform.read.map.txt', '/users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_combined_corrected.bed', '3.0', '/users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_combined_corrected.unassigned.bed', '/users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_dirmRNAseq_basecall_2_py.fastq', '/users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_p22_cDNA_basecall_py.fastq']' returned non-zero exit status 1.

We know it's ugly but we promise it helps us solve problems faster.

What else do we need to know?

I ran the following as was suggested in #304 /users/mpizzaga/.conda/envs/flair_basic_conda_env/bin/python /oscar/home/mpizzaga/.conda/envs/flair_basic_conda_env/lib/python3.6/site-packages/flair/subset_unassigned_reads.py /users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_combined_corrected.annotated_transcripts.isoform.read.map.txt /users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_combined_corrected.bed 3.0 /users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_combined_corrected.unassigned.bed /users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_dirmRNAseq_basecall_2_py.fastq /users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_p22_cDNA_basecall_py.fastq > debugging.txt

This was the only output into the terminal:

405548 names do not match any names in fastq file(s)e.g. 55fa8982-f3bf-41c7-abcb-dcee67b84017 in bed but not in fastq

And this is what debugging contains (I cut down the sequences for space):

tail debugging.txt

113:681|fcce66d5-9112-4385-a7b6-646cb80e191f AGCACCATGGACAACCCCACCACCACCCAGTATGCCAGCCTCATGCACAG... 127:636|ff5a1e0b-2bf4-4e61-872b-a2960cf314ed TTGGGATCCGGCAAGATGGCAGAAGTAGAGCAGAAGAAGAAGCGGA... 136:1831|fd2f70a2-435e-4a62-94c5-9c7382b2a2a9 GGGCGGCGTCGCCAGGAGGAGCGCGCGGGCACAGGGTGCCGCT....

This is the structure of the GB2_combined_corrected.bed file:

head GB2_combined_corrected.bed chr1 16336 18023 131:1102|ac8f6919-21ba-45b7-b897-0eadf61c47ae 10 - 16336 18023 217,95,2 4 429,198,137,109, 0,521,1269,1578, chr1 16441 24861 146:1252|2d2c9f8d-45a3-4830-afd6-8ddd33e58a8a 13 - 16441 24861 217,95,2 6 324,198,137,147,99,124, 0,416,1164,1473,1826,8296, chr1 16445 24891 139:1232|04a2ed7e-0750-46ba-8b8a-cf24886b6399 1 - 16445 24891 217,95,2 6 320,198,137,147,99,154, 0,412,1160,1469,1822,8292, chr1 184919 188899 135:1725|f6e873fb-e34e-420e-ac17-474a2bdd1965 60 - 184919 188899 217,95,2 9 431,69,153,159,202,136,137,146,109, 0,571,1397,2209,2456,2835,3210,3519,3871, chr1 184919 195413 122:1873|36db1ce5-0633-46a7-a3d2-51b8324c08b1 18 - 184919 195413 217,95,2 10 431,69,153,159,202,136,137,146,112,151, 0,571,1397,2209,2456,2835,3210,3519,3871,10343,

A possibly related question I have is why the reads seem to use such strange names. I performed base calling using Dorado --emit-fastq and since flair uses these as the isoform name, they have become very clunky. Is there a way around this?

Since in #221 it is said to be highly suggested to use --annotation_reliant generate, I would like to continue to use this but I tried running it once without it and collapse was able to run successfully.

For context, I am trying to generate a transcriptome that includes novel transcripts as a way to identify fusion transcripts in a cancer cell line

cafelton commented 4 weeks ago

I think the issue here is the formatting of the fastq read names, as you suggested. There's a couple possible solutions I can think of - one is to go through your fastq files and rename the reads, removing the portion in front of the | symbol (131:1102|ac8f6919-21ba-45b7-b897-0eadf61c47ae -> ac8f6919-21ba-45b7-b897-0eadf61c47ae). After that, you would need to rerun all steps of FLAIR. If that doesn't fix it, you can also try combining your fastq files with cat in the same way you combined your bed files. If neither of these work, please send the head of both fastq files you're using and I can use that to troubleshoot this issue further.

mpizzagalli777 commented 2 weeks ago

Hi, thank you for the help! This did appear to fix the issue. For the future, should the names always be relatively simple?

BrooksLabUCSC / flair

Issue with flair collapse subprocess subset_unassigned_reads.py #375