Closed mpizzagalli777 closed 2 weeks ago
I think the issue here is the formatting of the fastq read names, as you suggested. There are a couple of possible solutions I can think of. One is to go through your fastq files and rename the reads, removing everything up to and including the | symbol (131:1102|ac8f6919-21ba-45b7-b897-0eadf61c47ae -> ac8f6919-21ba-45b7-b897-0eadf61c47ae). After that, you would need to rerun all steps of FLAIR. If that doesn't fix it, you can also try combining your fastq files with cat in the same way you combined your bed files. If neither of these works, please send the head of both fastq files you're using and I can use that to troubleshoot this issue further.
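The renaming step could look something like the sketch below. It is not part of FLAIR; the function names strip_prefix and rename_fastq are made up for illustration. It assumes each fastq record is exactly four lines (no wrapped sequences) and that the read name is the first whitespace-delimited token on the header line. Anything after the name on the header line is kept as-is.

```python
def strip_prefix(header_line):
    """Return a FASTQ header with everything up to and including the first '|' removed
    from the read name. Hypothetical helper, not a FLAIR function."""
    body = header_line.rstrip("\n")[1:]  # drop the leading '@'
    parts = body.split(None, 1)          # read name, then optional description
    if not parts:
        return header_line.rstrip("\n")  # empty header: leave untouched
    name = parts[0].split("|", 1)[-1]    # unchanged if the name has no '|'
    description = " " + parts[1] if len(parts) > 1 else ""
    return "@" + name + description


def rename_fastq(src, dst):
    """Copy FASTQ records from src to dst, rewriting only the header lines.

    Assumes 4-line records, so headers are found by position rather than by
    a leading '@' (quality lines may also start with '@').
    """
    for i, line in enumerate(src):
        if i % 4 == 0:
            line = strip_prefix(line) + "\n"
        dst.write(line)
```

To use it, open the original fastq for reading and a new file for writing, then call rename_fastq(src, dst) on the two handles. Writing to a new file rather than overwriting the original keeps the untouched reads around in case the rename needs to be redone.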
Hi, thank you for the help! This did appear to fix the issue. For the future, should the names always be relatively simple?
Copy and paste the exact command you tried to run
How did you install Flair?
conda env create -f misc/flair_basic_conda_env.yaml
What happened?
What else do we need to know?
I ran the following, as suggested in #304:
/users/mpizzaga/.conda/envs/flair_basic_conda_env/bin/python /oscar/home/mpizzaga/.conda/envs/flair_basic_conda_env/lib/python3.6/site-packages/flair/subset_unassigned_reads.py /users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_combined_corrected.annotated_transcripts.isoform.read.map.txt /users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_combined_corrected.bed 3.0 /users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_combined_corrected.unassigned.bed /users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_dirmRNAseq_basecall_2_py.fastq /users/mpizzaga/data/mpizzaga/Fusion_Transcript_Analysis/GB2/ONT/alignments/GB2_p22_cDNA_basecall_py.fastq > debugging.txt
This was the only output into the terminal:
405548 names do not match any names in fastq file(s)e.g. 55fa8982-f3bf-41c7-abcb-dcee67b84017 in bed but not in fastq
And this is what debugging contains (I cut down the sequences for space):
tail debugging.txt
This is the structure of the GB2_combined_corrected.bed file:
head GB2_combined_corrected.bed
chr1 16336 18023 131:1102|ac8f6919-21ba-45b7-b897-0eadf61c47ae 10 - 16336 18023 217,95,2 4 429,198,137,109, 0,521,1269,1578,
chr1 16441 24861 146:1252|2d2c9f8d-45a3-4830-afd6-8ddd33e58a8a 13 - 16441 24861 217,95,2 6 324,198,137,147,99,124, 0,416,1164,1473,1826,8296,
chr1 16445 24891 139:1232|04a2ed7e-0750-46ba-8b8a-cf24886b6399 1 - 16445 24891 217,95,2 6 320,198,137,147,99,154, 0,412,1160,1469,1822,8292,
chr1 184919 188899 135:1725|f6e873fb-e34e-420e-ac17-474a2bdd1965 60 - 184919 188899 217,95,2 9 431,69,153,159,202,136,137,146,109, 0,571,1397,2209,2456,2835,3210,3519,3871,
chr1 184919 195413 122:1873|36db1ce5-0633-46a7-a3d2-51b8324c08b1 18 - 184919 195413 217,95,2 10 431,69,153,159,202,136,137,146,112,151, 0,571,1397,2209,2456,2835,3210,3519,3871,10343,
A possibly related question is why the reads have such strange names. I performed base calling using Dorado with --emit-fastq, and since FLAIR uses the read names as isoform names, the isoform names have become very clunky. Is there a way around this?
Since #221 strongly recommends using --annotation_reliant generate, I would like to keep using it. However, I tried running collapse once without it, and collapse ran successfully.
For context, I am trying to generate a transcriptome that includes novel transcripts as a way to identify fusion transcripts in a cancer cell line.