FenyoLab / L1EM

Estimate locus specific human LINE-1 expression.
Other
32 stars 9 forks source link

Error: bwa_seq_open] fail to open file '*.fq1.*' : No such file or directory #15

Open alfonsosaera opened 1 month ago

alfonsosaera commented 1 month ago

Hi,

Thank you for developing this tool for evaluating L1 expression. I'm encountering an issue while running L1EM from a conda environment. Here is my setup and the steps I followed:

Environment and Command

I activated the conda environment and executed the following commands:

conda activate L1EM
GENOMEpath='/L1EM_results/reference_genome'
bash generate_L1EM_fasta_and_index.sh ${GENOMEpath}/hg38.fa

BAMpath='/Results/star'
L1EMpath='/L1EM'

cd /L1EM_results/L1_quantification/Cyto_100 # new folder for this BAM file
bash -e ${L1EMpath}/run_L1EM.sh ${BAMpath}/Cyto_100_trimmed_Aligned.sortedByCoord.out.bam ${L1EMpath} ${GENOMEpath}/hg38.fa

Issue

The process starts but fails with the following error:

STEP 1: realign
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 0 reads
[main] Version: 0.7.17-r1188
[main] CMD: /mnt/disks/mapping_quant/miniconda3/envs/L1EM/bin/bwa aln -k 3 -n 3 -t 16 -i 20 /mnt/disks/mapping_quant/TASK1_XD446-146/L1EM_results/reference_genome/hg38.fa unaligned.fq1
[main] Real time: 1.288 sec; CPU: 1.287 sec
[main] Version: 0.7.17-r1188
[main] CMD: /mnt/disks/mapping_quant/miniconda3/envs/L1EM/bin/bwa aln -k 3 -n 3 -t 16 -i 20 /mnt/disks/mapping_quant/TASK1_XD446-146/L1EM_results/reference_genome/hg38.fa unaligned.fq2
[main] Real time: 1.368 sec; CPU: 1.253 sec
[main] Version: 0.7.17-r1188
[main] CMD: /mnt/disks/mapping_quant/miniconda3/envs/L1EM/bin/bwa sampe /mnt/disks/mapping_quant/TASK1_XD446-146/L1EM_results/reference_genome/hg38.fa 1.sai 2.sai unaligned.fq1 unaligned.fq2
[main] Real time: 0.001 sec; CPU: 0.001 sec
STEP 2: extract
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 0 reads
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 0 reads
STEP 3: candidate alignments
[bwa_seq_open] fail to open file '*.fq1.*' : No such file or directory

It appears that there is an issue reading the BAM file, as no reads were processed in STEP 1, which is reflected in the log:

[M::bam2fq_mainloop] processed 0 reads

Additional Information

As can be deduced from the name of the BAM file, reads were trimmed before the mapping with STAR.

Here is the output from samtools flagstat for the BAM file, in case it's useful for debugging:

samtools flagstat ${BAMpath}/Cyto_100_trimmed_Aligned.sortedByCoord.out.bam
78473240 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
78473240 + 0 mapped (100.00% : N/A)
78473240 + 0 paired in sequencing
39236620 + 0 read1
39236620 + 0 read2
78473240 + 0 properly paired (100.00% : N/A)
78473240 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Any help or advice would be highly appreciated.

Thanks,

Alfonso

alfonsosaera commented 1 month ago

Hi, I come back with an update.

I tried samtools bam2fq on the same BAM file and it works well.

This is the command

samtools bam2fq -1 paired1.fq -2 paired2.fq -0 read_other.fq -s singleton.fq -n -F 0x900 ${BAMpath}/Cyto_100_trimmed_Aligned.sortedByCoord.out.bam

the terminal messages of samtools

[M::bam2fq_mainloop] discarded 75502250 singletons
[M::bam2fq_mainloop] processed 78473240 reads

and the content of the fastq files, just one example:

head paired1.fq
@LH00169:289:223VVWLT4:3:1234:17039:21760
CAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCTGGTCTCCACACAGTGCTGGTTCCGTC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@LH00169:289:223VVWLT4:3:2305:34267:19840
TGTCATGGAGCCCCCTACGATTCCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGTCTGACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGGTCCTTTCCC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIII

Thanks in advance for your help!

Alfonso

MarkGrivainis commented 1 month ago

Hello,

The first stage of the first step, $samtools view -@ $threads -b -F 2 $bamfile, attempts to extract all the unaligned reads from the bam file.

As all the reads in your bam file are aligned correctly, the unaligned files are empty. 78473240 + 0 properly paired (100.00% : N/A)

L1EM uses the unaligned reads as part of its analysis. Could unaligned reads have been filtered out when your bam file was created?

ekaparos commented 1 month ago

Additionally, L1EM requires paired end read alignments to hg38. If your reads are single end, then you will run into an error in this part of step 1:

$bwa aln -k $realignNM -n $realignNM -t $threads -i $bwa_i $hg38 unaligned.fq1 > 1.sai
$bwa aln -k $realignNM -n $realignNM -t $threads -i $bwa_i $hg38 unaligned.fq2 > 2.sai
alfonsosaera commented 1 month ago

Hi @MarkGrivainis and @ekaparos , thanks for your answers!

I aligned my reads with STAR, how should I proceed to generate a BAM suitable for L1EM? I cannot find this information. My reads are paired end and aligned to the hg38.

Thanks for your help!

Alfonso

MarkGrivainis commented 1 month ago

Hi @alfonsosaera,

I do not have any experience with STAR. However, after looking into the issue, I think STAR discards unmapped reads by default. When aligning the reads, you need to set the --outSAMunmapped Within flag to store them in the resulting SAM file.

alfonsosaera commented 1 month ago

Hi @MarkGrivainis ,

I made it work! Now L1EM analyzed the new BAM with unmapped reads without errors. Thanks for your help!

One of the tree outputs is empty, is this normal?

==> filter_L1HS_FPM.txt <==
family.category.locus.strand    only    3prunon

==> full_counts.txt <==
family.category.locus.strand    only    3prunon passive_sense   passive_antisense   antisense
L1PA3.0.chr1:46504027-46504947.-    0.0 0.0 2.0262262054950715  0.0 0.0
L1HS.0.chr8:29120166-29121108.- 0.0 0.0 0.0 3.9653876043043508  0.0
L1PA6.0.chr3:130315158-130315424.-  0.0 0.0 0.999804886384662   0.0 0.0
L1PA3.0.chr16:49134096-49136751.-   0.0 0.0 1.368642759639711   0.0 0.0
L1HS.0.chr10:13217375-13218700.+    0.0 0.0 0.0 2.2467241734077064  0.0
L1HS.0.chr6:124839973-124841353.-   0.0 0.0 2.1766472804965966  0.0 0.0
L1PA2.0.chr4:161973830-161975754.+  0.0 0.0 0.0 2.3828525735526958  0.0
L1HS.0.chr15:55986934-55988017.-    0.0 0.0 4.341820378554968   0.0 0.0
L1PA3.0.chr9:119133669-119135888.+  0.0 0.0 1.1751923595072473  0.0 0.0

==> l1hs_transcript_counts.txt <==
family.category.locus.strand    only    3prunon
L1HS.1.chr6:24811657-24817706.- 14.25756758085305   0.0
L1HS.1.chr2:4733729-4739760.+   5.326441010932766   0.0
L1HS.1.chr2:129415687-129417400.-   5.184223195295114   0.0
L1HS.1.chr4:98592435-98598463.- 1.3552260988037228  0.0
L1HS.1.chrX:11935296-11941314.- 4.675799369376466   0.0

Thanks for your time and suggestions!

Alfonso

ekaparos commented 1 month ago
  1. full_counts.txt: raw count estimates for each L1HS/L1PA* element with any aligned read pairs
  2. l1hs_transcript_counts.txt: expression estimates for L1HS elements, reported as raw counts
  3. filter_L1HS_FPM.txt: L1HS whose expression is supported by at least 100 read pairs, reported as FPM (read pairs per million properly aligned).

For full_counts.txt each of the five transcript types: only, runon, passive (sense), passive (antisense), antisense are reported.

For l1hs_transcript_counts.txt and filter_L1HS_FPM.txt only proper transcription from L1HS elements start at the 5' UTR is reported.

It is possible that filter_L1HS_FPM.txt is empty because there were not enough read pairs (less than 100) to support normalized “proper” (i.e. “only” + “3prunon”) LINE-1 expression from the loci in the active subfamily (L1HS).

What is your model organism/disease context?