FenyoLab / L1EM

Estimate locus specific human LINE-1 expression.

Error running L1EM #4

Closed hchetia closed 2 years ago

hchetia commented 3 years ago

Hi, I am following the steps to run L1EM exactly as suggested in your manual (providing absolute paths for all data).

THE CMD:

for i in *bam; do bash /L1EM/run_L1EM.sh /l1em/"${i}" /L1EM /l1em/L1EM_hg38_reference/GRCh38.primary_assembly.genome.fa; done;

I get the following error.

THE ERROR LOG:

STEP 1: realign
[E::hts_open_format] Failed to open file 4239.bam
samtools view: failed to open "4239.bam" for reading: No such file or directory
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 0 reads
[main] Version: 0.7.17-r1188
[main] CMD: L1EM/bin/bwa aln -k 3 -n 3 -t 16 -i 20 /l1em/L1EM_hg38_reference/GRCh38.primary_assembly.genome.fa unaligned.fq1
[main] Real time: 2.559 sec; CPU: 2.467 sec
[main] Version: 0.7.17-r1188
[main] CMD: L1EM/bin/bwa aln -k 3 -n 3 -t 16 -i 20 /l1em/L1EM_hg38_reference/GRCh38.primary_assembly.genome.fa unaligned.fq2
[main] Real time: 2.678 sec; CPU: 2.671 sec
[main] Version: 0.7.17-r1188
[main] CMD: L1EM/bin/bwa sampe /l1em/L1EM_hg38_reference/GRCh38.primary_assembly.genome.fa 1.sai 2.sai unaligned.fq1 unaligned.fq2
[main] Real time: 0.003 sec; CPU: 0.002 sec

STEP 2: extract
[E::hts_open_format] Failed to open file 4239.bam
Traceback (most recent call last):
  File "/L1EM/utilities/read_or_pair_overlap_bed.py", line 73, in <module>
    main()
  File "/L1EM/utilities/read_or_pair_overlap_bed.py", line 39, in main
    inbam = pysam.AlignmentFile(bamfile,'rb')
  File "pysam/libcalignmentfile.pyx", line 734, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 933, in pysam.libcalignmentfile.AlignmentFile._open
IOError: [Errno 2] could not open alignment file 4239.bam: No such file or directory
[E::hts_open_format] Failed to open file temp.bam
samtools sort: can't open "temp.bam": No such file or directory
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 0 reads
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 0 reads

STEP 3: candidate alignments
[bwa_aln] fail to locate the index
[bwa_aln] fail to locate the index
[bwa_sai2sam_pe] fail to locate the index

STEP 4: G(R) matrix construction
[E::hts_open_format] Failed to open file 4239.bam
Traceback (most recent call last):
  File "/L1EM/CGC/median_template_and_pairs.py", line 34, in <module>
    for read in pysam.AlignmentFile(bamfile):
  File "pysam/libcalignmentfile.pyx", line 734, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 933, in pysam.libcalignmentfile.AlignmentFile._open
IOError: [Errno 2] could not open alignment file 4239.bam: No such file or directory
usage: G_of_R.py [-h] -b BAMFILE [-e ERROR_PROB] [-m MAX_START2START_LEN]
                 [-r READS_PER_PICKLE] [-p PREFIX] [-n NMDIFF] -i INSERT_MEAN
                 [--flanking FLANKING] [--as_start AS_START] [-w WIGGLE]
                 [--min_len MIN_LEN] [--min_exon_len MIN_EXON_LEN]
G_of_R.py: error: argument -i/--insert_mean: expected one argument

STEP 5: Expectation maximization
ls: cannot access ../G_of_R/pk2: No such file or directory
ls: cannot access ../G_of_R/TE_list.txt: No such file or directory
cp: missing destination file operand after ‘TE_list.txt’
Try 'cp --help' for more information.
Traceback (most recent call last):
  File "/L1EM/L1EM/L1EM.py", line 168, in <module>
    main()
  File "/L1EM/L1EM.py", line 117, in main
    for name in open(TE_list):
IOError: [Errno 2] No such file or directory: 'TE_list.txt'

STEP 6: Writing results
Traceback (most recent call last):
  File "/L1EM/utilities/report_l1_exp_counts.py", line 29, in <module>
    X_est = dict(zip(pickle.load(open('names_final.pkl')),pickle.load(open('X_final.pkl'))))
IOError: [Errno 2] No such file or directory: 'names_final.pkl'
Traceback (most recent call last):
  File "/L1EM/utilities/report_l1hs_transcription.py", line 29, in <module>
    X_est = dict(zip(pickle.load(open('names_final.pkl')),pickle.load(open('X_final.pkl'))))
IOError: [Errno 2] No such file or directory: 'names_final.pkl'
Traceback (most recent call last):
  File "/L1EM/utilities/filtered_and_normalized_l1hs.py", line 25, in <module>
    X_est = dict(zip(pickle.load(open(sys.argv[1])),pickle.load(open(sys.argv[2]))))
IOError: [Errno 2] No such file or directory: 'names_final.pkl'

STEP 7: Clean up
cp: cannot stat ‘*final.pkl’: No such file or directory

wmckerrow commented 3 years ago

I'm actually a little confused about why this is happening. It seems that despite your specifying the full path, only the bam file name made it into the shell script. If you specify just one bam with an absolute path, rather than using the for loop, do you get the same result? If you're using data that is not strand specific, you should use the run_L1EM_unstranded.sh script instead. It also looks like your for loop will run L1EM for each bam from the same directory. Each run should be in a different directory; otherwise you will overwrite the results of the previous run with the next one. However, neither of these issues could cause the error you're seeing.
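The one-directory-per-run pattern could be sketched like this (a sketch only, using the placeholder paths from this thread; the `echo` stands in for the real invocation, so this prints the commands rather than executing them):

```shell
# Sketch: one fresh output directory per bam so runs never overwrite
# each other. Replace "echo bash" with "bash" to actually run L1EM.
run_per_bam_dirs() {
    local bamdir=$1 l1em=$2 ref=$3 bam sample
    for bam in "${bamdir}"/*.bam; do
        sample=$(basename "${bam}" .bam)
        mkdir -p "${sample}"                 # per-sample output directory
        (
            cd "${sample}"                   # subshell: cd does not leak
            echo bash "${l1em}/run_L1EM.sh" "${bam}" "${l1em}" "${ref}"
        )
    done
}
# Example: run_per_bam_dirs /l1em /L1EM /l1em/L1EM_hg38_reference/GRCh38.primary_assembly.genome.fa
```

Because the loop passes the bam path it globbed (already absolute when `bamdir` is absolute), the "only the file name made it in" failure mode above is avoided.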

hchetia commented 3 years ago

@wmckerrow Direct runs worked all the way to the end. But it would be much easier to be able to drive it from a shell script over many bams. Is there a way to modify the tool to let users specify an output directory in which to create the files for each bam?

I am using stranded data at the moment, and I was in fact using an additional "mkdir & cd" step per bam file in the loop.

wmckerrow commented 3 years ago

Can you add an echo between the do and the bash in your for loop, and then let me know what the exact commands are that are being executed?

I haven't tried exactly what you're doing, but I have used a for loop to write the L1EM commands into a bunch of shell scripts that I then run. I know that works.
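That write-scripts-then-run approach might be sketched as follows (the `run_<sample>.sh` naming and `set -e` are illustrative choices, not part of L1EM):

```shell
# Writes one small run script per bam; inspect them, then bash or submit
# each one. Script names like run_<sample>.sh are illustrative.
write_run_scripts() {
    local bamdir=$1 l1em=$2 ref=$3 bam sample
    for bam in "${bamdir}"/*.bam; do
        sample=$(basename "${bam}" .bam)
        {
            echo '#!/bin/bash'
            echo "set -e"                               # stop at first error
            echo "mkdir -p ${sample} && cd ${sample}"   # per-sample directory
            echo "bash ${l1em}/run_L1EM.sh ${bam} ${l1em} ${ref}"
        } > "run_${sample}.sh"
    done
}
```

Generating the scripts first also gives you exactly what the echo-debugging step above asks for: the literal commands that will be executed.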

hchetia commented 3 years ago

Hi @wmckerrow, before I run the for loop with echo, I'd like to share the log of the successful run, which has some warnings. Could you tell me if the run was really successful? File attached.

nohup.txt

wmckerrow commented 3 years ago

It looks like it failed to find the hg38 index in step 1. I'd recommend running with the bash -e option (i.e. bash -e run_L1EM.sh etc) so that it stops as soon as it hits the first error. To diagnose what went wrong I'd need the command you entered as well as the results of:

ls /fullpathto/bamfile.bam
ls /fullpathto/hg38.fa

hchetia commented 3 years ago

Thanks for the suggestion. Sharing the ls results below.

ls /fullpathto/bamfile.bam*
Sorted_ID_4239*.bam
Sorted_ID_4239*.bam.bai
4239 (the working directory)

ls /fullpathto/L1EM_hg38_reference/hg38.fa*
L1EM_hg38_reference/hg38.fa
L1EM_hg38_reference/hg38.fa.ann
L1EM_hg38_reference/hg38.fa.gz
L1EM_hg38_reference/hg38.fa.sa
L1EM_hg38_reference/hg38.fa.amb
L1EM_hg38_reference/hg38.fa.bwt
L1EM_hg38_reference/hg38.fa.pac

wmckerrow commented 3 years ago

Is there a "*" in the bam file name? Because "*" is recognized by the shell as a wildcard, including it in the bam file name could be causing errors.
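A minimal demonstration of why a literal * in a filename is risky: unquoted, the shell expands it against whatever files happen to be in the current directory, while quoting preserves the literal string.

```shell
# Demonstrates glob expansion of the same string, quoted vs unquoted.
# Run inside a directory containing files such as file_1 and file_2.
show_glob() {
    local pattern='file_*'
    printf 'unquoted: %s\n' $pattern     # expands to matching files
    printf 'quoted:   %s\n' "$pattern"   # stays the literal string file_*
}
```

This is also why paths passed to scripts like run_L1EM.sh are safest when quoted, e.g. "${i}".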

hchetia commented 3 years ago

No, there is no * being used in the filename or in the command. I just put it here because the real name is long.

wmckerrow commented 3 years ago

Can you try it again with this shell script? I added some pre-checks: bash -e run_L1EM_diagnostic.txt etc

run_L1EM_diagnostic.txt

hchetia commented 3 years ago

Hi @wmckerrow, thanks for the diagnostic script. I just figured out that I was making an error on my end in specifying the genome filename. Now the index-related error has stopped. Here's the log of the run, which finished with bash -e. But the three counts files only have the headers and no other information.

PFA. nohup.txt

Best, Hasna

wmckerrow commented 3 years ago

The error you are seeing in step 4 means that at least one of the bam files created in step 3 was empty. Most often this occurs when the system runs out of memory and silently kills one of the processes. You can try the more memory efficient run_L1EM_withlessmemory.sh script and/or run with more memory if that's an option.
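A quick, hedged sanity check for this failure mode is to flag zero-length intermediate files; if samtools is available, samtools view -c file.bam gives the definitive read count, whereas this sketch only checks on-disk size:

```shell
# Flags files that are empty on disk; an intermediate bam truncated by an
# out-of-memory kill will often show up here. (samtools view -c is the
# more precise check when samtools is installed.)
find_empty_files() {
    local f
    for f in "$@"; do
        [ -s "$f" ] || printf 'EMPTY: %s\n' "$f"
    done
}
# Example: find_empty_files /path/to/run_dir/*.bam
```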

hchetia commented 3 years ago

Hi @wmckerrow, I am running the program on a cluster with plenty of memory (500 GB). How should I specify a higher memory limit for the program?

wmckerrow commented 3 years ago

How you specify memory for a job would depend on your cluster setup. 500GB should be more than enough memory. You can try the run_L1EM_withlessmemory.sh script that I uploaded recently; it sometimes helps with similar errors.

hchetia commented 3 years ago

Hi, I will try a run with that script and report back.

hchetia commented 3 years ago

Hi @wmckerrow,

I reran with the less-memory script, and it definitely got further than the older script did. However, I am not sure whether the program ended successfully or not. In particular, I see one error about a paired-end read set with different read names. I have attached the log of the run here. nohup1.txt Please advise. -Hasna