kelly-sovacool opened this issue 3 weeks ago
Wilfried ran into this issue too: snakemake rules succeeded, but the overall slurm job failed.
Retries is already set to 2 for both local and slurm mode 🤔
https://github.com/CCBR/CHARLIE/blob/d8f9cf012ec50e08c68ccf70f769bae255e239a7/charlie#L489
https://github.com/CCBR/CHARLIE/blob/d8f9cf012ec50e08c68ccf70f769bae255e239a7/charlie#L531
It seems snakemake is not honoring it?
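As a quick sanity check that the flag actually reaches snakemake, something like this could be run against the launcher script and the main run log (illustrative commands only; the exact wording snakemake logs for restarts varies by version):
# confirm the launcher forwards --retries to the snakemake command line
grep -n -- '--retries' charlie
# count restart-related messages in the main snakemake log
grep -c -i 'restart' logs/snakemake.log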
@kelly-sovacool ... can you point me to the output folder.. I am looking for the jobinfo.short file.
@kopardev
> can you point me to the output folder.. I am looking for the jobinfo.short file.
Wilfried's is here: /data/CCBR/charlie_test_wil/charlie/
jobby short file: /data/charlie_test_wil/charlie/logs/snakemake.log.jobby.short
> /data/charlie_test_wil/charlie/logs/snakemake.log.jobby.short
- I have access to the /data/CCBR/charlie_test_wil/charlie folder .. but there is no jobby-related file there ... why? @kelly-sovacool
@wilfriedguiblet It appears to me that rule create_hq_bams is failing for both samples after retrying 2 times:
... so looking into the err file shows:
╭─kopardevn at helix in /data/CCBR/charlie_test_wil/charlie/logs 24-10-22 - 22:07:39 - 1710
╰─○ cat /data/CCBR/charlie_test_wil/charlie/logs/'38952183.38954065.create_hq_bams.sample=GI1_N.err'
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954
Select jobs to execute...
[Tue Oct 22 15:15:45 2024]
rule create_hq_bams:
input: /data/CCBR/charlie_test_wil/charlie/results/GI1_N/circExplorer/GI1_N.BSJ.bam, /data/CCBR/charlie_test_wil/charlie/results/circRNA_master_counts.tsv.gz
output: /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam
jobid: 0
reason: Forced execution
wildcards: sample=GI1_N
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp
set -exo pipefail
outdir=$(dirname /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam)
if [ ! -d $outdir ];then mkdir -p $outdir;fi
cd $outdir
python3 /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHARLIE/.v0.11.1/workflow/scripts/_bam_filter_BSJ_for_HQonly.py \
-i /data/CCBR/charlie_test_wil/charlie/results/GI1_N/circExplorer/GI1_N.BSJ.bam \
-t /data/CCBR/charlie_test_wil/charlie/results/circRNA_master_counts.tsv.gz \
-o /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam \
--regions /data/CCBR/charlie_test_wil/charlie/ref/ref.fa.regions \
--host "hg38" \
--additives "ERCC" \
--viruses "NC_009333.1" \
--sample_name GI1_N
samtools index /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam
for bam in $(ls GI1_N.*.HQ_only.BSJ.bam);do
if [ ! -f "${bam}.bai" ];then
samtools index $bam
fi
done
Activating singularity image /vf/users/CCBR/charlie_test_wil/charlie/.snakemake/singularity/33cc10ca451509d6b721cc161a2d638c.simg
++ dirname /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam
+ outdir=/data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams
+ '[' '!' -d /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams ']'
+ cd /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams
+ python3 /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHARLIE/.v0.11.1/workflow/scripts/_bam_filter_BSJ_for_HQonly.py -i /data/CCBR/charlie_test_wil/charlie/results/GI1_N/circExplorer/GI1_N.BSJ.bam -t /data/CCBR/charlie_test_wil/charlie/results/circRNA_master_counts.tsv.gz -o /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam --regions /data/CCBR/charlie_test_wil/charlie/ref/ref.fa.regions --host hg38 --additives ERCC --viruses NC_009333.1 --sample_name GI1_N
Traceback (most recent call last):
File "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHARLIE/.v0.11.1/workflow/scripts/_bam_filter_BSJ_for_HQonly.py", line 2, in <module>
import pandas as pd
File "/data/CCBR_Pipeliner/Tools/ccbr_tools/v0.1/pandas/__init__.py", line 19, in <module>
raise ImportError(
ImportError: Unable to import required dependencies:
numpy: Error importing numpy: you should not try to import numpy from
its source directory; please exit the numpy source tree, and relaunch
your python interpreter from there.
[Tue Oct 22 15:15:46 2024]
Error in rule create_hq_bams:
jobid: 0
input: /data/CCBR/charlie_test_wil/charlie/results/GI1_N/circExplorer/GI1_N.BSJ.bam, /data/CCBR/charlie_test_wil/charlie/results/circRNA_master_counts.tsv.gz
output: /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam
shell:
set -exo pipefail
outdir=$(dirname /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam)
if [ ! -d $outdir ];then mkdir -p $outdir;fi
cd $outdir
python3 /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHARLIE/.v0.11.1/workflow/scripts/_bam_filter_BSJ_for_HQonly.py \
-i /data/CCBR/charlie_test_wil/charlie/results/GI1_N/circExplorer/GI1_N.BSJ.bam \
-t /data/CCBR/charlie_test_wil/charlie/results/circRNA_master_counts.tsv.gz \
-o /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam \
--regions /data/CCBR/charlie_test_wil/charlie/ref/ref.fa.regions \
--host "hg38" \
--additives "ERCC" \
--viruses "NC_009333.1" \
--sample_name GI1_N
samtools index /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam
for bam in $(ls GI1_N.*.HQ_only.BSJ.bam);do
if [ ! -f "${bam}.bai" ];then
samtools index $bam
fi
done
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Basically seems like numpy could not be correctly imported ... I don't know why. This may be related to the python error that you were showing me earlier @kelly-sovacool
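For what it's worth, the traceback shows pandas being picked up from under /data/CCBR_Pipeliner/Tools/ccbr_tools/v0.1/. A rough diagnostic to see which numpy/pandas the container actually resolves, reusing the image path from the log above (bind mounts and the real run's environment are omitted, so treat this strictly as a sketch):
singularity exec /vf/users/CCBR/charlie_test_wil/charlie/.snakemake/singularity/33cc10ca451509d6b721cc161a2d638c.simg \
  python3 -c 'import sys; print("\n".join(sys.path))'
singularity exec /vf/users/CCBR/charlie_test_wil/charlie/.snakemake/singularity/33cc10ca451509d6b721cc161a2d638c.simg \
  python3 -c 'import numpy, pandas; print(numpy.__file__); print(pandas.__file__)'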
> Basically seems like numpy could not be correctly imported ... I don't know why.
yes this is the same python error as before -- looks like I missed that docker.
> I have access to the /data/CCBR/charlie_test_wil/charlie folder .. but there is no jobby-related file there ... why?
@kopardev charlie writes the jobby files in logs/
> yes this is the same python error as before -- looks like I missed that docker.
@kelly-sovacool Can you please update the docker and ask @wilfriedguiblet to try again?
> Can you please update the docker and ask @wilfriedguiblet to try again?
@kopardev yes I'm in the middle of that here https://github.com/CCBR/Dockers/pull/35
Getting back to the retries / file latency issue:
Here's another output dir where I only ran charlie once and did not manually resubmit it: /data/CCBR/projects/techDev/charlie_test_rel-7/
grep FAIL logs/snakemake.log.jobby
star_circrnafinder.sample=GI1_N FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39044706.star_circrnafinder.sample=GI1_N.err
star_circrnafinder.sample=GI1_N FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39044709.star_circrnafinder.sample=GI1_N.err
merge_alignment_stats. FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048346.merge_alignment_stats..err
merge_alignment_stats. FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048355.merge_alignment_stats..err
create_hq_bams.sample=GI1_N FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048998.create_hq_bams.sample=GI1_N.err
create_hq_bams.sample=GI1_T FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049001.create_hq_bams.sample=GI1_T.err
create_hq_bams.sample=GI1_N FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049044.create_hq_bams.sample=GI1_N.err
create_hq_bams.sample=GI1_T FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049045.create_hq_bams.sample=GI1_T.err
create_hq_bams.sample=GI1_N FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049057.create_hq_bams.sample=GI1_N.err
create_hq_bams.sample=GI1_T FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049058.create_hq_bams.sample=GI1_T.err
It looks like it is correctly resubmitting failed jobs with --retries 2. But rules seem to be failing due to "missing" output files on the first attempt, even though the files do exist.
Error message for attempt 1:
Waiting at most 120 seconds for missing files.
MissingOutputException in rule star_circrnafinder in file /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHARLIE/.v0.11.1/workflow/rules/align.smk, line 438:
Job 0 completed successfully, but some output files are missing. Missing files after 120 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
/data/CCBR/projects/techDev/charlie_test_rel-7/results/GI1_N/STAR_circRNAFinder/GI1_N.Chimeric.out.sam
/data/CCBR/projects/techDev/charlie_test_rel-7/results/GI1_N/STAR_circRNAFinder/GI1_N.Chimeric.out.junction
/data/CCBR/projects/techDev/charlie_test_rel-7/results/GI1_N/STAR_circRNAFinder/GI1_N.SJ.out.tab
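(The claim that these outputs do exist can be spot-checked directly against the paths above, e.g.:)
ls -l /data/CCBR/projects/techDev/charlie_test_rel-7/results/GI1_N/STAR_circRNAFinder/GI1_N.Chimeric.out.sam \
      /data/CCBR/projects/techDev/charlie_test_rel-7/results/GI1_N/STAR_circRNAFinder/GI1_N.Chimeric.out.junction \
      /data/CCBR/projects/techDev/charlie_test_rel-7/results/GI1_N/STAR_circRNAFinder/GI1_N.SJ.out.tab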
Error message for attempt 2:
FATAL INPUT error, could not open input file with junctions from the 1st pass=GI1_N._STARpass1//SJ.out.tab
It completed successfully on the 3rd attempt:
grep star_circrnafinder.sample=GI1_N logs/snakemake.log.jobby
star_circrnafinder.sample=GI1_N FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39044706.star_circrnafinder.sample=GI1_N.err
star_circrnafinder.sample=GI1_N FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39044709.star_circrnafinder.sample=GI1_N.err
star_circrnafinder.sample=GI1_N COMPLETED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39046188.star_circrnafinder.sample=GI1_N.err
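The attempt-2 FATAL INPUT error suggests the retry started from leftover state: STAR expected a first-pass SJ.out.tab under GI1_N._STARpass1 that was missing or incomplete. Assuming the rule skips pass 1 when that directory already exists (an assumption, not confirmed from the actual charlie rule), a generic mitigation would be to clear the stale directory at the top of the rule's shell block so each attempt redoes both passes:
# hypothetical guard, not taken from the actual charlie rule
if [ -d GI1_N._STARpass1 ]; then rm -rf GI1_N._STARpass1; fi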
For rule merge_alignment_stats, the error message for attempt 1:
paste: /data/CCBR/projects/techDev/charlie_test_rel-7/results/alignmentstats.txt: No such file or directory
Error message for attempt 2:
cp: cannot create regular file '/data/CCBR/projects/techDev/charlie_test_rel-7/results/alignmentstats.txt': File exists
It completed successfully on the 3rd attempt.
grep merge_alignment_stats logs/snakemake.log.jobby
merge_alignment_stats. FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048346.merge_alignment_stats..err
merge_alignment_stats. FAILED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048355.merge_alignment_stats..err
merge_alignment_stats. COMPLETED /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048379.merge_alignment_stats..err
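Both failures involve results/alignmentstats.txt being in an unexpected state (missing on attempt 1, already present on attempt 2). A common way to make such a step robust to retries, sketched generically here rather than taken from the actual charlie rule, is to build the merged table under a temporary name and then move it over whatever a previous attempt left behind:
# hypothetical sketch; sampleA_stats.txt and sampleB_stats.txt stand in for the rule's real inputs
out=/data/CCBR/projects/techDev/charlie_test_rel-7/results/alignmentstats.txt
tmp="${out}.tmp.$$"
paste sampleA_stats.txt sampleB_stats.txt > "$tmp"
mv -f "$tmp" "$out"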
All of the create_hq_bams jobs failed due to an import error, which will be resolved by upgrading the base container to v7 (https://github.com/CCBR/CHARLIE/pull/125). This is unrelated to the current issue.
grep "ImportError" logs/*create_hq*
logs/39030852.39048998.create_hq_bams.sample=GI1_N.err: raise ImportError(
logs/39030852.39048998.create_hq_bams.sample=GI1_N.err:ImportError: Unable to import required dependencies:
logs/39030852.39049001.create_hq_bams.sample=GI1_T.err: raise ImportError(
logs/39030852.39049001.create_hq_bams.sample=GI1_T.err:ImportError: Unable to import required dependencies:
logs/39030852.39049044.create_hq_bams.sample=GI1_N.err: raise ImportError(
logs/39030852.39049044.create_hq_bams.sample=GI1_N.err:ImportError: Unable to import required dependencies:
logs/39030852.39049045.create_hq_bams.sample=GI1_T.err: raise ImportError(
logs/39030852.39049045.create_hq_bams.sample=GI1_T.err:ImportError: Unable to import required dependencies:
logs/39030852.39049057.create_hq_bams.sample=GI1_N.err: raise ImportError(
logs/39030852.39049057.create_hq_bams.sample=GI1_N.err:ImportError: Unable to import required dependencies:
logs/39030852.39049058.create_hq_bams.sample=GI1_T.err: raise ImportError(
logs/39030852.39049058.create_hq_bams.sample=GI1_T.err:ImportError: Unable to import required dependencies:
@kopardev So I think the overall workflow will succeed once we fix the container problem with #125. But for this issue, it would be best if we could figure out how to avoid needing to retry rules multiple times. Should we try increasing latency wait?
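For reference, the flags under discussion would look roughly like this in the snakemake invocation (illustrative only; the real charlie launcher assembles a longer command line):
snakemake --snakefile workflow/Snakefile \
  --use-singularity \
  --retries 2 \
  --latency-wait 300   # currently 120; a higher value tolerates slow filesystem syncs but delays reporting of real failures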
@kelly-sovacool:
- what do you have in mind for --latency-wait? 300?
I hesitate to go too high because that will needlessly delay the overall pipeline run completion. Should we reach out to Biowulf staff about this?
- another peculiar observation: if something fails, it fails twice and succeeds on attempt no. 3 ... I could not find any rule that failed on attempt no. 1 and succeeded on attempt no. 2 ... Have you?
I thought this was true, until we encountered #127
@kelly-sovacool is there a good root-cause for this yet? Else, we move this to Backlog with latency set to 300 and reaching out to Biowulf staff.
> is there a good root-cause for this yet?
so far I have not encountered this error recently, even with the original --latency-wait 120, despite multiple charlie runs that failed for other reasons (#127, #128)
@kopardev found jobs will sometimes fail spontaneously and work on the re-run. It seems to be a file latency issue?