CCBR / CHARLIE

Circrnas in Host And viRuses anaLysis pIpEline for Detection Annotation Quantification of circRNAs
https://ccbr.github.io/CHARLIE/
MIT License

snakemake jobs failing due to missing output files which do exist #123

Open kelly-sovacool opened 3 weeks ago

kelly-sovacool commented 3 weeks ago

@kopardev found jobs will sometimes fail spontaneously and work on the re-run. It seems to be a file latency issue?

kelly-sovacool commented 3 weeks ago

Wilfried ran into this issue too: the snakemake rules succeeded, but the overall slurm job failed.

kelly-sovacool commented 3 weeks ago

retries is already set to 2 for both local and slurm mode 🤔

https://github.com/CCBR/CHARLIE/blob/d8f9cf012ec50e08c68ccf70f769bae255e239a7/charlie#L489

https://github.com/CCBR/CHARLIE/blob/d8f9cf012ec50e08c68ccf70f769bae255e239a7/charlie#L531

it seems snakemake is not honoring it?

kopardev commented 3 weeks ago

@kelly-sovacool ... can you point me to the output folder.. I am looking for the jobinfo.short file.

kelly-sovacool commented 3 weeks ago

@kopardev

Wilfried's is here: /data/CCBR/charlie_test_wil/charlie/

jobby short file /data/charlie_test_wil/charlie/logs/snakemake.log.jobby.short

kopardev commented 3 weeks ago

It appears to me that rule create_hq_bams is failing for both samples after retrying 2 times. Looking into the err file shows:

╭─kopardevn at helix in /data/CCBR/charlie_test_wil/charlie/logs 24-10-22 - 22:07:39 - 1710
╰─○ cat /data/CCBR/charlie_test_wil/charlie/logs/'38952183.38954065.create_hq_bams.sample=GI1_N.err'
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954
Select jobs to execute...

[Tue Oct 22 15:15:45 2024]
rule create_hq_bams:
    input: /data/CCBR/charlie_test_wil/charlie/results/GI1_N/circExplorer/GI1_N.BSJ.bam, /data/CCBR/charlie_test_wil/charlie/results/circRNA_master_counts.tsv.gz
    output: /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam
    jobid: 0
    reason: Forced execution
    wildcards: sample=GI1_N
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp

set -exo pipefail
outdir=$(dirname /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam)
if [ ! -d $outdir ];then mkdir -p $outdir;fi
cd $outdir
python3 /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHARLIE/.v0.11.1/workflow/scripts/_bam_filter_BSJ_for_HQonly.py \
    -i /data/CCBR/charlie_test_wil/charlie/results/GI1_N/circExplorer/GI1_N.BSJ.bam \
    -t /data/CCBR/charlie_test_wil/charlie/results/circRNA_master_counts.tsv.gz \
    -o /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam \
    --regions /data/CCBR/charlie_test_wil/charlie/ref/ref.fa.regions \
    --host "hg38" \
    --additives "ERCC" \
    --viruses "NC_009333.1" \
    --sample_name GI1_N
samtools index /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam
for bam in $(ls GI1_N.*.HQ_only.BSJ.bam);do
    if [ ! -f "${bam}.bai" ];then
        samtools index $bam
    fi
done

Activating singularity image /vf/users/CCBR/charlie_test_wil/charlie/.snakemake/singularity/33cc10ca451509d6b721cc161a2d638c.simg
++ dirname /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam
+ outdir=/data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams
+ '[' '!' -d /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams ']'
+ cd /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams
+ python3 /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHARLIE/.v0.11.1/workflow/scripts/_bam_filter_BSJ_for_HQonly.py -i /data/CCBR/charlie_test_wil/charlie/results/GI1_N/circExplorer/GI1_N.BSJ.bam -t /data/CCBR/charlie_test_wil/charlie/results/circRNA_master_counts.tsv.gz -o /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam --regions /data/CCBR/charlie_test_wil/charlie/ref/ref.fa.regions --host hg38 --additives ERCC --viruses NC_009333.1 --sample_name GI1_N
Traceback (most recent call last):
  File "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHARLIE/.v0.11.1/workflow/scripts/_bam_filter_BSJ_for_HQonly.py", line 2, in <module>
    import pandas as pd
  File "/data/CCBR_Pipeliner/Tools/ccbr_tools/v0.1/pandas/__init__.py", line 19, in <module>
    raise ImportError(
ImportError: Unable to import required dependencies:
numpy: Error importing numpy: you should not try to import numpy from
        its source directory; please exit the numpy source tree, and relaunch
        your python interpreter from there.
[Tue Oct 22 15:15:46 2024]
Error in rule create_hq_bams:
    jobid: 0
    input: /data/CCBR/charlie_test_wil/charlie/results/GI1_N/circExplorer/GI1_N.BSJ.bam, /data/CCBR/charlie_test_wil/charlie/results/circRNA_master_counts.tsv.gz
    output: /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam
    shell:

set -exo pipefail
outdir=$(dirname /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam)
if [ ! -d $outdir ];then mkdir -p $outdir;fi
cd $outdir
python3 /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHARLIE/.v0.11.1/workflow/scripts/_bam_filter_BSJ_for_HQonly.py \
    -i /data/CCBR/charlie_test_wil/charlie/results/GI1_N/circExplorer/GI1_N.BSJ.bam \
    -t /data/CCBR/charlie_test_wil/charlie/results/circRNA_master_counts.tsv.gz \
    -o /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam \
    --regions /data/CCBR/charlie_test_wil/charlie/ref/ref.fa.regions \
    --host "hg38" \
    --additives "ERCC" \
    --viruses "NC_009333.1" \
    --sample_name GI1_N
samtools index /data/CCBR/charlie_test_wil/charlie/results/HQ_BSJ_bams/GI1_N.HQ_only.BSJ.bam
for bam in $(ls GI1_N.*.HQ_only.BSJ.bam);do
    if [ ! -f "${bam}.bai" ];then
        samtools index $bam
    fi
done

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Basically, it seems numpy could not be imported correctly ... I don't know why. This may be related to the python error that you were showing me earlier @kelly-sovacool
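The traceback above shows pandas being loaded from /data/CCBR_Pipeliner/Tools/ccbr_tools/v0.1/pandas rather than from a location inside the container. One way to check where the interpreter would pick a package up from, without actually importing it, is a small diagnostic like the one below. This is only a sketch: the helper names `locate_package` and `paths_outside` are hypothetical and not part of CHARLIE, and the `/usr` prefix is an assumption about where the container keeps its own packages.

```python
import importlib.util
import sys


def locate_package(name):
    """Return the file a top-level package would be imported from, or None.

    find_spec resolves a top-level import without executing the package,
    so this works even when the import itself would raise.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


def paths_outside(prefix, search_path=None):
    """List search-path entries that do not live under `prefix`."""
    entries = sys.path if search_path is None else search_path
    return [p for p in entries if p and not p.startswith(prefix)]


if __name__ == "__main__":
    for pkg in ("numpy", "pandas"):
        print(pkg, "->", locate_package(pkg))
    # "/usr" is an assumed container prefix; adjust to the image in use.
    for entry in paths_outside("/usr"):
        print("outside prefix:", entry)
```

Running this inside the same container as the failing rule would show whether a host directory is ahead of the container's own packages on the search path.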

kelly-sovacool commented 3 weeks ago

yes this is the same python error as before -- looks like I missed that docker.

kelly-sovacool commented 3 weeks ago
  • I have access to /data/CCBR/charlie_test_wil/charlie folder .. but there is no jobby related file there ... why? @kelly-sovacool

@kopardev charlie writes the jobby files in logs/

kopardev commented 3 weeks ago

@kelly-sovacool Can you please update the docker and ask @wilfriedguiblet to try again?

kelly-sovacool commented 3 weeks ago

@kopardev yes I'm in the middle of that here https://github.com/CCBR/Dockers/pull/35

kelly-sovacool commented 3 weeks ago

Getting back to the retries / file latency issue:

Here's another output dir where I only ran charlie once and did not manually resubmit it: /data/CCBR/projects/techDev/charlie_test_rel-7/

grep FAIL logs/snakemake.log.jobby

star_circrnafinder.sample=GI1_N FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39044706.star_circrnafinder.sample=GI1_N.err
star_circrnafinder.sample=GI1_N FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39044709.star_circrnafinder.sample=GI1_N.err
merge_alignment_stats.  FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048346.merge_alignment_stats..err
merge_alignment_stats.  FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048355.merge_alignment_stats..err
create_hq_bams.sample=GI1_N     FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048998.create_hq_bams.sample=GI1_N.err
create_hq_bams.sample=GI1_T     FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049001.create_hq_bams.sample=GI1_T.err
create_hq_bams.sample=GI1_N     FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049044.create_hq_bams.sample=GI1_N.err
create_hq_bams.sample=GI1_T     FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049045.create_hq_bams.sample=GI1_T.err
create_hq_bams.sample=GI1_N     FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049057.create_hq_bams.sample=GI1_N.err
create_hq_bams.sample=GI1_T     FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39049058.create_hq_bams.sample=GI1_T.err

It looks like it is correctly resubmitting failed jobs with --retries 2.
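The grep output above can be summarized per job to see how many attempts each one needed. The following is a minimal sketch, assuming each jobby line holds three whitespace-separated fields (job name, state, path to the .err log) and that lines appear in submission order; both are inferred from the output shown here, not from jobby's documentation. The function names are hypothetical.

```python
from collections import defaultdict


def attempts_per_job(jobby_lines):
    """Group job states by job name, preserving the order they appear in."""
    history = defaultdict(list)
    for line in jobby_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip headers, blank lines, anything unexpected
        job, state, _log = fields
        history[job].append(state)
    return dict(history)


def failures_before_success(states):
    """Count FAILED attempts that precede the first COMPLETED one.

    Returns None when the job never completed.
    """
    if "COMPLETED" not in states:
        return None
    first_ok = states.index("COMPLETED")
    return sum(1 for state in states[:first_ok] if state == "FAILED")


if __name__ == "__main__":
    with open("logs/snakemake.log.jobby") as handle:
        for job, states in attempts_per_job(handle).items():
            print(job, states, failures_before_success(states))
```

A job that failed twice and then completed reports 2, while a job that never completed (such as the create_hq_bams jobs here) reports None.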

But rules seem to be failing due to missing output files on the first attempt even though they do exist.

star_circrnafinder

Error message for attempt 1:

Waiting at most 120 seconds for missing files.
MissingOutputException in rule star_circrnafinder in file /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHARLIE/.v0.11.1/workflow/rules/align.smk, line 438:
Job 0 completed successfully, but some output files are missing. Missing files after 120 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
/data/CCBR/projects/techDev/charlie_test_rel-7/results/GI1_N/STAR_circRNAFinder/GI1_N.Chimeric.out.sam
/data/CCBR/projects/techDev/charlie_test_rel-7/results/GI1_N/STAR_circRNAFinder/GI1_N.Chimeric.out.junction
/data/CCBR/projects/techDev/charlie_test_rel-7/results/GI1_N/STAR_circRNAFinder/GI1_N.SJ.out.tab

Error message for attempt 2:

FATAL INPUT error, could not open input file with junctions from the 1st pass=GI1_N._STARpass1//SJ.out.tab

It completed successfully on the 3rd attempt:

grep star_circrnafinder.sample=GI1_N logs/snakemake.log.jobby

star_circrnafinder.sample=GI1_N FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39044706.star_circrnafinder.sample=GI1_N.err
star_circrnafinder.sample=GI1_N FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39044709.star_circrnafinder.sample=GI1_N.err
star_circrnafinder.sample=GI1_N COMPLETED       /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39046188.star_circrnafinder.sample=GI1_N.err
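For context on the attempt 1 message, the idea behind a latency wait can be sketched as a polling loop that checks for the expected outputs until a timeout expires. This is a conceptual illustration only, not Snakemake's actual implementation; the function name and the `poll_interval` parameter are hypothetical.

```python
import os
import time


def wait_for_files(paths, latency_wait=120, poll_interval=1.0):
    """Poll until every path exists or `latency_wait` seconds have elapsed.

    Returns the list of paths still missing (empty on success).
    """
    deadline = time.monotonic() + latency_wait
    missing = [p for p in paths if not os.path.exists(p)]
    while missing and time.monotonic() < deadline:
        time.sleep(poll_interval)
        # Only re-check the files that were missing on the previous pass.
        missing = [p for p in missing if not os.path.exists(p)]
    return missing
```

In a loop of this shape, a larger timeout only adds delay when a file is actually missing, because the loop returns as soon as everything is visible. Whether that holds for the Snakemake version in use is worth confirming before changing --latency-wait.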

merge_alignment_stats

Error message for attempt 1:

paste: /data/CCBR/projects/techDev/charlie_test_rel-7/results/alignmentstats.txt: No such file or directory

Error message for attempt 2:

cp: cannot create regular file '/data/CCBR/projects/techDev/charlie_test_rel-7/results/alignmentstats.txt': File exists

It completed successfully on the 3rd attempt.

grep merge_alignment_stats logs/snakemake.log.jobby

merge_alignment_stats.  FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048346.merge_alignment_stats..err
merge_alignment_stats.  FAILED  /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048355.merge_alignment_stats..err
merge_alignment_stats.  COMPLETED       /vf/users/CCBR/projects/techDev/charlie_test_rel-7/logs/39030852.39048379.merge_alignment_stats..err
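The two attempts fail with different messages (first a missing file, then "File exists"), which would be consistent with a retry running against state left behind by the earlier attempt. That is an inference, not something confirmed in this thread. One common way to make a step safe to rerun is to write to a temporary file and move it into place only at the end. The sketch below illustrates that pattern; `write_atomically` is a hypothetical helper and not what merge_alignment_stats currently does.

```python
import os
import tempfile


def write_atomically(dest, text):
    """Write `text` to `dest` so that readers never observe a partial file.

    The data goes to a temporary file in the same directory first, then
    os.replace() swaps it into place, overwriting any file left behind by
    a previous attempt.
    """
    directory = os.path.dirname(os.path.abspath(dest))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.replace(tmp_path, dest)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This does not address filesystem latency itself; it only keeps a rerun from tripping over a half-written or pre-existing output.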

create_hq_bams

All of these jobs failed due to an import error, which will be resolved by upgrading the base container to v7 (https://github.com/CCBR/CHARLIE/pull/125). This is unrelated to the current issue.

grep "ImportError" logs/*create_hq*

logs/39030852.39048998.create_hq_bams.sample=GI1_N.err:    raise ImportError(
logs/39030852.39048998.create_hq_bams.sample=GI1_N.err:ImportError: Unable to import required dependencies:
logs/39030852.39049001.create_hq_bams.sample=GI1_T.err:    raise ImportError(
logs/39030852.39049001.create_hq_bams.sample=GI1_T.err:ImportError: Unable to import required dependencies:
logs/39030852.39049044.create_hq_bams.sample=GI1_N.err:    raise ImportError(
logs/39030852.39049044.create_hq_bams.sample=GI1_N.err:ImportError: Unable to import required dependencies:
logs/39030852.39049045.create_hq_bams.sample=GI1_T.err:    raise ImportError(
logs/39030852.39049045.create_hq_bams.sample=GI1_T.err:ImportError: Unable to import required dependencies:
logs/39030852.39049057.create_hq_bams.sample=GI1_N.err:    raise ImportError(
logs/39030852.39049057.create_hq_bams.sample=GI1_N.err:ImportError: Unable to import required dependencies:
logs/39030852.39049058.create_hq_bams.sample=GI1_T.err:    raise ImportError(
logs/39030852.39049058.create_hq_bams.sample=GI1_T.err:ImportError: Unable to import required dependencies:

kelly-sovacool commented 3 weeks ago

@kopardev So I think the overall workflow will succeed once we fix the container problem with #125. But for this issue, it would be best if we could figure out how to avoid needing to retry rules multiple times. Should we try increasing latency wait?

kopardev commented 3 weeks ago

@kelly-sovacool:

kelly-sovacool commented 2 weeks ago
  • what do you have in mind for --latency-wait? 300?

I hesitate to go too high because that will needlessly delay the overall pipeline run completion. Should we reach out to biowulf staff about this?

  • another peculiar observation: if something fails .. it fails twice and succeeds on attempt no. 3... I could not find any rule which failed on attempt no. 1 and succeeded on attempt no. 2 ... Have you?

I thought this was true, until we encountered #127

kopardev commented 1 week ago

@kelly-sovacool is there a good root cause for this yet? Otherwise, we move this to Backlog with latency set to 300 and reach out to Biowulf staff.

kelly-sovacool commented 1 week ago

I have not encountered this error recently, even with the original --latency-wait 120, despite multiple charlie runs that failed for other reasons (#127, #128)