epi2me-labs / wf-somatic-variation


Error in snv:clairs_predict_pileup_indel #17

Open alexcoppe opened 2 months ago

alexcoppe commented 2 months ago

Operating System

Other Linux (please specify below)

Other Linux

Ubuntu

Workflow Version

v1.1.0

Workflow Execution

Command line

EPI2ME Version

No response

CLI command run

nextflow-23.10.0-all run epi2me-labs/wf-somatic-variation -profile singularity -resume -process.executor 'pbspro' -process.cpus 64 -process.memory 256.GB -latest -work-dir '/archive/s2/genomics/onco_nanopore/test' -with-timeline --snv --sv --mod --sample_name 'OHU0002HI' --bam_normal '/archive/s2/genomics/onco_nanopore/HUM_OHU_OHU0002HTNDN/OHU0002HTNDN.bam' --bam_tumor '/archive/s2/genomics/onco_nanopore/HUM_OHU_OHU0002ITTDN/OHU0002ITTDN.bam' --ref '/archive/s1/sconsRequirements/databases/reference/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta' --out_dir '/archive/s2/genomics/onco_nanopore/OHU0002HI_wf-somatic-variation_2024_04_16' --basecaller_cfg 'dna_r10.4.1_e8.2_400bps_sup@v4.2.0' --phase_normal --classify_insert --force_strand --normal_min_coverage 0 --tumor_min_coverage 0

Workflow Execution - CLI Execution Profile

singularity

What happened?

It stopped and showed the error below.

Relevant log output

ERROR ~ Error executing process > 'snv:clairs_predict_pileup_indel (143)'

Caused by:
  Process `snv:clairs_predict_pileup_indel (143)` terminated with an error exit status (134)

Command executed:

  mkdir vcf_output/
  python3 $CLAIRS_PATH/clairs.py predict \
      --tensor_fn indel_pileup_tensor_can/chr3.24_0_1_indel \
      --call_fn vcf_output/indel_p_chr3.24_0_1_indel.vcf \
      --chkpnt_fn ${CLAIR_MODELS_PATH}/ont_r10_dorado_sup_5khz/indel/pileup.pkl \
      --platform ont \
      --use_gpu False \
      --ctg_name chr3 \
      --pileup \
      --enable_indel_calling True \
       \

Command exit status:
  134

Command output:
  (empty)

Command error:
  INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
  OMP: Error #15: Initializing libomp.so, but found unknown library already initialized.
  OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/
  .command.sh: line 13:    70 Aborted                 (core dumped) python3 $CLAIRS_PATH/clairs.py predict --tensor_fn indel_pileup_tensor_can/chr3.24_0_1_indel --call_fn vcf_output/indel_p_chr3.24_0_1_indel.vcf --chkpnt_fn ${CLAIR_MODELS_PATH}/ont_r10_dorado_sup_5khz/indel/pileup.pkl --platform ont --use_gpu False --ctg_name chr3 --pileup --enable_indel_calling True

Work dir:
  /archive/s2/genomics/onco_nanopore/test/ce/1234f1fc8b088439eb03ba4f9933a3

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

Application activity log entry

No response

Were you able to successfully run the latest version of the workflow with the demo data?

other (please describe below)

Other demo data information

Didn't do it
oneillkza commented 2 months ago

I am getting this exact same error on a CentOS 7 cluster.

Command error:
  OMP: Error #15: Initializing libomp.so, but found unknown library already initialized.
  OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/
  .command.sh: line 12:    41 Aborted                 (core dumped) python3 $CLAIRS_PATH/clairs.py predict --tensor_fn pileup_tensor_can/chr2.33_0_1 --call_fn vcf_output/p_chr2.33_0_1.vcf --chkpnt_fn ${CLAIR_MODELS_PATH}/ont_r10_dorado_sup_5khz/pileup.pkl --platform ont --use_gpu False --ctg_name chr2 --pileup

It seems to be somewhat sporadic -- about half the instances of snv:clairs_predict_pileup_indel and snv:clairs_predict_pileup_snv are running to completion.

I also cannot reproduce it. I can go into the work directory and run bash .command.run, and it completes without issue. This includes re-running on the exact same node of our cluster where it failed the first time.
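
For reference, the manual replication described above amounts to the following sketch; the work directory is the one from the error report earlier in this thread, so substitute the work directory of the failed task in your own run:

  # change into the failed task's work directory reported by Nextflow,
  # then re-run the staged task script exactly as the log's tip suggests
  cd /archive/s2/genomics/onco_nanopore/test/ce/1234f1fc8b088439eb03ba4f9933a3
  bash .command.run   # completes without issue when run manually, as described above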

RenzoTale88 commented 2 months ago

@oneillkza @alexcoppe yes, this is an error that we observe sporadically on some clusters. Most of the time, restarting the workflow with -resume will let the process succeed without hitting the same issue. We are looking into this; thanks for reporting it.

alexcoppe commented 2 months ago

@oneillkza @alexcoppe yes, this is an error that we observe sporadically on some clusters. Most of the time, restarting the workflow with -resume will let the process succeed without hitting the same issue. We are looking into this; thanks for reporting it.

@RenzoTale88, thank you for the help. I restarted the workflow with -resume a couple of times and ended up with the same error :frowning_face:

RenzoTale88 commented 2 months ago

@alexcoppe thanks for confirming that the issue persists. Did the error occur in the same process (i.e. the same chunk) or in a different chunk?

alexcoppe commented 2 months ago

@RenzoTale88 exactly the same error as above.

oneillkza commented 1 month ago

I have found that when restarting with -resume it gets through the specific job it failed on before, but tends to fail again later on. I got through most of what I think were the jobs for this process by resubmitting about a dozen times.

I've also been trying to add a retry option to the process (currently it only retries on certain error codes, but not this one), since I think that would likely solve this; see the sketch below. But Nextflow seems to be ignoring the contents of the config file I pass it (it won't even override memory requirements for processes), and I'm still not sure what's going on there.
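
For anyone attempting the same thing, a minimal sketch of such an override config is below. The process-name pattern and retry count are assumptions, and (as noted above) Nextflow did not seem to pick up a config passed this way in my runs:

  // custom.config -- sketch only, passed with `-c custom.config`
  process {
      // match the failing ClairS pileup processes; the name pattern is an assumption
      withName: '.*clairs_predict_pileup_.*' {
          // retry only on the SIGABRT exit status (134) seen in the log, up to three times
          errorStrategy = { task.exitStatus == 134 ? 'retry' : 'terminate' }
          maxRetries    = 3
      }
  }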

rhagelaar commented 1 month ago

I was wondering if there is already a fix for this issue? The error keeps recurring.

RenzoTale88 commented 1 month ago

@rhagelaar @oneillkza we are investigating this, will keep you updated!

selmapichot commented 4 weeks ago

Hi, just to report that I am having the exact same issue. Please keep us updated on any fix :)

RenzoTale88 commented 4 weeks ago

@selmapichot @rhagelaar @oneillkza @alexcoppe sorry for the slow progress on this. We are still investigating what is causing this issue. In the meantime, could you please try the latest release of the workflow (v1.2.1) and check whether it runs to completion?
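
For reference, a specific release can be selected with Nextflow's -r option; a minimal sketch, assuming the release tag is v1.2.1 and keeping the remaining options as in the original command:

  # run the pinned release; all other options (profile, inputs, etc.) unchanged
  nextflow run epi2me-labs/wf-somatic-variation -r v1.2.1 -profile singularity -resume ...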

alexcoppe commented 3 weeks ago

I tried it, but it gave me an error about the BAM containing reads basecalled with more than one basecaller model. I opened a new issue. Thank you very much for your help.

RenzoTale88 commented 3 weeks ago

Data from mixed basecalling models is not supported by ClairS, so please ensure that your data was called with a single model and try again.