SBIMB / StellarPGx

Calling star alleles in highly polymorphic pharmacogenes (e.g. CYP450 genes) by leveraging genome graph-based variant detection.

terminated with an error exit status (140) #30

Closed muhligs closed 1 year ago

muhligs commented 1 year ago

Hi,

I am running the StellarPGx test run in a cluster setting. I get the following error:

nextflow run main.nf -profile slurm,test
N E X T F L O W  ~  version 22.10.6
Launching `main.nf` [elated_borg] DSL1 - revision: 6d430c4597
executor >  slurm (10)
[74/2fa1d9] process > call_snvs1 (1)         [100%] 1 of 1 ✔
[d1/81f5d3] process > call_snvs2 (1)         [100%] 1 of 1 ✔
[fa/aa5f6e] process > call_sv_del (1)        [100%] 1 of 1 ✔
[4f/168d3d] process > call_sv_dup (1)        [100%] 1 of 1 ✔
[eb/e92d2c] process > get_depth (1)          [100%] 1 of 1 ✔
[ce/8b52e9] process > format_snvs (1)        [100%] 1 of 1 ✔
[40/c0b58a] process > get_core_var (HG03130) [100%] 1 of 1 ✔
[a1/23a849] process > analyse_1 (HG03130)    [100%] 1 of 1, failed: 1 ✔
[57/38072c] process > analyse_2 (HG03130)    [100%] 1 of 1, failed: 1 ✔
[a9/5c31c3] process > analyse_3 (HG03130)    [100%] 1 of 1, failed: 1 ✔
[-        ] process > call_stars             -
[a9/5c31c3] NOTE: Process `analyse_3 (HG03130)` terminated with an error exit status (140) -- Error is ignored
[57/38072c] NOTE: Process `analyse_2 (HG03130)` terminated with an error exit status (140) -- Error is ignored

Completed at: 19-Jun-2023 16:26:12
Duration    : 4h 47s
CPU hours   : 12.0 (100% failed)
Succeeded   : 7
Ignored     : 3
Failed      : 3

After the run, there is a ./results/cyp2d6/variants directory with a VCF and index, but no 'alleles' directory.
It appears that the three failed 'analyse' jobs timed out (4-hour walltime).
Is this expected for the 'test' run?
How do I adjust the walltime for these jobs? (Sorry, I am new to Nextflow.)

Thx, Morten
twesigomwedavid commented 1 year ago

@muhligs, I think this might also have something to do with the settings on your cluster. What error do you get in work/a1/23a8499[auto_complete_with_tab]/.command.err?
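
Also, exit status 140 from Slurm usually means the job was killed after hitting a resource limit, most likely the 4-hour walltime you mention. If it does turn out to be a timeout, the usual way to raise it is Nextflow's time directive in nextflow.config; roughly something like the following (the 12h value is only an illustrative guess):

process {
    time = '12h'    // illustrative walltime; pick a value that suits your cluster
}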

muhligs commented 1 year ago

I get:

WARNING: /etc/singularity/ exists, migration to apptainer by system administrator is not complete
WARNING: DEPRECATED USAGE: Forwarding SINGULARITYENV_TMP as environment variable will not be supported in the future, use APPTAINERENV_TMP instead
WARNING: DEPRECATED USAGE: Forwarding SINGULARITYENV_TMPDIR as environment variable will not be supported in the future, use APPTAINERENV_TMPDIR instead
muhligs commented 1 year ago

In 40/c0b58a6da8b0e7c88813085784b276/.command.err I have:

WARNING: /etc/singularity/ exists, migration to apptainer by system administrator is not complete
WARNING: DEPRECATED USAGE: Forwarding SINGULARITYENV_TMP as environment variable will not be supported in the future, use APPTAINERENV_TMP instead
WARNING: DEPRECATED USAGE: Forwarding SINGULARITYENV_TMPDIR as environment variable will not be supported in the future, use APPTAINERENV_TMPDIR instead
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=PASS,Description=""All filters passed"">                         "
[W::bcf_hdr_parse] Could not parse header line: ##FILTER=<ID=PASS,Description=""All filters passed"">                           
Lines   total/split/realigned/skipped:  5/0/0/0
twesigomwedavid commented 1 year ago

@muhligs For Singularity, those are just warnings but not errors. The issue seems to be with BCFtools, but unfortunately, that error is not replicated on my end. It's a bit puzzling because previous steps with BCFtools (e.g. get_core_var) ran fine on your side. May I ask whether you made any modifications to any of the scripts/container?

muhligs commented 1 year ago

Yes, I commented out the partition choice in nextflow.config: // process.queue = 'batch'. That's all.
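
(If it matters, re-enabling that line with one of our own partition names would presumably look like this; 'normal' is just a placeholder:)

process.queue = 'normal'    // placeholder: a partition that actually exists on our cluster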

bcftools is v. 1.10 from bioconda.

twesigomwedavid commented 1 year ago

I think you should use the BCFtools provided in the Singularity container

muhligs commented 1 year ago

Ok, that basically amounts to uninstalling it from the conda environment, right? Thanks for your time.

Morten

twesigomwedavid commented 1 year ago

Not sure whether it needs to be uninstalled or not. I think you just need to make sure StellarPGx is using the tools provided in the Singularity container (this will prevent any reproducibility issues). There is no need to use the tools on your system other than Nextflow and Singularity themselves.

muhligs commented 1 year ago

Thanks, I will have a new go.

Morten

muhligs commented 1 year ago

Hi again, I made a new run with a conda environment only containing nextflow, made like this:

conda create -n stellar_nextflow
conda activate stellar_nextflow
mamba install nextflow=22.10

Then ran: nextflow run main.nf -profile slurm,test

I unfortunately get the same result:

N E X T F L O W  ~  version 22.10.6
Launching `main.nf` [zen_swirles] DSL1 - revision: 6d430c4597
executor >  slurm (10)
[ba/0ac714] process > call_snvs1 (1)         [100%] 1 of 1 ✔
[ca/8f5c93] process > call_snvs2 (1)         [100%] 1 of 1 ✔
[2c/d220da] process > call_sv_del (1)        [100%] 1 of 1 ✔
[9c/0168bf] process > call_sv_dup (1)        [100%] 1 of 1 ✔
[e0/f3b9df] process > get_depth (1)          [100%] 1 of 1 ✔
[94/3c46e8] process > format_snvs (1)        [100%] 1 of 1 ✔
[1d/7976f0] process > get_core_var (HG03130) [100%] 1 of 1 ✔
[48/00afb7] process > analyse_1 (HG03130)    [100%] 1 of 1, failed: 1 ✔
[d7/784824] process > analyse_2 (HG03130)    [100%] 1 of 1, failed: 1 ✔
[c7/16b97d] process > analyse_3 (HG03130)    [100%] 1 of 1, failed: 1 ✔
[-        ] process > call_stars             -
[c7/16b97d] NOTE: Process `analyse_3 (HG03130)` terminated with an error exit status (140) -- Error is ignored
[d7/784824] NOTE: Process `analyse_2 (HG03130)` terminated with an error exit status (140) -- Error is ignored

I suspect it may have something to do with the account specification. Is there a way to specify the account from the command line along with the nextflow command? (Sorry, I have never used Nextflow before.)

Thanks, Morten

twesigomwedavid commented 1 year ago

@muhligs Thanks. I think what I am not understanding is why you need to use conda (or a conda environment) at all? Nextflow can easily be installed by running curl -s https://get.nextflow.io | bash (see Nextflow documentation). This creates an executable nextflow file in your $PWD which you can move to your $PATH.

StellarPGx has its own Python3 installed in the container. So in principle, there should be no need to use any custom conda environment.

muhligs commented 1 year ago

I plan to use StellarPGx in a larger pipeline that has a number of specific environment dependencies; that is why I use conda (and an old habit as well, I guess). I figured out how to specify the account in nextflow.config. If this does not solve the problem, I will try your suggestion and leave out the conda installation.
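
(In case it helps others: a clusterOptions directive in nextflow.config is one way to pass the account straight through to sbatch; the account name below is just a placeholder for our actual Slurm account.)

process {
    clusterOptions = '--account=my_account'    // placeholder: your site's Slurm account
}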

Thanks for helping out on this.

Morten

twesigomwedavid commented 1 year ago

Yes, I think I would probably go about this in the reverse way i.e. trying StellarPGx on its own first and making sure it's working on your system before trying to incorporate it into a larger pipeline. That way you can see if the issue is with the dependencies in your conda environment.

If you're incorporating this as part of a larger pipeline, I think you need to find a way to make sure that the tools/versions being used are the ones in the StellarPGx Singularity container as this will prevent any reproducibility issues.

muhligs commented 1 year ago

Hi again,

Running curl -s https://get.nextflow.io | bash followed by ./nextflow run main.nf -profile slurm,test results in:

N E X T F L O W  ~  version 23.04.2
Nextflow DSL1 is no longer supported — Update your script to DSL2, or use Nextflow 22.10.x or earlier

Thanks, Morten

twesigomwedavid commented 1 year ago

@muhligs Thanks. You can install an earlier version of nextflow by getting the pre-compiled release.

Try:

wget https://github.com/nextflow-io/nextflow/releases/download/v22.10.0/nextflow
chmod 777 nextflow  (to make it executable)
./nextflow  (alternatively add the executable nextflow file to your $PATH)
muhligs commented 1 year ago

@twesigomwedavid Unfortunately the result was the same. It is odd that 7 of the jobs run correctly; it seems, then, not to be a problem with e.g. the Singularity settings.

twesigomwedavid commented 1 year ago

The earlier steps also use BCFtools and ran correctly, so I don't think the Singularity settings are the issue for just one process.

Have you tried your own data? Maybe the issue is just with the way the test sample downloaded onto your system.

The other warning that seems unusual is: WARNING: /etc/singularity/ exists, migration to apptainer by system administrator is not complete

Could you check the .command.err of the other processes which worked to see if all three warning messages are still there?

muhligs commented 1 year ago

No, good point, I will try that.

muhligs commented 1 year ago

It turned out that our HPC has Apptainer (reported as singularity version = apptainer version 1.1.3-1.el8) on the compute nodes, and it seems that this somehow causes the issue. On our front-end we have singularity version 3.8.5-2.el7, and running nextflow run main.nf -profile standard,test there is successful. nextflow run main.nf -profile slurm,test fails because of the switch in Singularity version when executing in the Slurm queue.

I would like to run it via Slurm, so @twesigomwedavid, if you have any experience with this 'apptainer' issue, please let me know. Does it require a whole different image? I guess not, since, as noted above, 7 out of 10 jobs were successful.

And again, thanks for your effort in helping out. I think we can close the issue.

Best, Morten

twesigomwedavid commented 1 year ago

I think the issue with slurm might be that those last 3 jobs are being submitted to nodes with the other singularity version. I think you can run it with slurm but you may need to specify which nodes the pipeline is allowed to run on (or exclude the ones with the singularity version that is not supported).
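
Something along these lines in nextflow.config should do it; clusterOptions is passed straight to sbatch, and the node names below are just placeholders:

process {
    clusterOptions = '--exclude=node[01-04]'    // placeholder: nodes running the unsupported apptainer version
    // or pin the jobs to specific nodes instead:
    // clusterOptions = '--nodelist=node05,node06'
}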

muhligs commented 1 year ago

Yeah, the problem is that it's all the nodes. But I am talking to the system admins about it now. Until then I'll run on the front-end. (That should keep them focused on finding a solution :) )