clicumu / doepipeline

A python package for optimizing processing pipelines using statistical design of experiments (DoE).
MIT License

Slurm error: PipelineRunFailed #6

Closed druvus closed 8 years ago

druvus commented 8 years ago

I am having trouble getting my pipeline to work correctly using Slurm. The same pipeline works nicely using the local executor in serial mode. I am running on Uppmax (/proj/nobackup/b2015353/scaffolding/) with those files.

The output indicates that the job failed but it seems that it finished correctly.

Andreas-MBP-6:scaffolding_optimization andreassjodin$ python links_execute_1.py
/Users/andreassjodin/anaconda/lib/python3.5/site-packages/pyDOE-0.3.8-py3.5.egg/pyDOE/doe_factorial.py:78: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
design:    KMER  DVALUE
0  15.0  1000.0
1  25.0  1000.0
2  15.0  4000.0
3  25.0  4000.0
4  15.0  2500.0
5  25.0  2500.0
6  20.0  1000.0
7  20.0  4000.0
8  20.0  2500.0
LINKSScaffolder_exp_7 has failed. (exit code 127:0)
Traceback (most recent call last):
  File "links_execute_1.py", line 19, in <module>
    results = executor.run_pipeline_collection(pipeline)
  File "/Users/andreassjodin/anaconda/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/base.py", line 199, in run_pipeline_collection
    self.run_jobs(job_steps, experiment_index, env_variables, **kwargs)
  File "/Users/andreassjodin/anaconda/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/mixins.py", line 349, in run_jobs
    self.wait_until_current_jobs_are_finished()
  File "/Users/andreassjodin/anaconda/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/base.py", line 246, in wait_until_current_jobs_are_finished
    raise PipelineRunFailed(msg)
doepipeline.executor.base.PipelineRunFailed: LINKSScaffolder_exp_7 has failed. (exit code 127:0)

I am not sure what I did wrong, so I would be grateful for advice on how to fix it.

RickardSjogren commented 8 years ago

PipelineRunFailed is raised when poll_jobs of the current executor returns JOB_FAILED. SlurmExecutorMixin.poll_jobs parses the output of the SLURM command sacct -j <job_id> and reports a failure if the status is simply failed (see here), or if the job has exited when not running on SLURM (see here).

In this case it is SLURM that reported the job as failed for some reason. Do you have any logs from the SLURM run that might help narrow down where and why it failed? You could also inspect the shell file that is executed at each step to see whether it was created correctly. I think there should be a file named something like <stepname>_exp_<experiment name/number>.sh in your working directory.
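
For readers following along, here is a minimal sketch of the kind of status check described above. It is not doepipeline's actual poll_jobs code. The function names, the status constants and the exact sacct flags (--parsable2 --noheader --format=JobID,State,ExitCode) are assumptions chosen for illustration. The ExitCode field has the form <exit code>:<signal>, which matches the "127:0" in the error message.

```python
# Illustrative sketch only; NOT doepipeline's real implementation.
# Assumes sacct was invoked as:
#   sacct -j <job_id> --parsable2 --noheader --format=JobID,State,ExitCode
# which prints one pipe-separated line per job step.

# Hypothetical status constants (names chosen for this example).
JOB_FINISHED = "finished"
JOB_RUNNING = "running"
JOB_FAILED = "failed"

RUNNING_STATES = {"PENDING", "RUNNING", "COMPLETING"}


def parse_sacct(output):
    """Return {job_step_id: (state, exit_code, signal)} from sacct text."""
    jobs = {}
    for line in output.strip().splitlines():
        job_id, state, exit_code = line.split("|")
        # The state can carry a suffix, e.g. "CANCELLED by 1234".
        state = state.split()[0]
        code, signal = (int(part) for part in exit_code.split(":"))
        jobs[job_id] = (state, code, signal)
    return jobs


def classify(state):
    """Map a SLURM job state onto one of the three statuses above."""
    if state == "COMPLETED":
        return JOB_FINISHED
    if state in RUNNING_STATES:
        return JOB_RUNNING
    # FAILED, CANCELLED, TIMEOUT, ... are all treated as failures here.
    return JOB_FAILED


# Made-up sacct output for a job whose script exited with code 127.
sample = "12345|FAILED|127:0\n12345.batch|FAILED|127:0\n"
parsed = parse_sacct(sample)
status = classify(parsed["12345"][0])
```

With this sample, status is the failure constant even if the script's own work completed, because SLURM judges the job by its exit code.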

druvus commented 8 years ago

You are right. It is not a doepipeline issue. The script is running successfully but slurm thinks it is failing. I will close and try to understand why it fails.

druvus commented 8 years ago

The problem was that the LINKS author used "die and exit 1" code in the Perl script even when the script finished successfully. The scaffolding test case runs nicely at UPPMAX after updating the LINKS Perl code.
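
The effect can be reproduced without SLURM or LINKS. The sketch below uses a small stand-in Python child process (not the LINKS Perl script) that prints its result and then exits with status 1. The helper name run_step and the printed message are made up for this example.

```python
import subprocess
import sys


def run_step(code):
    """Run a Python snippet as a child process; return (stdout, exit code)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    return proc.stdout.strip(), proc.returncode


# Stand-in for a tool that does its work but then exits with status 1.
ok_but_nonzero = "print('scaffolding done'); raise SystemExit(1)"
# The same tool after the exit status has been corrected.
clean = "print('scaffolding done'); raise SystemExit(0)"

out_bad, rc_bad = run_step(ok_but_nonzero)
out_good, rc_good = run_step(clean)

# Both runs produce the same output; only the exit status differs, and
# the exit status is what a scheduler uses to decide whether a job failed.
print(out_bad, rc_bad)
print(out_good, rc_good)
```

Any caller that relies on the exit status, whether a scheduler or a pipeline executor, will treat the first run as a failure even though its output is complete.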