YeoLab / merge_peaks

Pipeline for using IDR to produce a set of peaks given two replicate eCLIP peaks

Failed Jobs #10

Open Sanat-Mishra opened 3 years ago

Sanat-Mishra commented 3 years ago

Hi,

On executing the YAML job file, all 7 jobs fail with the following error message:

toil.leader.FailedJobsException: The job store 'file:/scratch/sanat.mishra/Thesis/eclip/merge_peaks/examples/AARS/.tmp/cwltoil_jobstore' contains 7 failed jobs: 'CWLWorkflow' kind-CWLWorkflow/instance-07ufg5ik, 'CWLWorkflow' kind-CWLWorkflow/instance-te8ikfkj, 'CWLJob' samtools view kind-CWLJob/instance-mlvmb13t, 'CWLJob' samtools view kind-CWLJob/instance-ise3vpe9, 'CWLWorkflow' kind-CWLWorkflow/instance-2zfokdt3, 'CWLJob' samtools view kind-CWLJob/instance-b_j5t4td, 'CWLJob' samtools view kind-CWLJob/instance-csv4vy5v

Additionally, the following is included for each job:

[2021-08-26T20:43:08+0200] [MainThread] [W] [toil.leader] Job 'CWLJob' samtools view kind-CWLJob/instance-b_j5t4td is completely failed
[2021-08-26T20:43:08+0200] [MainThread] [W] [toil.leader] Job failed with exit value 127: 'CWLJob' samtools view kind-CWLJob/instance-csv4vy5v
Exit reason: None
[2021-08-26T20:43:08+0200] [MainThread] [W] [toil.leader] No log file is present, despite job failing: 'CWLJob' samtools view kind-CWLJob/instance-csv4vy5v
[2021-08-26T20:43:08+0200] [MainThread] [W] [toil.job] Due to failure we are reducing the remaining try count of job 'CWLJob' samtools view kind-CWLJob/instance-csv4vy5v with ID kind-CWLJob/instance-csv4vy5v to 0

What's going wrong? Is there an issue with samtools?
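(For reference, exit value 127 from a shell means "command not found", so one possibility is that samtools isn't visible on PATH in the shell the worker spawns. A quick sanity check, sketched under the assumption that samtools should come from the active conda environment:)

```bash
# Run this in the same shell/environment that launches the pipeline.
which samtools && samtools --version   # should resolve and print a version
echo "$PATH"                           # the environment's bin/ directory should appear here
```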

byee4 commented 3 years ago

It might be. Can you try re-running the pipeline serially, using the CWL reference runner (cwltool) instead of cwltoil? It should just be a modification to this line (change it to cwltool):

https://github.com/YeoLab/merge_peaks/blob/18933d4d4b00e97a8a0d155abbebad1fdbc254aa/wf/eCLIP_merge_peaks#L20
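For reference, a minimal sketch of that edit; the exact contents of line 20 may differ, so the flags and filename here are illustrative rather than a copy of the real file:

```bash
# wf/eCLIP_merge_peaks, around line 20 (illustrative, not the actual file contents)
# Before: Toil's CWL runner, which can dispatch jobs to a scheduler
# cwltoil eCLIP_merge_peaks.cwl "$@"
# After: the CWL reference runner, which executes everything serially on this node
cwltool eCLIP_merge_peaks.cwl "$@"
```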

Sanat-Mishra commented 3 years ago

But I'm executing on a Slurm scheduler; I hope that will still work.

byee4 commented 3 years ago

Yes, the reference runner should work, but your job will only run on the current node. This is just to make sure the issues aren't related to your scheduler first.

Sanat-Mishra commented 3 years ago

Hi,

It turns out that the tmp directory the logs were written to was not accessible, so I explicitly defined one and found that the actual error is: /var/spool/slurm/d/job8920124/slurm_script: line 4: _toil_worker: command not found
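(For context, exit code 127 here again means "command not found": the batch script Slurm generates cannot see `_toil_worker` on its PATH. A sketch of checks that narrow this down; the env path below is hypothetical:)

```bash
# On the submit node, in the environment used to launch the pipeline:
which _toil_worker                    # should point into the env's bin/ directory
# Check whether a shell spawned by Slurm sees the same PATH:
srun bash -lc 'which _toil_worker || echo "not on PATH on the compute node"'
# If missing, expose the env's bin/ to batch shells before submitting, e.g.:
export PATH="$HOME/miniconda3/envs/eclip/bin:$PATH"   # hypothetical env location
```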

I tried to find out why this occurs; someone suggested installing Toil in a Python virtualenv instead of a conda env. Unfortunately, that did not work either. Any leads on this issue?

Thanks.