Snakemake-Profiles / lsf

Snakemake profile for running jobs on an LSF cluster
MIT License

Error submitting jobscript, bsub returns exit code 255 #58

Closed. W-L closed this issue 4 months ago.

W-L commented 1 year ago

Hey! Not quite sure if this is the right place for my issue, as I suspect it's more of a cluster issue than a problem with the profile. But maybe someone can still help. I'm getting randomly failing job submissions on pipelines that usually work fine. The tracebacks are something along the lines of:

Traceback (most recent call last):
  File "/homes/lukasw/.config/snakemake/lsf_short/lsf_submit.py", line 230, in submit
    external_job_id = self._submit_cmd_and_get_external_job_id()
  File "/homes/lukasw/.config/snakemake/lsf_short/lsf_submit.py", line 216, in _submit_cmd_and_get_external_job_id
    output_stream, error_stream = OSLayer.run_process(self.submit_cmd)
  File "/homes/lukasw/.config/snakemake/lsf_short/OSLayer.py", line 40, in run_process
    completed_process = subprocess.run(
  File "[..]/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'bsub -M 1000 -n 1 -R 'select[mem>1000] rusage[mem=1000] span[hosts=1]' -o "logs/cluster/[..]/jobid140_5dad74cf-fda6-41db-9965-ac8a0c18ec25.out" -e "logs/cluster/[..]/jobid140_5dad74cf-fda6-41db-9965-ac8a0c18ec25.err" -J "[..]" -q short [..]/.snakemake/tmp.4tc5r3ou/snakejob.core_metaquast.140.sh' returned non-zero exit status 255.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/homes/lukasw/.config/snakemake/lsf_short/lsf_submit.py", line 259, in <module>
    lsf_submit.submit()
  File "/homes/lukasw/.config/snakemake/lsf_short/lsf_submit.py", line 236, in submit
    raise BsubInvocationError(error)
__main__.BsubInvocationError: Command 'bsub -M 1000 -n 1 -R 'select[mem>1000] rusage[mem=1000] span[hosts=1]' -o "logs/cluster/[..]/jobid140_5dad74cf-fda6-41db-9965-ac8a0c18ec25.out" -e "logs/cluster/[..]/jobid140_5dad74cf-fda6-41db-9965-ac8a0c18ec25.err" -J "[..]" -q short [..]/.snakemake/tmp.4tc5r3ou/snakejob.core_metaquast.140.sh' returned non-zero exit status 255.
Error submitting jobscript (exit code 1):

So bsub returns exit code 255, which leads the profile to raise a BsubInvocationError. Since this issue appears sporadically, I am wondering if it could be caused by file system latency (which is quite high on this system), i.e. the jobscript is not yet available at the moment bsub is invoked. Would that make sense? Any ideas on how I might debug this?

Lastly, could the profile help with mitigating this, e.g. by waiting a few seconds between creating the jobscript and submitting it, or by offering an option to retry the submission after a few seconds instead of raising an error on the first failed attempt? Cheers!

leoisl commented 1 year ago

Lastly, could the profile help with mitigating this, e.g. by waiting a few seconds between creating the jobscript and submitting it, or by offering an option to retry the submission after a few seconds instead of raising an error on the first failed attempt?

Yeah, this could be implemented here: https://github.com/Snakemake-Profiles/lsf/blob/1bdd36a7041ae6952ec0278cc0200e5a78842bbe/%7B%7Bcookiecutter.profile_name%7D%7D/OSLayer.py#L39-L46

I guess we'd add a third optional parameter: the time to wait before a retry if the first invocation fails. It could also take two optional parameters, retries and timeout: the number of times to retry and the wait between attempts. The defaults would be 0 retries and 0 timeout (i.e. just try once, immediately). You'd then pass non-default values when calling bsub: https://github.com/Snakemake-Profiles/lsf/blob/1bdd36a7041ae6952ec0278cc0200e5a78842bbe/%7B%7Bcookiecutter.profile_name%7D%7D/lsf_submit.py#L216
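
For illustration, a minimal sketch of that idea could look like the following. The function name and the parameters retries and wait_between_tries are hypothetical, not the profile's actual API; leoisl's "timeout" corresponds here to the wait between attempts:

```python
import subprocess
import time
from typing import Tuple


def run_process_with_retries(
    cmd: str, retries: int = 0, wait_between_tries: float = 0.0
) -> Tuple[str, str]:
    """Run a shell command, retrying up to `retries` extra times on failure.

    With the defaults (0 retries, 0 wait) this is a single immediate
    invocation, matching the current behaviour.
    """
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            completed_process = subprocess.run(
                cmd,
                check=True,  # raises CalledProcessError on a non-zero exit code
                shell=True,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
            )
            return (
                completed_process.stdout.decode().strip(),
                completed_process.stderr.decode().strip(),
            )
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise  # out of retries: surface the original error to the caller
            time.sleep(wait_between_tries)
```

The submission code in lsf_submit.py could then call this with, say, a couple of retries and a few seconds of wait, while every other caller keeps the try-once default.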

Would you be ok with implementing this, checking whether it actually solves the issue, and possibly submitting a PR fixing it?

mbhall88 commented 1 year ago

Sorry, I accidentally merged a possible fix. Let's hope it works. Did you want to test it out @W-L? I've also been having the same issue on codon recently and it has been driving me crazy. Let's hope this is the fix.
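
For context, a fix of this kind would typically wait for the jobscript to become visible on the shared filesystem before invoking bsub. A rough sketch of that idea, with a hypothetical helper name and default timings (not necessarily what was actually merged), could look like this:

```python
import time
from pathlib import Path


def wait_for_jobscript(
    jobscript: str, max_wait: float = 60.0, poll_interval: float = 1.0
) -> None:
    """Poll until the jobscript is visible on the filesystem, or give up.

    On a shared filesystem with high latency, the script written by Snakemake
    may not yet be visible on the submission host when bsub is about to run;
    polling for it first avoids racing the filesystem.
    """
    waited = 0.0
    while not Path(jobscript).exists():
        if waited >= max_wait:
            raise FileNotFoundError(
                f"jobscript did not appear within {max_wait}s: {jobscript}"
            )
        time.sleep(poll_interval)
        waited += poll_interval
```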

W-L commented 1 year ago

Thanks for the quick replies, guys! I'm afraid my suspicion was incorrect. I had sprinkled a few Path(self.jobscript).exists() checks around yesterday and got True even for the randomly failing jobs. But then again, I'm not sure how the filesystem latency comes about or how to test for it. As a workaround, setting restart-times: in the profile seems to let my pipelines finish eventually, at least as long as this issue does not affect the same job multiple times in a row. But the logs are still littered with these tracebacks... I might do some more digging if I have time.

mbhall88 commented 1 year ago

Ah okay. Yeah, I have a big pipeline at the moment that keeps being impacted by this. I generally just set restart times to 2 and it normally gets everything finished eventually.

hozeren commented 1 year ago

I have the same issue with the code, unfortunately.

mbhall88 commented 1 year ago

Even with the current tip of master?

hozeren commented 1 year ago

Yes, but I guess the problem was on my side: I needed to specify an extra group for the HPC I use. However, I was also using a different repo as a starting point, an old version of this one.

mike2vandy commented 4 months ago

Hello, I'm getting a similar error, and it looks like specific jobs aren't submitted and no .err/.out files are produced for those rule(s). I've used the slurm profile with this snakemake workflow successfully, but I had to switch to a new HPC which uses LSF. The workflow currently requires snakemake 7. The mostly default profile parameters (changes were LSF_UNIT_FOR_LIMITS = 3, restart_times = 2, jobs = 300) worked with a simple touch test, but threw errors with the more complex workflow. I'm not really sure where to start except to provide the error. Let me know if more information is needed or which parameters are worth experimenting with.

Traceback (most recent call last):
  File "/gpfs_common/share01/stern/mwvandew/jrst_tst/jrst/Stern_5269/UU_Cfam_GSD_1.0_ROSY/lsf.go_wags/lsf_submit.py", line 242, in submit
    external_job_id = self._submit_cmd_and_get_external_job_id()
  File "/gpfs_common/share01/stern/mwvandew/jrst_tst/jrst/Stern_5269/UU_Cfam_GSD_1.0_ROSY/lsf.go_wags/lsf_submit.py", line 228, in _submit_cmd_and_get_external_job_id
    output_stream, error_stream = OSLayer.run_process(self.submit_cmd)
  File "/gpfs_common/share01/stern/mwvandew/jrst_tst/jrst/Stern_5269/UU_Cfam_GSD_1.0_ROSY/lsf.go_wags/OSLayer.py", line 40, in run_process
    completed_process = subprocess.run(
  File "/usr/local/usrapps/stern/mwvandew/conda/envs/snakemake/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'bsub -M 40 -n 4 -R 'select[mem>40] rusage[mem=40] span[hosts=1]' -W 400 -o "logs/cluster/fastqs_to_ubam/bucket=map_tst.breed=jrst.sample_name=Stern_5269.ref=UU_Cfam_GSD_1.0_ROSY.readgroup_name=Stern_5269_A/jobid13_6456f528-a963-4cc4-8e2f-ce3a3fcf6149.out" -e "logs/cluster/fastqs_to_ubam/bucket=map_tst.breed=jrst.sample_name=Stern_5269.ref=UU_Cfam_GSD_1.0_ROSY.readgroup_name=Stern_5269_A/jobid13_6456f528-a963-4cc4-8e2f-ce3a3fcf6149.err" -J "fastqs_to_ubam.bucket=map_tst.breed=jrst.sample_name=Stern_5269.ref=UU_Cfam_GSD_1.0_ROSY.readgroup_name=Stern_5269_A" -q serial /gpfs_common/share01/stern/mwvandew/jrst_tst/jrst/Stern_5269/UU_Cfam_GSD_1.0_ROSY/.snakemake/tmp.bamk4xzx/snakejob.fastqs_to_ubam.13.sh' returned non-zero exit status 255.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gpfs_common/share01/stern/mwvandew/jrst_tst/jrst/Stern_5269/UU_Cfam_GSD_1.0_ROSY/lsf.go_wags/lsf_submit.py", line 271, in <module>
    lsf_submit.submit()
  File "/gpfs_common/share01/stern/mwvandew/jrst_tst/jrst/Stern_5269/UU_Cfam_GSD_1.0_ROSY/lsf.go_wags/lsf_submit.py", line 248, in submit
    raise BsubInvocationError(error)
__main__.BsubInvocationError: Command 'bsub -M 40 -n 4 -R 'select[mem>40] rusage[mem=40] span[hosts=1]' -W 400 -o "logs/cluster/fastqs_to_ubam/bucket=map_tst.breed=jrst.sample_name=Stern_5269.ref=UU_Cfam_GSD_1.0_ROSY.readgroup_name=Stern_5269_A/jobid13_6456f528-a963-4cc4-8e2f-ce3a3fcf6149.out" -e "logs/cluster/fastqs_to_ubam/bucket=map_tst.breed=jrst.sample_name=Stern_5269.ref=UU_Cfam_GSD_1.0_ROSY.readgroup_name=Stern_5269_A/jobid13_6456f528-a963-4cc4-8e2f-ce3a3fcf6149.err" -J "fastqs_to_ubam.bucket=map_tst.breed=jrst.sample_name=Stern_5269.ref=UU_Cfam_GSD_1.0_ROSY.readgroup_name=Stern_5269_A" -q serial /gpfs_common/share01/stern/mwvandew/jrst_tst/jrst/Stern_5269/UU_Cfam_GSD_1.0_ROSY/.snakemake/tmp.bamk4xzx/snakejob.fastqs_to_ubam.13.sh' returned non-zero exit status 255.
Error submitting jobscript (exit code 1):

dlaehnemann commented 4 months ago

Just from this error message, I'm not sure what causes this. But here are three ideas of things you could check:

  1. Make sure you are using the latest version of the master branch, as suggested above. It contains this waiting mechanism for the jobscript, so maybe this helps already?
  2. Try to run snakemake with --verbose to hopefully get a more detailed error message from the traceback?
  3. Try to run the bsub command manually, maybe with some kind of dummy jobscript that you create; that way you can hopefully get a more detailed error message out of LSF (see the sketch below for one way to capture bsub's stderr).
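
As a companion to point 3, one way to see what LSF itself complains about is to re-run the failing bsub command without raising on a non-zero exit code and print its stderr. This is only a debugging sketch; the command string below is a placeholder and should be replaced with the exact bsub line from your traceback:

```python
import subprocess

# Placeholder: paste the failing bsub command from the traceback here.
cmd = 'bsub -M 40 -n 4 -q serial -o /tmp/test.out -e /tmp/test.err sleep 10'

# check=False (the default): don't raise on failure, so we can inspect the output.
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print("exit code:", result.returncode)
print("stdout:", result.stdout)
print("stderr:", result.stderr)  # LSF usually explains the rejection here
```

Running it this way (or simply pasting the bsub line into an interactive shell) should show the queue or resource error that the traceback otherwise swallows.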

mike2vandy commented 4 months ago

Thank you, I think I got it. It was an HPC-specific thing: I had set a default queue, but that queue only accepts single-processor jobs, so it threw a hissy fit on rules that asked for more than one processor. It turns out my HPC recommends not setting a specific queue name at all; LSF then chooses the best queue given the resources requested. The 3rd idea helped me find the problem. Thanks again.