Snakemake-Profiles / lsf

Snakemake profile for running jobs on an LSF cluster
MIT License

Updating WAIT_BETWEEN_TRIES to 0.001 and TRY_TIMES to 1 in lsf_status.py #6

Closed leoisl closed 4 years ago

leoisl commented 4 years ago

This speeds up job submission considerably (from ~20 jobs running simultaneously to ~200). This seems to be essential: bjobs seems to fail only on large pipelines, and with only ~20 jobs running simultaneously such pipelines take a very long time to complete.

It seems that the job submission rate is somehow correlated with the time snakemake spends checking job status. With the default values (WAIT_BETWEEN_TRIES=5 and TRY_TIMES=3), when bjobs fails the script spends at least 15 seconds (3 tries of 5 seconds each) plus the time to check the LSF log. With the new parameters, it spends 0.001 seconds plus the time to check the LSF log.
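To make the arithmetic concrete, here is a minimal sketch of a retry loop of this shape. It is not the actual code in lsf_status.py: the function name `query_status_with_retries` and the `query` and `sleep` parameters are hypothetical, and it assumes the script sleeps once after every failed bjobs attempt, which is what the 15 second figure implies.

```python
import time
from typing import Callable, Optional


def query_status_with_retries(
    query: Callable[[], Optional[str]],
    wait_between_tries: float,
    try_times: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `query` up to `try_times` times, sleeping after each failure.

    `query` is a hypothetical stand-in for a bjobs call; it returns None
    when bjobs has no information about the job.
    """
    for _ in range(try_times):
        status = query()
        if status is not None:
            return status
        # Assumption: the wait happens after every failed attempt,
        # including the last one.
        sleep(wait_between_tries)
    return None  # the caller would now fall back to the LSF log


def total_wait_when_bjobs_always_fails(wait_between_tries: float, try_times: int) -> float:
    """Sum the sleeps requested when every attempt fails, without really sleeping."""
    sleeps = []
    query_status_with_retries(lambda: None, wait_between_tries, try_times, sleep=sleeps.append)
    return sum(sleeps)
```

Under these assumptions, `total_wait_when_bjobs_always_fails(5, 3)` adds up to 15 seconds of waiting per failed status check, while `total_wait_when_bjobs_always_fails(0.001, 1)` adds up to 0.001 seconds.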

I also have a guess that, with the default values of WAIT_BETWEEN_TRIES and TRY_TIMES, once bjobs fails for one job it starts to fail for many of the jobs that follow. If bjobs fails for a first job, the snakemake status check for that job takes a long time (15 seconds + time to check the LSF log). In the meantime, many other jobs have completed and may have been dropped from LSF's recent history. As bjobs fails for more jobs, I guess the delay between a job completing and snakemake checking its status keeps growing, until bjobs fails for the whole pipeline. In one pipeline execution, bjobs was failing for all jobs, and after the defaults were changed to WAIT_BETWEEN_TRIES=0.001 and TRY_TIMES=1, bjobs started working again after some time.

Also, from what I have seen, if the first try does not work, the second and third very probably do not work either (I never saw them work, but I did not look at many cases). That is, bjobs failing is almost always due to the job having been dropped from LSF's recent history, in which case retrying cannot help. It seems that once the first try fails, it is better to go and check the log directly.
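The "one try, then go to the log" idea can be sketched as follows. This is only an illustration, not the profile's implementation: `job_status`, `status_from_bjobs` and `status_from_log` are hypothetical names, and the status strings are placeholders.

```python
from typing import Callable, Optional


def job_status(
    status_from_bjobs: Callable[[], Optional[str]],
    status_from_log: Callable[[], str],
) -> str:
    """Ask bjobs once; if it knows nothing about the job, read the LSF log.

    Both callables are hypothetical stand-ins: the first returns None when
    bjobs has no record of the job, the second parses the job's LSF log.
    """
    status = status_from_bjobs()
    if status is not None:
        return status
    # Assumption from the discussion above: a failed first try almost always
    # means the job left LSF's recent history, so retrying would not help.
    return status_from_log()
```

For example, `job_status(lambda: None, lambda: "success")` skips any retry and returns whatever the log lookup reports.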

The hope with this PR is that if bjobs fails for a job, snakemake does not spend a long time checking its status, so the delay between a subsequent job completing and snakemake checking its status does not snowball.

Sorry that most of what I said is guesswork based on my own use of this profile; this is hard to reproduce.