Closed: kpj closed this issue 2 years ago.
@leoisl it might be good to expose this parameter when setting up the profile? Could add it to #43
Hello,

could you please detail what is the error that makes `snakemake` crash? We expect that `bjobs` fails sometimes, and in this case, as a backup, we look at the job's log to see what happened with it. This has worked for us so far, but I think you are hitting a case where `snakemake` is crashing that we have not seen before?
These default values were set in the past due to my experience with the EBI cluster. Before this change, the defaults were `WAIT_BETWEEN_TRIES: float = 5` and `TRY_TIMES: int = 3`. This made job submission much slower (some ~10x slower), and if the first `bjobs` command did not work, the retries almost never worked either (in my limited experience). See this commit message. IIRC, `snakemake` checks and submits jobs serially (not in parallel). With a single try (which is the default now), job checking is fast: we run `bjobs`, and if it fails we look at the log and we are done. With multiple tries, for example the previous 3 tries with a 5-second wait between them, job checking becomes very slow when `bjobs` fails: effectively 10 seconds or more per check (due to two 5-second waits). This slows down the whole pipeline execution, as the rate of job submission drops a lot due to slow job checking. However, this is pipeline-dependent: I was running a pipeline with thousands of short jobs, so this slowdown was significant. I also have no idea if this is still true, as `snakemake` has improved a lot since then, as has our cluster and LSF. I also think it is not a good idea to set unconfigurable defaults based on my experience, as it seems my experience does not generalise, so I totally agree with this change, but we need to be careful when setting these values.
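To make the tradeoff concrete, here is a minimal sketch (not the profile's actual code) of a retry loop with the old defaults described above; the function name is illustrative. A persistently failing `bjobs` costs `(TRY_TIMES - 1) * WAIT_BETWEEN_TRIES` = 10 extra seconds per status check before the log-file fallback is even reached:

```python
import subprocess
import time

# Old defaults from the discussion above; in the real profile these
# lived in lsf_status.py and were not configurable.
WAIT_BETWEEN_TRIES: float = 5.0
TRY_TIMES: int = 3

def query_bjobs(jobid: str) -> str:
    """Run bjobs up to TRY_TIMES times, sleeping WAIT_BETWEEN_TRIES
    seconds between attempts. On persistent failure we raise, and the
    caller falls back to inspecting the job's log file."""
    last_error = None
    for attempt in range(TRY_TIMES):
        try:
            return subprocess.check_output(
                ["bjobs", "-o", "stat", "-noheader", jobid], text=True
            ).strip()
        except subprocess.CalledProcessError as err:
            last_error = err
            if attempt < TRY_TIMES - 1:
                time.sleep(WAIT_BETWEEN_TRIES)
    raise RuntimeError(f"bjobs failed {TRY_TIMES} times") from last_error

# Worst-case extra latency per failing check:
# (TRY_TIMES - 1) * WAIT_BETWEEN_TRIES = 10 seconds.
```

Since job checking runs serially, this per-check penalty multiplies across thousands of short jobs, which is where the ~10x pipeline slowdown came from.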
We actually already have the `max_status_checks_per_second` parameter in cookie-cutter: https://github.com/Snakemake-Profiles/lsf/blob/34c3c4c462d3a2070643a00033815f30bfd105e0/cookiecutter.json#L15

But we don't use it at all in the code. I'd like to also add a `wait_between_status_checks_in_seconds` parameter to the cookie-cutter, so both parameters are configurable. I think we should still keep the defaults as `max_status_checks_per_second=1` and `wait_between_status_checks_in_seconds=0.001` in the cookie-cutter, but at least they would now be configurable. I also think we should add a warning in the README that increasing these values might slow down pipeline execution (e.g. 3 tries with 5 seconds between them slowed my pipeline by ~10x in one use case)... but for some clusters this might be a good idea (e.g. for @kpj's use case).

I'd prefer not to have this in https://github.com/Snakemake-Profiles/lsf/pull/43, as that PR is already doing two things, but I will leave the decision to you.
Thanks a lot for your thorough reply!
> could you please detail what is the error that makes `snakemake` crash? We expect that `bjobs` fails sometimes, and in this case, as a backup, we look at the log for the job to see what happened with the job. This has worked for us so far, but I think you are hitting a case where `snakemake` is crashing that we have not seen before?
Using the word "crash" was probably a poor choice on my side. What I meant is that Snakemake thinks the job (which is actually still running) has failed and thus shuts itself down while the job continues to run. This seems to happen because the call to `bjobs` fails and no log file is found (it has not been written yet because the job is still running).
For example, the log can look like this:

```
[..]
[Predicted exception] Error calling bjobs: Command 'bjobs -o 'stat' -noheader 200800215' returned non-zero exit status 255.
Resuming...
bjobs failed 1 times. Checking log...
Log file logs/cluster/run_method/category=C2.subcategory=None.alpha=0.4.beta=0.25.similarityfactor=1.0.ontermcount=50.siggenescaling=0.1.replicate=19.method=pareg/jobid39768_87801c7b-36c4-482c-b5d0-2d470ceb5d68.out not found
[Tue Jan 18 10:00:50 2022]
Error in rule run_method:
    jobid: 39768
    output: results/enrichments/category~C2_subcategory~None_alpha~0.4_beta~0.25_similarityfactor~1.0_ontermcount~50_siggenescaling~0.1/replicates/19/pareg/enrichment_result.csv
    cluster_jobid: 200800215 logs/cluster/run_method/category=C2.subcategory=None.alpha=0.4.beta=0.25.similarityfactor=1.0.ontermcount=50.siggenescaling=0.1.replicate=19.method=pareg/jobid39768_87801c7b-36c4-482c-b5d0-2d470ceb5d68.out
Error executing rule run_method on cluster (jobid: 39768, external: 200800215 logs/cluster/run_method/category=C2.subcategory=None.alpha=0.4.beta=0.25.similarityfactor=1.0.ontermcount=50.siggenescaling=0.1.replicate=19.method=pareg/jobid39768_87801c7b-36c4-482c-b5d0-2d470ceb5d68.out, jobscript: /cluster/work/bewi/members/kimja/projects/pareg/inst/scripts/.snakemake/tmp.kg32ienc/snakejob.run_method.39768.sh). For error details see the cluster log and the log files of the involved rule(s).
[..]
```
> We actually already have the `max_status_checks_per_second` parameter in cookie-cutter: https://github.com/Snakemake-Profiles/lsf/blob/34c3c4c462d3a2070643a00033815f30bfd105e0/cookiecutter.json#L15. But we don't use it at all in the code. I'd like to also add a `wait_between_status_checks_in_seconds` to the cookie cutter, so both parameters are configurable.
I totally agree with your concerns and think it would be great to have these two parameters in `cookiecutter.json`.
> Using the word "crash" was probably a poor choice on my side. What I meant to indicate was that Snakemake thinks the job (which is actually still running) has failed and thus shuts itself down while the job is continuing to run. This seems to be the case because the call to `bjobs` fails and no log file is found (it has not been written because the job is still running).
This is interesting, and definitely a use case we did not expect and thus did not address. In our environment, whenever a job started running, the log would be created (even if empty), so we assumed the log would always be present if the job is running (or at least it was like this). As such, we chose to fail if the log file is not present, but your use case shows that this is not correct and needs to be fixed. I think we need a better way to track the job status when `bjobs` fails, without relying on the log file always being present. Thanks for the bug report; this is crucial for pipeline execution, and it will be fixed after the current PRs are merged.
@leoisl there is also `bhist` for checking jobs that are older than the current `lsb.events` file. However, I don't think it is helpful in this particular issue.
I think given the example in this issue, increasing the default number of tries, as well as exposing the parameter is probably sufficient?
> @leoisl there is also `bhist` for checking jobs that are older than the current `lsb.events` file. However I don't think it is helpful in this particular issue.
I think we should probably call `bhist` in case `bjobs` fails, before checking the log. Maybe not in this PR, but in a future one. What do you think?
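As a rough illustration of that ordering, here is a hedged sketch of such a fallback chain (the function and command options are illustrative, not the profile's actual API; the real code would parse each tool's output rather than return it raw):

```python
import subprocess

# Hypothetical status sources, tried in order: bjobs first, then bhist.
STATUS_COMMANDS = [
    ["bjobs", "-o", "stat", "-noheader"],
    ["bhist", "-l"],
]

def job_status(jobid: str, commands=STATUS_COMMANDS) -> str:
    """Try each status command in turn; if all of them fail, fall back
    to the log file (represented here by a sentinel string)."""
    for cmd in commands:
        try:
            out = subprocess.check_output(cmd + [jobid], text=True).strip()
            if out:
                return out
        except (subprocess.CalledProcessError, FileNotFoundError):
            continue  # this source failed or is unavailable; try the next
    # Final fallback: inspect the job's log file, as the profile does today.
    return "check-log"
```

The design point is simply that the log file becomes the last resort rather than the immediate reaction to a single `bjobs` failure.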
> I think given the example in this issue, increasing the default number of tries, as well as exposing the parameter, is probably sufficient?
Yeah, exposing `wait_between_status_checks_in_seconds` and `max_status_checks` is a must. IDK about the default params though, given the slowdown I had in a previous pipeline with 3 tries and 5 seconds between them. Maybe 3 tries and 0.5 seconds, or some small interval, is enough. I was also thinking about touching `self.outlog` just before submitting the job. That way, our assumption that the log file is always present (even if empty) once the job is submitted would always hold. This change alone would solve this issue: the LSF profile would no longer fail by not finding the log file, the job would eventually finish, and we would be able to verify that it completed successfully (either with `bjobs` or by looking at the log file).
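The touch-before-submit idea can be sketched as follows (a minimal illustration, not the profile's code; `submit_fn` stands in for the actual `bsub` invocation):

```python
from pathlib import Path

def submit_with_log(outlog: str, submit_fn):
    """Create an empty log file before submitting, so a later status
    check never fails merely because the log does not exist yet."""
    log = Path(outlog)
    log.parent.mkdir(parents=True, exist_ok=True)
    log.touch(exist_ok=True)  # guarantees the log exists, even if empty
    return submit_fn()        # submit only after the log is in place
```

With this, "log file not found" can no longer be misread as job failure, because the file exists from the moment of submission.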
Thus my proposal for this issue is:

1. Expose `wait_between_status_checks_in_seconds` and `max_status_checks`;
2. Call `bhist` if `bjobs` fails, before querying the log file;
3. Touch `self.outlog` just before submitting the job.

Please tell me if you agree with this proposal or if you suggest sth else.
Note: do not change `max_status_checks_per_second` in the cookie-cutter: this sets the default `--max-status-checks-per-second` parameter in `snakemake`. To make the parameters less ambiguous, let's add two parameters to the cookie-cutter: `max_bjobs_tries` and `wait_between_bjobs_tries_in_seconds`.
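Under that naming, the new cookie-cutter entries might look something like this (a sketch only; `profile_name` stands in for the other existing keys, and the defaults mirror the current single-try, near-zero-wait behaviour):

```json
{
    "profile_name": "lsf",
    "max_bjobs_tries": 1,
    "wait_between_bjobs_tries_in_seconds": 0.001
}
```

Users on clusters where `bjobs` is flaky could then raise `max_bjobs_tries` themselves, accepting the per-check latency cost discussed earlier.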
> I think we should probably call `bhist` in case `bjobs` fails, before checking the log. Maybe not in this PR, but in a future one. What do you think?
My experience with `bhist` is mixed. It can be very slow, as it opens all the historical `lsb.events` files in reverse chronological order, up to the last 100; so in the worst case we open 100 files. Using the log opens at most one file. I just mentioned `bhist` in case the log doesn't work. However, in the example this issue was created for, `bhist` will also fail, as the job isn't even showing up yet.
> IDK about the default params though, given the slowdown I had in a previous pipeline by having 3 tries and 5 seconds between them. Maybe 3 tries and 0.5 seconds or a small time interval is enough.
Ok, let's leave the defaults as they are; by exposing the params we let users make up their own minds. Might be good to document the potential slowdown you mentioned though.
As to your three proposals: for the `bhist` one - i.e. use `bhist` last?
`lsf_status.py` is sometimes unable to retrieve the status of a running job and fails with an error. By default, this is tried once, which then makes Snakemake crash. Changing the default value from 1 to, e.g., 10 in the line https://github.com/Snakemake-Profiles/lsf/blob/34c3c4c462d3a2070643a00033815f30bfd105e0/%7B%7Bcookiecutter.profile_name%7D%7D/lsf_status.py#L52 fixed this issue for me.

Is there a reason why this is set to 1 by default although the retry mechanism exists? How about setting it to a higher number or making it configurable via the cookiecutter template?