Snakemake-Profiles / lsf

Snakemake profile for running jobs on an LSF cluster
MIT License

Change `max_status_checks` to a number > 1 #48

Closed kpj closed 2 years ago

kpj commented 2 years ago

`lsf_status.py` is sometimes unable to retrieve the status of a running job and fails with:

```
[Predicted exception] Error calling bjobs: Command 'bjobs -o 'stat' -noheader 212651182' returned non-zero exit status 255.
```

By default, the check is attempted only once, and the failure then makes Snakemake crash. Changing the default value from 1 to, e.g., 10 in the line https://github.com/Snakemake-Profiles/lsf/blob/34c3c4c462d3a2070643a00033815f30bfd105e0/%7B%7Bcookiecutter.profile_name%7D%7D/lsf_status.py#L52 fixed this issue for me.
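For context, the retry mechanism in question boils down to something like the sketch below (simplified, not the profile's actual code; the parameter names follow the ones discussed in this thread):

```python
import shlex
import subprocess
import time


def query_bjobs_status(
    jobid: str, max_status_checks: int = 1, wait_between_tries: float = 0.001
) -> str:
    """Ask bjobs for the job's STAT column, retrying on failure.

    With max_status_checks=1 (the current default), a single bjobs failure
    immediately falls through to the caller's log-based fallback.
    """
    cmd = f"bjobs -o 'stat' -noheader {jobid}"
    last_error = None
    for attempt in range(1, max_status_checks + 1):
        try:
            output = subprocess.check_output(shlex.split(cmd), text=True)
            if output.strip():
                return output.strip()
        except subprocess.CalledProcessError as error:
            last_error = error
        if attempt < max_status_checks:
            time.sleep(wait_between_tries)
    raise RuntimeError(f"bjobs failed {max_status_checks} time(s): {last_error}")
```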

Is there a reason why this is set to 1 by default although the retry mechanism exists? How about setting it to a higher number or making it configurable via the cookiecutter template?

mbhall88 commented 2 years ago

@leoisl it might be good to expose this parameter when setting up the profile? We could add it to #43.

leoisl commented 2 years ago

Hello,

Could you please detail the error that makes Snakemake crash? We expect bjobs to fail sometimes, and in that case, as a backup, we look at the job's log to see what happened to it. This has worked for us so far, but I think you are hitting a case we have not seen before, where Snakemake crashes?

These default values were set in the past based on my experience with the EBI cluster. Before this change, the defaults were `WAIT_BETWEEN_TRIES: float = 5` and `TRY_TIMES: int = 3`. This made job submission much slower (roughly ~10x), and if the first bjobs command did not work, the retries almost never worked either (in my limited experience). See this commit message.

IIRC, Snakemake checks and submits jobs serially (not in parallel). With a single try (the current default), job checking is fast: we run bjobs, and if it fails we look at the log and we are done. With multiple tries, for example the old default of 3 tries with a 5-second wait between them, job checking becomes very slow whenever bjobs fails: effectively 10 seconds or more per check (due to the two 5-second waits). This slows down the whole pipeline execution, as the rate of job submission drops a lot due to slow job checking.

However, this is pipeline-dependent: I was running a pipeline with thousands of short jobs, so this slowdown was significant. I also have no idea if this is still true, as Snakemake has improved a lot since then, as have our cluster and LSF. And I don't think it is a good idea to set unconfigurable defaults based on my experience, since it seems my experience does not generalise. So I totally agree with this change, but we need to be careful when setting these values.
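To make the overhead concrete: with `tries` attempts and `wait` seconds between them, a status check whose bjobs calls all fail spends roughly `(tries - 1) * wait` extra seconds waiting. A tiny illustration (hypothetical helper, not part of the profile):

```python
def extra_wait_per_failing_check(tries: int, wait_seconds: float) -> float:
    """Worst-case extra wall-clock time spent waiting between bjobs retries
    when every attempt fails (i.e. the waits between consecutive attempts)."""
    return (tries - 1) * wait_seconds


# Old defaults: 3 tries, 5 s apart -> 10 s of extra waiting per failing check.
print(extra_wait_per_failing_check(3, 5.0))  # 10.0
# Current default: a single try -> no extra waiting.
print(extra_wait_per_failing_check(1, 5.0))  # 0.0
```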

We actually already have the `max_status_checks_per_second` parameter in the cookiecutter template: https://github.com/Snakemake-Profiles/lsf/blob/34c3c4c462d3a2070643a00033815f30bfd105e0/cookiecutter.json#L15

But we don't use it at all in the code. I'd also like to add a `wait_between_status_checks_in_seconds` parameter to the cookiecutter template, so that both are configurable. I think we should still keep the defaults at `max_status_checks_per_second=1` and `wait_between_status_checks_in_seconds=0.001`, but at least they would now be configurable. I also think we should add a warning to the README that increasing these values might slow down pipeline execution (3 tries with 5 seconds between them slowed the pipeline down by ~10x in my use case)... but for some clusters this might be a good idea (e.g. for @kpj's use case).

I'd prefer not to have this in https://github.com/Snakemake-Profiles/lsf/pull/43, as that PR is already doing two things, but I will leave the decision to you.

kpj commented 2 years ago

Thanks a lot for your thorough reply!

> Could you please detail the error that makes Snakemake crash? We expect bjobs to fail sometimes, and in that case, as a backup, we look at the job's log to see what happened to it. This has worked for us so far, but I think you are hitting a case we have not seen before, where Snakemake crashes?

Using the word "crash" was probably a poor choice on my part. What I meant to indicate was that Snakemake thinks the job (which is actually still running) has failed and thus shuts itself down while the job continues to run. This seems to happen because the call to bjobs fails and no log file is found (it has not been written yet because the job is still running).

For example, the log can look like this:

```
[..]
[Predicted exception] Error calling bjobs: Command 'bjobs -o 'stat' -noheader 200800215' returned non-zero exit status 255.
Resuming...
bjobs failed 1 times. Checking log...
Log file logs/cluster/run_method/category=C2.subcategory=None.alpha=0.4.beta=0.25.similarityfactor=1.0.ontermcount=50.siggenescaling=0.1.replicate=19.method=pareg/jobid39768_87801c7b-36c4-482c-b5d0-2d470ceb5d68.out not found
[Tue Jan 18 10:00:50 2022]
Error in rule run_method:
    jobid: 39768
    output: results/enrichments/category~C2_subcategory~None_alpha~0.4_beta~0.25_similarityfactor~1.0_ontermcount~50_siggenescaling~0.1/replicates/19/pareg/enrichment_result.csv
    cluster_jobid: 200800215 logs/cluster/run_method/category=C2.subcategory=None.alpha=0.4.beta=0.25.similarityfactor=1.0.ontermcount=50.siggenescaling=0.1.replicate=19.method=pareg/jobid39768_87801c7b-36c4-482c-b5d0-2d470ceb5d68.out

Error executing rule run_method on cluster (jobid: 39768, external: 200800215 logs/cluster/run_method/category=C2.subcategory=None.alpha=0.4.beta=0.25.similarityfactor=1.0.ontermcount=50.siggenescaling=0.1.replicate=19.method=pareg/jobid39768_87801c7b-36c4-482c-b5d0-2d470ceb5d68.out, jobscript: /cluster/work/bewi/members/kimja/projects/pareg/inst/scripts/.snakemake/tmp.kg32ienc/snakejob.run_method.39768.sh). For error details see the cluster log and the log files of the involved rule(s).
[..]
```

> We actually already have the `max_status_checks_per_second` parameter in the cookiecutter template: https://github.com/Snakemake-Profiles/lsf/blob/34c3c4c462d3a2070643a00033815f30bfd105e0/cookiecutter.json#L15. But we don't use it at all in the code. I'd also like to add a `wait_between_status_checks_in_seconds` parameter to the cookiecutter template, so that both are configurable.

I totally agree with your concerns and think it would be great to have these two parameters in cookiecutter.json.

leoisl commented 2 years ago

> Using the word "crash" was probably a poor choice on my part. What I meant to indicate was that Snakemake thinks the job (which is actually still running) has failed and thus shuts itself down while the job continues to run. This seems to happen because the call to bjobs fails and no log file is found (it has not been written yet because the job is still running).

This is interesting, and definitely a use case we did not expect and thus did not address. It seems that in our environment, whenever a job started running, the log would be created (even if empty), so we assumed the log would always be present if the job is running (or at least it was like this). As such, we actually chose to fail if the log file is not present, but your use case shows that this is not correct and needs to be fixed. I think we need a better way to track the job status when bjobs fails, without relying on the log file always being present. Thanks for the bug report; this is crucial for pipeline execution and will be fixed after the current PRs are merged.

mbhall88 commented 2 years ago

@leoisl there is also `bhist` for checking jobs that are older than the current `lsb.events` file. However, I don't think it is helpful in this particular issue.

I think, given the example in this issue, increasing the default number of tries, as well as exposing the parameter, is probably sufficient?

leoisl commented 2 years ago

> @leoisl there is also `bhist` for checking jobs that are older than the current `lsb.events` file. However, I don't think it is helpful in this particular issue.

I think we should probably call bhist when bjobs fails, before checking the log. Maybe not in this PR, but in a future one. What do you think?

> I think, given the example in this issue, increasing the default number of tries, as well as exposing the parameter, is probably sufficient?

Yeah, exposing `wait_between_status_checks_in_seconds` and `max_status_checks` is a must. I'm not sure about the default params, though, given the slowdown I saw in a previous pipeline with 3 tries and 5 seconds between them. Maybe 3 tries and 0.5 seconds, or some other small interval, is enough. I was also thinking about touching `self.outlog` just before submitting the job. That way, our assumption that the log file is always present (even if empty) once the job is submitted would always hold. This change alone would solve this issue: the LSF profile would no longer fail because the log file is missing, and once the job eventually finishes we would be able to verify that it completed successfully (either with bjobs or by looking at the log file).
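For illustration, touching the log just before submission could look roughly like this (a sketch only; `self.outlog` comes from the discussion above, while the surrounding class and method names are hypothetical):

```python
from pathlib import Path


class BsubSubmitter:
    """Minimal stand-in for the profile's submission wrapper (hypothetical)."""

    def __init__(self, outlog: str):
        self.outlog = Path(outlog)

    def touch_outlog(self) -> None:
        # Create an empty log file (and any missing parent directories) so
        # the status checker always finds the file, even if bjobs fails
        # before the job has written anything to it.
        self.outlog.parent.mkdir(parents=True, exist_ok=True)
        self.outlog.touch(exist_ok=True)
```

`touch_outlog()` would be called just before issuing the `bsub` command, so the log exists from the moment the job is submitted.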

Thus my proposal for this issue is:

  1. Expose wait_between_status_checks_in_seconds and max_status_checks;
  2. Touch log file before submitting job;
  3. Create an issue to query bhist if bjobs fails, before querying the log file;

Please tell me if you agree with this proposal or if you would suggest something else.

leoisl commented 2 years ago

Note: we should not change `max_status_checks_per_second` in the cookiecutter template: it sets the default `--max-status-checks-per-second` parameter for Snakemake. To make the parameters less ambiguous, let's add two new parameters to the cookiecutter template: `max_bjobs_tries` and `wait_between_bjobs_tries_in_seconds`.
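For example, the two values could be passed into `lsf_status.py` as command-line options rather than hard-coded constants (a sketch under assumptions; how the cookiecutter answers are actually threaded into the rendered script, and which positional arguments the real script takes, may differ):

```python
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Query the status of an LSF job")
    # The positional argument the real script takes is assumed here.
    parser.add_argument("jobid", help="LSF job id to query with bjobs")
    # Defaults would be filled in from the cookiecutter answers when the
    # profile is rendered, instead of being hard-coded in the script.
    parser.add_argument("--max-bjobs-tries", type=int, default=1)
    parser.add_argument(
        "--wait-between-bjobs-tries-in-seconds", type=float, default=0.001
    )
    return parser.parse_args()
```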

mbhall88 commented 2 years ago

> I think we should probably call bhist when bjobs fails, before checking the log. Maybe not in this PR, but in a future one. What do you think?

My experience with `bhist` is mixed. It can be very slow, as it opens all the historical `lsb.events` files in reverse chronological order, up to the last 100, so in the worst case we open 100 files. Using the log always opens at most one file. I only mentioned bhist in case the log doesn't work. However, in the example this issue was created for, bhist would also fail, as the job isn't even showing up yet.

> I'm not sure about the default params, though, given the slowdown I saw in a previous pipeline with 3 tries and 5 seconds between them. Maybe 3 tries and 0.5 seconds, or some other small interval, is enough.

OK, let's leave the defaults as they are; by exposing the params we let users make up their own minds. It might be good to document the potential slowdown you mentioned, though.

As to your three proposals:

  1. Agreed.
  2. Hmm, this seems like a bit of overhead to handle a corner case?
  3. I think if the log check fails, then use bhist, i.e. use bhist last (see the sketch below)?
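For a future PR, the fallback order discussed here (bjobs first, then the log, then bhist last) could look roughly like the sketch below. The helper names are hypothetical, and the marker strings LSF writes into job logs, as well as the `bhist` invocation, would need to be checked against the actual cluster:

```python
import subprocess
from pathlib import Path
from typing import Optional


def check_via_bjobs(jobid: str) -> Optional[str]:
    """First choice: ask bjobs directly for the STAT column."""
    out = subprocess.check_output(
        ["bjobs", "-o", "stat", "-noheader", jobid], text=True
    )
    return out.strip() or None


def check_via_log(outlog: str) -> Optional[str]:
    """Second choice: parse the job's log file, if it exists yet."""
    log = Path(outlog)
    if not log.exists():
        return None
    text = log.read_text()
    if "Successfully completed" in text:
        return "DONE"
    if "Exited with" in text:
        return "EXIT"
    return None  # the log exists but the job has not finished yet


def check_via_bhist(jobid: str) -> Optional[str]:
    """Last resort: bhist, which may read many historical lsb.events files."""
    out = subprocess.check_output(["bhist", "-l", jobid], text=True)
    return "DONE" if "Done successfully" in out else None


def job_state(jobid: str, outlog: str) -> str:
    """Try each probe in turn; if none can tell us anything, assume the job
    is still running so Snakemake keeps polling instead of failing it."""
    probes = (
        lambda: check_via_bjobs(jobid),
        lambda: check_via_log(outlog),
        lambda: check_via_bhist(jobid),
    )
    for probe in probes:
        try:
            state = probe()
        except subprocess.CalledProcessError:
            continue
        if state is not None:
            return state
    return "running"
```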