QuantumChemist opened this issue 5 months ago
I have no example at hand right now, but the error I retrieved via `jf job info <id>` is something along the lines of "submission limit reached".
Hi, I am not sure if this will solve your issue, but it is possible to set a maximum number of jobs submitted to a worker: https://matgenix.github.io/jobflow-remote/user/advancedoptions.html#limiting-the-number-of-submitted-jobs. However, this will only keep track of the jobs submitted by jobflow and will not take into account those submitted manually. Would this help, or would you need something that keeps track of the total number of jobs on the cluster instead?
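For reference, a minimal sketch of such a worker configuration, assuming a project file like `~/.jfremote/myproject.yaml`; the worker name, host, and paths below are hypothetical, and per the linked docs `max_jobs` caps only what jobflow-remote itself submits:

```yaml
workers:
  my_hpc:                        # hypothetical worker name
    type: remote
    host: cluster.example.com    # hypothetical host
    scheduler_type: slurm        # assuming a SLURM scheduler
    work_dir: /scratch/user/jfr  # hypothetical remote working directory
    max_jobs: 20                 # cap on jobs jobflow-remote submits to this worker's queue
```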
Hi! I have already set this maximum number of jobs, because I have way more jobs than the limit. Unfortunately it doesn't help, so I would really appreciate a way to keep track of all the jobs on the cluster (and in a certain queue).
I see. Indeed, I suspected this could be the case. Unfortunately, at the moment there is no way of enforcing such a constraint. One of the issues is that there is no strict link between a worker and the resources used by each job. The current implementation relies on information that is all known to jobflow-remote's runner or DB. Adding this feature would thus imply introducing some ad hoc configuration parameter and relying on the user assigning jobs to the proper worker. I will think about the best way of adding it.
Oh, I see. That indeed sounds not so easy to implement. Maybe I can get it working by playing around with the `RunnerOptions` like `step_attempts` and `get_delta_retry`, etc.
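For context, a minimal sketch of what tuning these options could look like, assuming they go in the `runner` section of the project file and that `max_step_attempts` and `delta_retry` are the relevant `RunnerOptions` fields (`get_delta_retry` looks the delays up internally); the values are purely illustrative:

```yaml
runner:
  max_step_attempts: 10         # keep retrying a failed remote step up to 10 times
  delta_retry: [60, 600, 3600]  # seconds to wait before the 1st, 2nd, and later retries
```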
Hi 😀
We have a submission limit of 40 or 20 jobs (depending on the queue of our HPC cluster). When I additionally start some VASP jobs manually and that leads to reaching the limit, the submission of the next jobs from the jobflow-remote queue fails and they go into the `REMOTE_ERROR` state. I then have to retry the jobs once I'm below the limit again, which means I cannot let the jobs run overnight or over the weekend and have to constantly watch the workflow. I'm using the interactive branch. Do you have an idea how to solve this problem? I temporarily solved it with a bash line that retries all jobs in the `REMOTE_ERROR` state every few hours, but it's not really a desirable solution. Did you ever face a similar issue?
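A minimal sketch of such a periodic-retry workaround, assuming `jf job retry` accepts the same `-s/--state` filter as `jf job list` and a hypothetical project name; the exact flag and state spelling may differ, so check `jf job retry --help`:

```bash
# Hypothetical crontab entry: every 2 hours, retry all jobs currently in REMOTE_ERROR.
0 */2 * * * jf -p myproject job retry -s REMOTE_ERROR
```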