QuantumChemist opened this issue 5 months ago
I have no example at hand right now, but the error I retrieved via `jf job info <id>` is something along the lines of "submission limit reached".
Hi, I am not sure if this will solve your issue, but it is possible to set a maximum number of jobs submitted to a worker: https://matgenix.github.io/jobflow-remote/user/advancedoptions.html#limiting-the-number-of-submitted-jobs. However, this will only keep track of the jobs submitted by jobflow and will not take into account those submitted manually. Would this help, or would you need something that keeps track of the total number of jobs on the cluster instead?
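For reference, a minimal sketch of such a worker configuration, assuming a project file like `~/.jfremote/myproject.yaml`; the worker name, host, and paths below are hypothetical, and per the linked docs `max_jobs` caps only what jobflow-remote itself submits:

```yaml
workers:
  my_hpc:                        # hypothetical worker name
    type: remote
    host: cluster.example.com    # hypothetical host
    scheduler_type: slurm        # assuming a SLURM scheduler
    work_dir: /scratch/user/jfr  # hypothetical remote working directory
    max_jobs: 20                 # cap on jobs jobflow-remote submits to this worker's queue
```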
Hi! I have already set this maximum number of jobs, because I have way more jobs than the limit. Unfortunately it doesn't help, so I would really appreciate a way to keep track of all the jobs on the cluster (and in a certain queue).
I see. Indeed, I suspected this could be the case. Unfortunately, at the moment there is no way of enforcing such a constraint. One of the issues is that there is no strict link between a worker and the resources used by each job. The current implementation relies on information that is all known to jobflow-remote's runner or DB. Adding this feature would thus imply introducing some ad hoc configuration parameter and relying on the user assigning jobs to the proper worker. I will think about the best way of adding it.
Oh, I see. That indeed sounds not so easy to implement. Maybe I can get it working by playing around with the `RunnerOptions` like `step_attempts` and `get_delta_retry`, etc.
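For context, a minimal sketch of what tuning these options could look like, assuming they go in the `runner` section of the project file and that `max_step_attempts` and `delta_retry` are the relevant `RunnerOptions` fields (`get_delta_retry` looks the delays up internally); the values are purely illustrative:

```yaml
runner:
  max_step_attempts: 10         # keep retrying a failed remote step up to 10 times
  delta_retry: [60, 600, 3600]  # seconds to wait before the 1st, 2nd, and later retries
```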
Hi 😀
We have a submission limit of 40 or 20 jobs (depending on the queue of our HPC cluster). When I additionally start some VASP jobs manually and that leads to reaching the limit, the submission of the next jobs from the jobflow-remote queue fails and they go into the `REMOTE_ERROR` state. I then have to retry the jobs once I'm below the limit again, which means I cannot let the jobs run overnight or over the weekend and have to constantly watch the workflow. I'm using the interactive branch. Do you have an idea how to solve this problem? I temporarily solved it with a bash line that retries all jobs in the `REMOTE_ERROR` state every few hours, but it's not really a desirable solution. Did you ever face a similar issue?
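A minimal sketch of such a periodic-retry workaround, assuming `jf job retry` accepts the same `-s/--state` filter as `jf job list` and a hypothetical project name; the exact flag and state spelling may differ, so check `jf job retry --help`:

```bash
# Hypothetical crontab entry: every 2 hours, retry all jobs currently in REMOTE_ERROR.
0 */2 * * * jf -p myproject job retry -s REMOTE_ERROR
```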