deepmodeling / dpdispatcher

Generate job input scripts for HPC scheduler systems, submit them to HPC systems, and poll until they finish
https://docs.deepmodeling.com/projects/dpdispatcher/
GNU Lesser General Public License v3.0

Question about max parallel jobs #467

Closed Franklalalala closed 1 month ago

Franklalalala commented 2 months ago

Suppose I have limited computational resources, e.g., one node with 64 cores.

I want to execute 1000 tasks on this node; each task requires 4 cores.

This means I can run 16 jobs in parallel, and each job will hold about 63 (1000/16 = 62.5) tasks.

The problem is that when one job fails, it affects the entire batch, destroying up to 63 tasks. Besides, the batch will be retried 3 times, so the total execution time is prolonged.

Here we have the strategy parameter, which can end the submission early at the cost of a small ratio of unfinished tasks; but those unfinished tasks may include ones that were never executed, since the tasks within a job run in sequential order.

Another solution to this problem could be:

I will submit 1000 jobs, each consisting of one task. These jobs will sit in a queue that allows at most 16 parallel jobs. In this way, all jobs are iterated over, while the erroneous ones are retried 3 times without affecting the others.

The question is how to realize this feature, e.g., in the context of LazyLocalContext with batch_type Shell. Or, if there is an automated scheduler like Slurm that takes over this job, how do I trigger it on a Bohrium CPU machine?
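For reference, dpdispatcher's Resources section has a group_size parameter that controls how many tasks are packed into a single job, so setting it to 1 would give the one-task-per-job layout described above. A minimal sketch of such a resources dict follows (the surrounding values are illustrative, not a tested Bohrium/Shell setup; note that with batch_type Shell there is no queue to cap concurrency at 16 — a real scheduler such as Slurm would do that throttling):

```python
# Sketch of a dpdispatcher-style resources configuration.
# group_size=1 packs one task per job, so 1000 tasks become 1000 jobs
# and a failed task is retried without dragging ~62 siblings with it.
# Other values are illustrative assumptions for a 64-core node.
resources = {
    "number_node": 1,     # one node
    "cpu_per_node": 4,    # each task needs 4 cores
    "gpu_per_node": 0,
    "queue_name": "",     # scheduler queue; empty for a local Shell run
    "group_size": 1,      # <-- one task per job
}

jobs_needed = 1000 // resources["group_size"]
```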

njzjz commented 2 months ago

The problem is that when one job fails, it affects the entire batch, destroying up to 63 tasks. Besides, the batch will be retried 3 times, so the total execution time is prolonged.

Successful tasks will NOT be rerun.

Franklalalala commented 1 month ago

There is a parameter called retry counts in the resources section that partly addresses this problem, though it doesn't work in my case; more tests are required. See the doc. To whom it may concern: my final solution is a brute-force change to the source code in the handle-terminated section — when an error is raised, I change the job state to finished. Besides, each job contains one task, which avoids unwanted packet-loss problems.
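Outside of dpdispatcher, the one-task-per-job queue described in this thread can be sketched in plain Python: a worker pool caps concurrency at 16 (64 cores / 4 cores per task) and each task is retried independently, so one bad task never destroys a batch. This is a generic illustration, not dpdispatcher's internal mechanism; run_task is a hypothetical placeholder for the real 4-core command.

```python
import concurrent.futures

MAX_PARALLEL = 16   # at most 16 jobs in flight: 64 cores / 4 cores per task
MAX_RETRIES = 3     # erroneous tasks are retried up to 3 times

def run_task(task_id: int) -> int:
    """Placeholder for launching one 4-core task; task 7 simulates a failure."""
    if task_id == 7:
        raise RuntimeError("task failed")
    return task_id

def run_with_retries(task_id: int):
    # Retry only this task; other tasks in the pool are unaffected.
    for _attempt in range(MAX_RETRIES):
        try:
            return ("ok", run_task(task_id))
        except RuntimeError:
            continue
    return ("failed", task_id)

# Submit 20 single-task "jobs"; the pool keeps at most 16 running at once.
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    results = list(pool.map(run_with_retries, range(20)))

failed = [tid for status, tid in results if status == "failed"]
```

Only the permanently failing task ends up in `failed`; the other 19 complete normally.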