deepmodeling / dpdispatcher

Generate job input scripts for HPC scheduler systems, submit them to HPC systems, and poke until they finish
https://docs.deepmodeling.com/projects/dpdispatcher/
GNU Lesser General Public License v3.0
42 stars 56 forks

Implementation for jump connection #426

Closed: scott-5 closed this issue 1 month ago

scott-5 commented 7 months ago

Can DPDispatcher implement a jump connection? If not, it would be a good feature to add.

A possible application scenario: a program (like dpgen) runs on a cluster where only the master node can connect to other resources, but the master node does not have enough memory to keep the program running. If the program is instead submitted to a computing node, it cannot distribute tasks to other resources, because computing nodes generally cannot connect to the network.

Therefore, if the master node can be used as a stepping stone for the computing node, the problem will be solved.
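To make the stepping-stone idea concrete, here is a minimal sketch of one way to reach a remote resource through the master node, assuming OpenSSH is available on the computing node. It only builds a command line that uses OpenSSH's `-J` (ProxyJump) option. The function name `build_jump_ssh_command` and the host names are hypothetical and chosen for illustration, and none of this is DPDispatcher's own API.

```python
import shlex


def build_jump_ssh_command(target, jump_host, remote_command, port=22):
    """Return an argv list that runs remote_command on target via jump_host.

    Hypothetical helper for illustration only; not part of DPDispatcher.
    """
    return [
        "ssh",
        "-J", jump_host,   # OpenSSH ProxyJump: hop through the master node first
        "-p", str(port),   # port on the final target, not on the jump host
        target,
        remote_command,
    ]


# Hypothetical hosts: the computing node reaches a remote cluster via the master node.
cmd = build_jump_ssh_command(
    "user@remote-cluster", "user@master-node", "squeue -u user"
)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

The argv list can be passed to `subprocess.run` without going through a shell, so the remote command needs no extra local quoting.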

felix5572 commented 2 months ago

@scott-5 It seems strange that the master node does not have enough memory to keep the program running. The dpgen and dpdispatcher processes usually consume only a little memory. (If you do hit an OOM, I would rather 1. check the implementation of this software, 2. split the tasks into pieces, and 3. use a better login node.)

njzjz commented 1 month ago

I think @thangckt has supported this feature in #475.

scott-5 commented 1 month ago

@scott-5 It seems strange that the master node does not have enough memory to keep the program running. The dpgen and dpdispatcher processes usually consume only a little memory. (If you do hit an OOM, I would rather 1. check the implementation of this software, 2. split the tasks into pieces, and 3. use a better login node.)

Thanks for your advice. I'll check again.

scott-5 commented 1 month ago

I think @thangckt has supported this feature in #475.

It seems feasible. I'll close this issue unless something unexpected comes up.