Closed. unkcpz closed this pull request 4 months ago.
I want to merge the PR first so I can keep working on the multi-node support needed by Timo, and on using this in the demo server deployment. It will not affect current users who install the package from PyPI; we can decide when to release a new version after checking that everything is ready. On my side, I have tested it and run millions of jobs on Eiger and Daint and on the demo server. Let me know if you don't agree @mbercx.
Keeping it open too long also means it will need to be reworked after https://github.com/aiidateam/aiida-core/pull/6043.
Might want to split up the changes into multiple commits, but I would understand if you don't want to go through the hassle. Also fine to squash and merge and simply have a single commit that describes all the changes.
Thanks @mbercx!!
> Also fine to squash and merge and simply have a single commit that describes all the changes.
I'll rebase to fewer commits, reword the commit messages a bit, and do a rebase merge. I did one rebase before and tried to keep every commit as independent as possible.
This PR has been open for a while because I use the branch to test the lightweight-scheduler integration on the demo server. The PR bundles a bunch of things, including running the tests against `hq` using the fixture from the hyperqueue repo.

The major change I made in terms of resource settings is that I didn't use `num_mpiprocs`, and I renamed `num_cores` -> `num_cpus` and `memory_Mb` -> `memory_mb`. The reason is that I think this kind of "meta-scheduler" for task farming should not inherit its resources from either `ParEnvJobResource` (as SGE-type schedulers do) or `NodeNumberJobResource`. When we use hyperqueue for task farming, or as a lightweight scheduler on a local machine, we only set the number of CPUs and the amount of memory to allocate for each job. The multi-node support of hyperqueue is still experimental and, as far as I can tell, will not cover our use case. But this is a point worth discussing; I'm looking forward to your opinions @giovannipizzi @mbercx.
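For illustration only (my own sketch, not code from this PR): with the renamed keys, specifying resources for a job run through the HyperQueue scheduler on the AiiDA side would look roughly like the snippet below. The code/computer labels and the values are made up, and I'm assuming the plugin accepts exactly these two keys.

```python
# A minimal sketch, assuming the HyperQueue scheduler plugin accepts only
# `num_cpus` and `memory_mb` in the resources dict (no `num_mpiprocs`, no node count).
from aiida import orm
from aiida.engine import submit

code = orm.load_code("pw@hq-demo")  # hypothetical code/computer labels
builder = code.get_builder()
builder.metadata.options.resources = {
    "num_cpus": 4,       # CPUs to allocate for this job (was `num_cores`)
    "memory_mb": 2048,   # memory per job in MB (was `memory_Mb`)
}
# ... the calculation-specific inputs would go here ...
submit(builder)
```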
**Issues:**

- `OSError: Failure`
- Set `HQ_SERVER_DIR` explicitly, to distinguish multiple servers (see https://github.com/It4innovations/hyperqueue/issues/719); a sketch follows this list.
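A small sketch of the second point, assuming the `HQ_SERVER_DIR` environment variable and the `hq server start` command; the directory path is hypothetical and this is not how the plugin itself does it.

```python
# A minimal sketch: give each HQ server its own HQ_SERVER_DIR so that several
# servers on the same machine do not interfere with each other.
import os
import subprocess

server_dir = os.path.expanduser("~/.hq-demo-server")  # hypothetical location
env = dict(os.environ, HQ_SERVER_DIR=server_dir)

# Start the server with the dedicated directory; every later `hq` call that
# should talk to this server must carry the same HQ_SERVER_DIR.
server = subprocess.Popen(["hq", "server", "start"], env=env)
```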
**Must have features:**

- Use `NodeNumberJobResource` as parent and provide an option for the use case on LUMI that will require the multi-node functionality of HQ (see the sketch after this list).
- When `-N` is passed to `alloc`, the group name should always be exclusive. We don't want HQ to mess around and end up with many unbalanced jobs on different compute nodes.
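To make the first point concrete, here is a rough illustration (not something implemented in this PR) of what a multi-node resource class inheriting from aiida-core's `NodeNumberJobResource` could look like; the class name and the numbers are hypothetical.

```python
# A minimal sketch, assuming a future multi-node resource type would use
# aiida-core's NodeNumberJobResource as its parent; names here are hypothetical.
from aiida.schedulers.datastructures import NodeNumberJobResource


class HyperQueueMultiNodeJobResource(NodeNumberJobResource):
    """Hypothetical resource class for HQ jobs spanning whole compute nodes."""


# The kind of resource specification such a class would validate, e.g. for LUMI:
resources = HyperQueueMultiNodeJobResource(
    num_machines=2,                # whole nodes the HQ job should span
    num_mpiprocs_per_machine=128,  # ranks per node, e.g. a full LUMI-C node
)
print(resources.num_machines, resources.num_mpiprocs_per_machine)
```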