This PR stays open because I use the branch to test the demo-server lightweight-scheduler integration. It bundles a number of things, including:
- [x] Correctly support the memory setup for resources.
- [x] Support turning on hyperthreading with the latest version of HyperQueue.
- [x] Use ruff for linting.
- [x] Fix the submit bug for hq > 0.12 where resources were configured twice, in both the job script and the submit command.
- [x] Support installing hq on the remote computer over the CLI.
- [x] WIP: add unit tests that submit to a real hq instance, using the fixture from the hyperqueue repo.
The major change I made to the resource settings is that I dropped `num_mpiprocs` and renamed `num_cores` -> `num_cpus` and `memory_Mb` -> `memory_mb`.

The reason is that I think this kind of "meta-scheduler" for task farming should not inherit from either `ParEnvJobResource` (the SGE-type scheduler) or `NodeNumberJobResource`. When we use HyperQueue for task farming, or as a lightweight scheduler on a local machine, we only set the number of CPUs and the amount of memory to allocate for each job. HyperQueue's multi-node support is still experimental and, as far as I can tell, will not cover our use case. But this point is worth discussing; I'm looking forward to your opinions @giovannipizzi @mbercx
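The resource model described above can be sketched as a tiny standalone class. This is only an illustration of the proposed fields, not the actual aiida-core `JobResource` interface; the class name `HyperQueueJobResource` and the validation logic are my assumptions:

```python
# Illustrative sketch only: NOT the real aiida-core JobResource API.
# It shows the resource model proposed in this PR: a task-farming job
# carries just a CPU count and a memory request, with no node count
# and no num_mpiprocs.
from dataclasses import dataclass
from typing import Optional


@dataclass
class HyperQueueJobResource:
    """Resources for one HyperQueue job: CPUs and memory only."""

    num_cpus: int = 1                 # renamed from num_cores
    memory_mb: Optional[int] = None   # renamed from memory_Mb; None = HQ default

    def __post_init__(self) -> None:
        if not isinstance(self.num_cpus, int) or self.num_cpus < 1:
            raise ValueError("num_cpus must be a positive integer")
        if self.memory_mb is not None and self.memory_mb < 1:
            raise ValueError("memory_mb must be a positive integer")


res = HyperQueueJobResource(num_cpus=4, memory_mb=2048)
print(res.num_cpus, res.memory_mb)  # 4 2048
```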
Issues:
- [ ] If the remote binary already exists, the install cannot be overridden: it hits an SFTP error (`OSError: Failure`).
- [ ] Check whether this is a new problem since the Eiger update: the server can only be reached from the same login node.
Must have features:

- [ ] Use `NodeNumberJobResource` as the parent class and provide an option for the use case on LUMI, which will require the multi-node functionality of HQ.
How to tell `alloc` to fire workers into the same group? Every new multi-node run is assigned to a certain group, which means that if `-N` is passed to `alloc`, the group name should always be exclusive. We don't want HQ to end up with many unbalanced jobs spread over different compute nodes.
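To make the open question concrete, here is a hedged sketch of the commands involved. The flag names (`--nodes`, `--time-limit`, and passing trailing arguments to sbatch after `--`) are from my reading of the HyperQueue docs and should be double-checked; this illustrates the concern, it is not a working recipe:

```shell
# Experimental multi-node task: ask HQ for two nodes for a single task
# (the --nodes flag is assumed; verify against `hq submit --help`).
hq submit --nodes=2 -- ./run_multinode_job.sh

# Automatic allocation: arguments after `--` go to sbatch.
# If `-N 2` is used here, all workers of one allocation should end up
# in the same, exclusive group, so multi-node tasks are not spread
# over unbalanced workers from unrelated allocations.
hq alloc add slurm --time-limit 30m -- -N 2
```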
- [ ] Set `HQ_SERVER_DIR` explicitly, to distinguish multiple servers (see https://github.com/It4innovations/hyperqueue/issues/719)
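A minimal sketch of what that could look like. The `HQ_SERVER_DIR` environment variable is the one referenced in the linked issue; the exact directory names here are made up for illustration:

```shell
# Start two independent HQ servers, each with its own state directory,
# so a demo and a production server on the same login node do not collide.
HQ_SERVER_DIR=$HOME/.hq-server-demo hq server start &
HQ_SERVER_DIR=$HOME/.hq-server-prod hq server start &

# Point a client at a specific server with the same variable:
HQ_SERVER_DIR=$HOME/.hq-server-demo hq submit -- echo hello
```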