Open ychiat35 opened 1 year ago
Hi, thanks for contributing this.
The code looks good to me, but I don't have access to an OAR cluster to test it out, and won't have the knowledge to answer questions about OAR if users have issues.
So I'd rather have this code in a separate repository.
submitit does in fact have a plugin system that allows that. The process isn't documented because you're actually the first external user to make such a PR, but we already have a Meta-internal plugin.
The steps to follow are the following:
In setup.py, declare an entry point:

    setup(
        name="submitit_oar",
        install_requires=["submitit>=1.4.6"],
        ...
        entry_points={
            "submitit": "\n".join(
                [
                    "",
                    "executor = submitit_oar:OarExecutor",
                    "job_environment = submitit_oar:OarJobEnvironment",
                    "",
                ]
            )
        },
        zip_safe=False,
    )

pip install -e . to install your OAR plugin

import submitit; ex = submitit.AutoExecutor(cluster="oar")
If all works well, we can add an entry in the README that points to your plugin.
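To make the entry-point declaration above more concrete, here is a minimal sketch of what that text encodes: each non-empty line maps a name to a module:attribute target. How submitit itself discovers and loads plugins is not described in this thread, so parse_entry_points is a hypothetical helper written only for illustration.

```python
# Sketch only: parses the entry-point text declared in setup.py.
# `parse_entry_points` is a hypothetical helper, not part of submitit.

ENTRY_POINT_TEXT = "\n".join(
    [
        "",
        "executor = submitit_oar:OarExecutor",
        "job_environment = submitit_oar:OarJobEnvironment",
        "",
    ]
)


def parse_entry_points(text):
    """Map each entry-point name to a (module, attribute) pair."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # the leading and trailing "" entries yield blank lines
        name, _, target = line.partition("=")
        module, _, attribute = target.strip().partition(":")
        parsed[name.strip()] = (module, attribute)
    return parsed


plugins = parse_entry_points(ENTRY_POINT_TEXT)
```

Under this reading, the plugin package exposes two objects under the "submitit" group: an executor class and a job environment class, both importable from the submitit_oar module.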
Hello, thanks for your review and your proposal about the plugin. Here is the repository: https://github.com/ychiat35/submitit_oar. I will try to add some CI/CD actions for tests and package releases.
About this point:
The code looks good to me, but I don't have access to an OAR cluster to test it out, and won't have the knowledge to answer questions about OAR if users have issues.
Have you thought about adding CI tests for OAR (and Slurm), similar to what is done for Slurm and SGE clusters in the Dask-jobqueue repository: https://github.com/dask/dask-jobqueue/blob/main/ci/slurm/docker-compose.yml ? It might be a good way to test real jobs launched on OAR/Slurm clusters.
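As a rough illustration of how such CI tests could coexist with the regular unit tests, integration tests can be skipped whenever the scheduler's submission command is missing. This is a sketch under assumptions: scheduler_available and OarIntegrationTest are hypothetical names, and the test body is a placeholder rather than a real job submission.

```python
import shutil
import unittest


def scheduler_available(command):
    """Return True if the scheduler's submission command is on PATH."""
    return shutil.which(command) is not None


# Skipped automatically on machines (or CI images) without OAR installed.
@unittest.skipUnless(scheduler_available("oarsub"), "oarsub not found on PATH")
class OarIntegrationTest(unittest.TestCase):
    def test_submission_command_is_installed(self):
        # Placeholder check; a real test would submit a job and poll its state.
        self.assertIsNotNone(shutil.which("oarsub"))
```

Inside a containerized cluster like the Dask-jobqueue setup linked above, the command would be present and the test would run; elsewhere it would be reported as skipped.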
Hello,
We'd like to inform you that we have successfully integrated the submitit_oar plugin into the Grid5000 repositories, at this link: Grid5000/submitit_oar. Additionally, we have released a new version of the plugin on PyPI, accessible here: submitit_oar 1.1.1.
The integration of the submitit_oar plugin has been smooth, and it aligns seamlessly with Submitit's plugin system.
To finalize the pull request, we'd like to confirm that you're still fine with us submitting a PR to update the README to mention our plugin.
Thanks a lot for your feedback.
The OAR scheduler is widely used in France, including on mesocentre supercomputers (e.g., GRICAD), INRIA supercomputers, the Grid5000 testbed and other platforms.
This PR adds support for the OAR scheduler as a plugin. Four main classes have been implemented in oar.py (following the previous implementation made for Slurm):
[...] oarstat command (similar to the sinfo command on the Slurm scheduler).
Unit tests were created in test_oar.py and test_auto.py to ensure that the OAR plugin offers the same basic functionalities as the Slurm plugin.
A few notes about the implementation:
[...] _equivalence_dict dictionary). Additional OAR parameters can be set with the additional_parameters dictionary.
The _make_submission_command method in the OarExecutor class is overridden from PicklingExecutor. The content of the file is read and the job is submitted using the OAR "inline command" instead of using the submission file.
[...] scontrol (i.e., oarsub) is not available on nodes. To automatically requeue the job after preemption, the original job must be submitted with the idempotent type and exit with code 99.
Our OAR plugin covers most submitit features (e.g., job submission, checkpointing, job arrays). The only feature we did not address is task submission: unlike Slurm, OAR does not provide such a feature. We believe a workaround could be implemented in another iteration. Meanwhile, we raise a "NotImplemented" error if a user attempts to use this feature.
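The requeue note above can be sketched as follows: the job is submitted with the idempotent type, and the job's process exits with code 99 when it wants to be resubmitted. This is an illustration under assumptions, not the plugin's actual code: build_oarsub_command, run_with_requeue, and RequeueRequested are hypothetical names, and the only OAR-specific pieces used are the oarsub -t idempotent option and the exit code 99 mentioned in the description.

```python
# Hypothetical helpers illustrating the idempotent/exit-99 convention.

OAR_REQUEUE_EXIT_CODE = 99


class RequeueRequested(Exception):
    """Raised by a task that was interrupted and wants to be resubmitted."""


def build_oarsub_command(inline_command, idempotent=True):
    """Build an oarsub argument list that submits an inline command."""
    command = ["oarsub"]
    if idempotent:
        # Idempotent jobs are eligible for automatic resubmission.
        command += ["-t", "idempotent"]
    command.append(inline_command)
    return command


def run_with_requeue(task):
    """Run the task and return the exit code the job process should use."""
    try:
        task()
    except RequeueRequested:
        return OAR_REQUEUE_EXIT_CODE
    return 0


def interrupted_task():
    # Stands in for a job that saved a checkpoint after being preempted.
    raise RequeueRequested()


submit_args = build_oarsub_command("python train.py")
exit_code = run_with_requeue(interrupted_task)
```

In a real job script, the returned value would be passed to sys.exit so that the scheduler sees the exit code.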