facebookincubator / submitit

Python 3.8+ toolbox for submitting jobs to Slurm
MIT License
1.3k stars, 125 forks

Add support for OAR Scheduler #1744

Open ychiat35 opened 1 year ago

ychiat35 commented 1 year ago

The OAR scheduler is widely used in France, including on mesocentre supercomputers (e.g., GRICAD), INRIA supercomputers, the Grid5000 testbed and other platforms.

This PR adds support for the OAR scheduler as a plugin. Four main classes have been implemented in oar.py (following the existing implementation for Slurm):

Unit tests were created in test_oar.py and test_auto.py to ensure that the OAR plugin offers the same basic functionalities as the Slurm plugin.

A few notes about the implementation:

Our OAR plugin covers most of submitit's features (e.g., job submission, checkpointing, job arrays). The only feature that we did not address is task submission, because, unlike Slurm, OAR does not provide such a feature. We believe a workaround could be implemented in another iteration. Meanwhile, we raise a "NotImplemented" error if a user attempts to use this feature.
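For illustration only, the "fail loudly on an unsupported feature" behaviour described above could look like the sketch below. The class `SketchExecutor` is hypothetical and is not taken from oar.py. Treating `tasks_per_node` as the parameter that requests multiple tasks is also an assumption.

```python
class SketchExecutor:
    """Hypothetical stand-in for a scheduler plugin; not the real OAR executor."""

    def __init__(self) -> None:
        self.parameters = {}

    def update_parameters(self, **kwargs) -> None:
        # Assumption: more than one task per node is the feature that the
        # scheduler cannot provide, so it is rejected before anything is stored.
        tasks = kwargs.get("tasks_per_node", 1)
        if tasks != 1:
            raise NotImplementedError(
                "Multiple tasks per job are not supported by this scheduler sketch."
            )
        self.parameters.update(kwargs)
```

Rejecting the parameter at configuration time, rather than at submission time, means the user sees the error before any job is queued.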

gwenzek commented 1 year ago

Hi, thanks for contributing this. The code looks good to me, but I don't have access to an OAR cluster to test it out, and won't have the knowledge to answer questions about OAR if users have issues. So I'd rather have this code in a separate repository. submitit does in fact have a plugin system that allows that. The process isn't documented because you're actually the first external user to make such a PR, but we already have a Meta internal plugin.

The steps are as follows:

setup(
    name="submitit_oar",
    install_requires=["submitit>=1.4.6"],
    ...
    entry_points={
        "submitit": "\n".join(
            [
                "",
                "executor = submitit_oar:OarExecutor",
                "job_environment = submitit_oar:OarJobEnvironment",
                "",
            ]
        )
    },
    zip_safe=False,
)
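To make the entry-point declaration above more concrete, here is a minimal sketch of what that string contains and how entry points registered under a group can be listed with the standard library. The helper names `parse_entry_point_spec` and `installed_entry_points` are made up for this example, and this is not necessarily how submitit itself performs the lookup.

```python
from importlib import metadata
from typing import Dict, Tuple

# The same string that the setup() call above builds for the "submitit" group.
ENTRY_POINT_SPEC = "\n".join(
    [
        "",
        "executor = submitit_oar:OarExecutor",
        "job_environment = submitit_oar:OarJobEnvironment",
        "",
    ]
)


def parse_entry_point_spec(spec: str) -> Dict[str, Tuple[str, str]]:
    """Map each entry-point name to its (module, attribute) target."""
    parsed = {}
    for raw_line in spec.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # the leading and trailing empty strings produce blank lines
        name, _, target = line.partition("=")
        module, _, attribute = target.strip().partition(":")
        parsed[name.strip()] = (module.strip(), attribute.strip())
    return parsed


def installed_entry_points(group: str = "submitit") -> Dict[str, str]:
    """List entry points of installed packages registered under `group`."""
    all_entry_points = metadata.entry_points()
    if hasattr(all_entry_points, "select"):
        # Newer Python versions expose a selectable collection.
        selected = all_entry_points.select(group=group)
    else:
        # Older versions (3.8/3.9) return a dict keyed by group name.
        selected = all_entry_points.get(group, [])
    return {entry_point.name: entry_point.value for entry_point in selected}
```

Calling `installed_entry_points()` in an environment where the plugin package is installed should list its `executor` and `job_environment` entries. In an environment without any plugin it simply returns an empty dict.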

If all works well, we can add an entry in the readme that points to your plugin.

ychiat35 commented 1 year ago

Hello, thanks for your review and your proposal about the plugin. Here is the repository: https://github.com/ychiat35/submitit_oar. I will try to add some CI/CD actions for tests and package releases.

About this point:

The code looks good to me, but I don't have access to an OAR cluster to test it out, and won't have the knowledge to answer questions about OAR if users have issues.

Have you thought about adding CI tests for OAR (and Slurm), similar to what is done for Slurm and SGE clusters in the Dask-jobqueue repository: https://github.com/dask/dask-jobqueue/blob/main/ci/slurm/docker-compose.yml ? It might be a good way to test real jobs launched on OAR/Slurm clusters.
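As a rough sketch of how such tests could coexist with the regular unit tests, an integration test can be skipped whenever the scheduler's submission command is not on the PATH, so it only runs inside a CI container that provides a cluster. The helper `scheduler_available` and the test class are hypothetical, and the test body is only a placeholder for a real job submission.

```python
import shutil
import unittest


def scheduler_available(command: str) -> bool:
    """Return True if the scheduler's submission command is on the PATH."""
    return shutil.which(command) is not None


@unittest.skipUnless(
    scheduler_available("oarsub"),
    "oarsub not found; OAR integration tests skipped",
)
class OarIntegrationTest(unittest.TestCase):
    def test_submission_command_is_present(self) -> None:
        # Placeholder check; a real test would submit a job and wait for it.
        self.assertTrue(scheduler_available("oarsub"))
```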

ychiat35 commented 8 months ago

Hello,

We'd like to inform you that we have successfully moved the submitit_oar plugin into the Grid5000 repositories, at this link: Grid5000/submitit_oar. Additionally, we have released a new version of the plugin on PyPI, accessible here: submitit_oar 1.1.1.

The integration of the submitit_oar plugin has been smooth, and it aligns seamlessly with submitit's plugin system.

To finalize the pull request, we'd like to confirm whether you're still fine with us submitting a PR to update the readme to mention our plugin.

Thanks a lot for your feedback.