SvenMarcus / hpc-rocket

https://svenmarcus.github.io/hpc-rocket/
MIT License
26 stars 3 forks source link

Work in progress: Support PBS scheduler #50

Open ignatiusm opened 4 months ago

ignatiusm commented 4 months ago

Related to issue #37.

Thanks for developing HPC Rocket! I'd like to help with adding features (specifically around adding PBS support). Before I go too far, I thought I'd submit this draft Pull Request to start a discussion.

This PR so far includes:

As noted in the title, this is very much 'work in progress'. Please excuse the partial implementation (particularly for config parsing, and conditions in hpcrocket/core/application.py). I wanted to get something started and then create the PR so we had a place to discuss things.

Please let me know if you are interested in contributions from the other side of the world, and if so, feel free to make any suggestions for how best to proceed. :)

SvenMarcus commented 4 months ago

Thanks a lot for the contribution! Work towards supporting PBS is very welcome as I don't have access to a cluster using it and haven't the time to set up a test environment myself.

I'll review your PR within the next week and will get back to you as soon as I can!

ignatiusm commented 4 months ago

Yay! I'm pleased this PR is welcome. I realise that I need to get the tests passing again, so I'll do some more work on that.

Also, I know a good source for dockerised PBS set up. Let me know if you'd like the details (I guess that would require #33 to be resolved first).

SvenMarcus commented 4 months ago

Yay! I'm pleased this PR is welcome. I realise that I need to get the tests passing again, so I'll do some more work on that.

I have a pretty good idea of what needs to be done for that, but it'll require some work to build that flexibility and still satisfy mypy. I think I can work on getting the existing tests working today, but we still need some for the PbsController and PbsBatchJob. Could you implement some test cases for those classes? We don't need a real PBS system for that, if you have a look in the test folder, there is a slurmoutput sub directory containing examples of common slurm output. It would be great if you could provide something similar for PBS.

Also, I know a good source for dockerised PBS set up. Let me know if you'd like the details (I guess that would require #33 to be resolved first).

That would be amazing! I've had some trouble with this guide, but maybe that's because I'm working with an Apple Silicon processor. I think we can also get around implementing the local executor if we simply enable the SSH daemon in the container.

SvenMarcus commented 4 months ago

I extended your PR to make the tests pass. Please rebase onto this branch before making more changes: https://github.com/SvenMarcus/hpc-rocket/tree/Garvan-Data-Science-Platform-add-pbs-controller

There are lots of changed files, but the changes mostly deal with introducing generalized Scheduler, BatchJob, JobStatus and TaskStatus classes and protocols (now found in the new core/schedulers package). After refactoring a bit, PbsBatchJob and SlurmBatchJob turned out to be exactly the same, so I deleted those and we can simply use the generalized BatchJob class now.

I have also slightly changed your proposed format of the config file by renaming file to script. It is now also a yaml object instead of an array.

job:
    script: jobfile.sh
    scheduler: slurm

That's because hpc-rocket already supported this (undocumented) format to avoid adding an extra copy instruction for the slurm script:

job:
   script: remote_jobfile.sh
   from: local_jobfile_path.sh
ignatiusm commented 4 months ago

Hi @SvenMarcus ,

Also, I know a good source for dockerised PBS set up. Let me know if you'd like the details (I guess that would require #33 to be resolved first).

That would be amazing! I've had some trouble with this guide, but maybe that's because I'm working with an Apple Silicon processor. I think we can also get around implementing the local executor if we simply enable the SSH daemon in the container.

I have borrowed the set up from the dask-jobqueue repo ci test set up. They have PBS, SLURM and SGE configurations. They are a bit more complex than is needed for here, but I've typically just followed the steps in the jobqueue_before_install function in pbs.sh. I haven't attempted to simplify this more yet.

SvenMarcus commented 4 months ago

I have borrowed the set up from the dask-jobqueue repo ci test set up. They have PBS, SLURM and SGE configurations. They are a bit more complex than is needed for here, but I've typically just followed the steps in the jobqueue_before_install function in pbs.sh. I haven't attempted to simplify this more yet.

Thanks! I'll have a look at it as soon as I find some time. That might take a while, though. But I think we can still move ahead with the PR if we add some tests with dummy PBS output.

ignatiusm commented 4 months ago

I have borrowed the set up from the dask-jobqueue repo ci test set up. They have PBS, SLURM and SGE configurations. They are a bit more complex than is needed for here, but I've typically just followed the steps in the jobqueue_before_install function in pbs.sh. I haven't attempted to simplify this more yet.

Thanks! I'll have a look at it as soon as I find some time. That might take a while, though. But I think we can still move ahead with the PR if we add some tests with dummy PBS output.

Apologies for the super slow progress with this. I've been caught up in other things too alas. I've pushed some dummy pbs output. I'll work on tests early next week.

SvenMarcus commented 3 months ago

Apologies for the super slow progress with this. I've been caught up in other things too alas. I've pushed some dummy pbs output. I'll work on tests early next week.

No need to rush, take all the time you need! I really appreciate that you're taking the time to contribute to hpc-rocket! 🙂

Let me know if you run into any trouble getting things working!