PySlurm / pyslurm

Python Interface to Slurm
https://pyslurm.github.io
GNU General Public License v2.0

Wait method for jobs / higher level job API #240

Closed · JonaOtto closed this issue 1 year ago

JonaOtto commented 2 years ago

Hello pyslurm developers, I work on an HPC performance tool for my university. We want to enable the tool to dispatch measurement runs of a target code to our cluster, which uses Slurm. Ideally, we want to use pyslurm for this. What we need is a way to:

  1. Dispatch jobs to the cluster: already possible with job.submit_batch_job.
  2. Wait for a job to finish, so that we can examine the results. Ideally there would be a blocking method like job.wait(job_id) that you could call to wait for the job (referenced by job_id) to finish. I'm a pyslurm newbie, but as far as I can tell, there is no such thing in pyslurm at the moment. It looks like such behavior could be built with some combination of the find, find_id, and get methods of the job class. (A sketch of the usage we have in mind follows this list.)
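
To make this concrete, here is roughly the workflow we are after. Note that job.wait is hypothetical (it does not exist in pyslurm today), and the options dict for submit_batch_job is a simplified placeholder:

```python
import pyslurm

job = pyslurm.job()

# submit_batch_job exists today; the options dict here is simplified
job_id = job.submit_batch_job({"script": "/path/to/measurement_job.sh"})

# job.wait() is the hypothetical blocking call this issue is asking for
job.wait(job_id)

# at this point the job has finished and we can examine the results
```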

What do you think the approach to this should be? Would it make sense to build such behavior into pyslurm, or is this something our tool should take care of itself?

I have to dive deeper into the code, but if there is something on this topic I can help with, I would be happy to do so. Generally, we would like to contribute back whatever knowledge we gain in the process, whether in code or not. Another possibility would be to first see how it turns out on our side, and then contribute back the code/interface we developed, or even just some notes for others on how we did it.

Thanks for doing this great project, I'm excited to hear your thoughts!

Best, Jonathan

tazend commented 2 years ago

Hi @JonaOtto

For the Job API, I'm currently working on #224 to rework the whole API structure a bit, to support more features and hopefully make it easier to interact with the job interface, i.e. supporting more methods like cancel, update, suspend, hold and so on. But there are still some things to do until it's done :)

Anyway, for your specific problem with the current codebase: sbatch also has a --wait flag, which blocks until the job terminates. I just had a look at how they do this here: they basically fetch the data for the specific job id in a loop and check whether the job is in a finished state.

This could easily be replicated in pyslurm, I guess: a function as you said, wait (or wait_finished), that wraps the functionality of find_id (which does slurm_load_job) and simply stays in a while-loop until it determines that the job has actually finished (using the IS_JOB_FINISHED(job) macro), at which point the blocking is released.
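
A rough sketch of what such a helper could look like in pure Python on top of the current API — the exact dict keys, state names, and find_id return shape are assumptions and may differ between pyslurm versions:

```python
import time

import pyslurm

# Terminal job states; an assumption, mirroring what IS_JOB_FINISHED covers in C
FINISHED_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL"}

def wait_finished(job_id, poll_interval=5):
    """Block until job_id reaches a finished state; return that state."""
    while True:
        # find_id wraps slurm_load_job and is assumed to return a list of job dicts
        job_info = pyslurm.job().find_id(str(job_id))[0]
        state = job_info["job_state"]
        if state in FINISHED_STATES:
            return state
        time.sleep(poll_interval)
```

A real implementation would probably also need to handle find_id failing once the job has aged out of slurmctld's memory, similar to how sbatch handles load errors in its wait loop.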

I could take a look at this when I have the time, otherwise if you want to give it a try and do a PR afterwards, go ahead :)

JonaOtto commented 2 years ago

Hi @tazend, thanks for the input! I did not know that sbatch could do this. I will think about it and see what I end up doing: either doing it in pyslurm, or, given that we really only need this small fraction of the whole API, just calling sbatch and using the --wait flag. In case I do it in pyslurm, you will get my PR (probably in the next week). In case we decide to use the flag, I will close this issue.
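
For reference, the sbatch-based fallback would be something as simple as this (the script path is a placeholder):

```python
import subprocess

# --wait makes sbatch block until the submitted job terminates; its exit code
# then reflects the job's exit status, which check=True surfaces as an exception
result = subprocess.run(
    ["sbatch", "--wait", "/path/to/measurement_job.sh"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # e.g. "Submitted batch job 12345"
```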