PySlurm / pyslurm

Python Interface to Slurm
https://pyslurm.github.io
GNU General Public License v2.0

Wait_finished method for job API (regarding #240) #242

Closed · JonaOtto closed 1 year ago

JonaOtto commented 2 years ago

Hello PySlurm developers, This is my proposal for a method to wait for a job to finish, regarding #240, for the 21.08 version.

I added the wait_finished method to the job class. It is implemented using the find_id method of the job class. I also added a test method. I've tested this in the container, as mentioned in the README. At least the test for wait_finished passes (some others do not, but those also fail without my changes). Sadly, it turns out the cluster I have access to does not run a Slurm version I can use with PySlurm, so I could not test it on the real cluster (but we will get a Slurm update in the next few weeks, so maybe I will try this again with a different version).

I'm looking forward to your thoughts and comments!

Best, Jonathan

tazend commented 2 years ago

Hi @JonaOtto

thanks for the addition. I think instead of merging into the 21.08 branch, it would be better to merge directly into main. After that, I can simply backport it to the 21.08 branch (and maybe even older branches). I'd say it's better this way around.

main is 22.05, though from looking at the code, it should work just as in 21.08. I have a 22.05 installation, so I can test it.

tazend commented 2 years ago

Hi @JonaOtto

just had another look at it. I think there might be a little problem with the logic though.

The problem is that the only condition tested for here is the state COMPLETED. A job that has finished, however, is not always in the COMPLETED state; there are multiple different ending states. For example, one of them is TIMEOUT.

If you call wait_finished on a job that lands in a TIMEOUT state, the function keeps blocking even though the job has technically ended (it is in a finished state), until the job is purged from the slurm controller's internal memory (a configurable time window, 300 seconds by default, I think). Once the job is purged, the function yields an error: the state is still TIMEOUT != COMPLETED, but the job no longer exists, so find_id raises a ValueError exception (which might be fine, though in this case it would occur solely because the function failed to detect that the job had actually ended).
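To make the failure mode concrete, here is a minimal Python simulation; all names here are hypothetical stand-ins for the real find_id-based loop:

```python
def old_wait(poll_results):
    """Simulates a wait loop that only accepts COMPLETED.

    poll_results stands in for successive find_id lookups;
    None simulates the job having been purged from the
    controller's memory.
    """
    for state in poll_results:
        if state is None:
            # find_id can no longer see the job
            raise ValueError("Job not found")
        if state == "COMPLETED":
            return
    # otherwise: keep polling forever

# The job ends in TIMEOUT and is later purged: the loop never
# sees COMPLETED and eventually errors out instead of returning.
try:
    old_wait(["RUNNING", "TIMEOUT", "TIMEOUT", None])
    blocked_until_error = False
except ValueError:
    blocked_until_error = True
assert blocked_until_error
```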

Instead, it would be more beneficial to work with the raw job pointer, and therefore the raw state info. I think we just need to extract the slurm_load_job logic from find_id into a separate internal function that simply loads the job and gives access to the raw job pointer, instead of the parsed dict. So basically this (not complete):

cdef _load_single_job(self, jobid):
    cdef:
        int apiError
        int rc

    if isinstance(jobid, int) or isinstance(jobid, long):
        jobid = str(jobid).encode("UTF-8")
    else:
        jobid = jobid.encode("UTF-8")

    jobid_xlate = slurm.slurm_xlate_job_id(jobid)
    rc = slurm.slurm_load_job(&self._job_ptr, jobid_xlate, self._ShowFlags)

    # ...error handling...

With the pointer, we can then do:

while not IS_JOB_FINISHED(self._job_ptr):
    ...

The IS_JOB_FINISHED function correctly determines if the job is in a finished state.
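For illustration, the macro essentially checks whether the job's base state is past SUSPENDED in Slurm's state enum. A rough Python sketch (the enum values here are assumed to loosely mirror slurm.h, not taken from it):

```python
from enum import IntEnum

class JobState(IntEnum):
    # Base job states, loosely mirroring slurm.h's enum job_states
    PENDING = 0
    RUNNING = 1
    SUSPENDED = 2
    COMPLETE = 3
    CANCELLED = 4
    FAILED = 5
    TIMEOUT = 6
    NODE_FAIL = 7

def is_job_finished(state):
    # Any base state beyond SUSPENDED counts as "finished"
    return state > JobState.SUSPENDED

assert not is_job_finished(JobState.RUNNING)
assert is_job_finished(JobState.COMPLETE)
assert is_job_finished(JobState.TIMEOUT)  # finished, though not COMPLETE
```

This is why checking the whole finished range, rather than COMPLETED alone, catches TIMEOUT, FAILED, CANCELLED and the other terminal states in one test.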

What do you think?

JonaOtto commented 2 years ago

Hi @tazend, Yeah, I actually tested a similar implementation yesterday, which used the IS_JOB_COMPLETED function. But to get to the pointer, I had to (re)use a lot of the logic as it is in find_id (like the snippet you wrote above). And since I did not know about these different finished states, I did not catch what was different from doing it the "python" way and thought it made no difference. What you say sounds good to me, and it is probably a good idea to extract the slurm_load_job functionality into a __load_single_job or similar :+1: This way it should work correctly and there will be no code repetition. I will try to correct it and update the branch/PR accordingly.

EDIT: typo

JonaOtto commented 2 years ago

Hello pyslurm devs, I have pushed my updates for the wait_finished method. It now uses the raw job pointer, as discussed before. I also did two additional things.

I tested this in the container again. Like before, the tests for wait_finished pass, while some others do not. Since we actually got an update to Slurm 21.08 on our cluster yesterday, I also ran some example scripts that use pyslurm with the wait_finished (and other) methods, the way I intend to use them, and everything seems to work for me. I've tested it with multiple configurations: single jobs, arrays, zero and non-zero exit codes.

Let me know what you think :) best regards Jonathan

tazend commented 2 years ago

Hi @JonaOtto

Except for the one thing I mentioned with the potential memory leak, which needs to be fixed, everything else looks good. Additionally, I'd like this to go directly into the main branch (instead of 21.08), and from there I can backport it to the 21.08 branch (seems better this way around). I think the target branch of the PR can be changed and it should still work.

JonaOtto commented 2 years ago

Hi @tazend, Sure, the base branch is changed. If this does not work for some reason, let me know; I could copy my changes to main and open a new PR (I think source branches cannot be changed). About the potential memory leak: I'm sorry, but I'm not sure which mention you are referring to. I guess I'm missing something, but I cannot work out what it is at the moment. Should I add slurm.slurm_free_job_info_msg(self._job_ptr) at the end of wait_finished?

tazend commented 1 year ago

Hi @JonaOtto

I mentioned the memory leak in line 2921 (I hope my comment shows up there as a review?)

Yes, after line 2934 (still within the while loop, but after the for loop) you need to add slurm.slurm_free_job_info_msg(self._job_ptr). Without that, each new check of whether the job has finished leaks more memory, because the old job pointer is never freed during the while loop.
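Shaped as pure Python, the corrected loop pairs every load with a free. The helper names here are hypothetical stand-ins for _load_single_job, IS_JOB_FINISHED and slurm_free_job_info_msg:

```python
import time

def wait_finished(load_job, job_is_finished, free_job, poll_interval=0.0):
    """Poll until the job reaches any terminal state.

    load_job / job_is_finished / free_job stand in for
    _load_single_job, IS_JOB_FINISHED and slurm_free_job_info_msg.
    """
    while True:
        ptr = load_job()
        finished = job_is_finished(ptr)
        free_job(ptr)  # free on every iteration, or memory leaks
        if finished:
            return
        time.sleep(poll_interval)

# Simulated job that finishes (TIMEOUT) on the third poll:
states = iter(["RUNNING", "RUNNING", "TIMEOUT"])
freed = []
wait_finished(
    load_job=lambda: {"state": next(states)},
    job_is_finished=lambda p: p["state"] in {"COMPLETED", "TIMEOUT"},
    free_job=freed.append,
)
assert len(freed) == 3  # every loaded pointer was freed
```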

JonaOtto commented 1 year ago

Hi @tazend

Oh okay, maybe that's on me; I'm still a noob when it comes to all these cool GitHub features. I do not see your comment, but maybe I did not know where to look.

Anyway, we were thinking about the same thing; give me a minute, I will do the fix.

EDIT: typo

JonaOtto commented 1 year ago

Okay, fix pushed. Each _load_single_job call should now be paired with a slurm_free_job_info_msg call. Thanks for all your help @tazend!

tazend commented 1 year ago

@JonaOtto

alright, everything looks good! Yeah, it was my fault: I added the comment but didn't actually submit the review, so it didn't show up for you.

Anyway, we can merge it now, thanks for the contribution :)

JonaOtto commented 1 year ago

Thanks for merging :)