facebookincubator / submitit

Python 3.8+ toolbox for submitting jobs to Slurm
MIT License

Access some information about the job when reloading it #34

Open leonardblier opened 3 years ago

leonardblier commented 3 years ago

Hi!

Would it be possible to have access to some information about a job when reloading a Job with its job_id?

My use case is the following: I launched a lot of jobs, and I want to plot some metrics I logged. Most of the time, I only care about the jobs I just launched, or the jobs I launched the day before. Therefore, I would need to filter my jobs by launch time. As far as I can tell, this is not currently possible.

Other information might be interesting, for instance knowing whether a job has been preempted, since this is a common bug source.

I tag @jrapin here because I talked with him about this feature.

jrapin commented 3 years ago

The way I see it, we can easily get the start time and the preemption times from the logs. Submission time is harder: either we append it to the log manually, or we add it to the DelayedFunction object (although accessing it would require loading the pickle, which may be heavy, and it would not be preemption-proof, so I'm not sure). Also, I have no clear idea of an API for that. Any thoughts, @gwenzek?

gwenzek commented 3 years ago

If we are talking about Slurm, then sacct already knows all the information we want (and more) about the job: start time, end time, CPU utilization, disk reads and writes, ... Maybe we could add a Python API to expose this. But that may be over-engineering, and it would be pretty Slurm-specific.

@leonardblier what are you doing with your jobs? And how do you find the list of past jobs? I ask because, to get the time, you can just look at the timestamp of job.paths.submission_file.
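
The timestamp approach above can be sketched with the standard library alone. This is a minimal illustration, not part of submitit: the helper name submitted_since is hypothetical, the file names in the demo are made up, and it assumes the submission file is not modified after submission, so that its modification time approximates the submission time.

```python
import os
import tempfile
import time
from datetime import datetime, timedelta
from pathlib import Path
from typing import Iterable, List


def submitted_since(submission_files: Iterable[Path], cutoff: datetime) -> List[Path]:
    """Keep files last modified at or after `cutoff`, oldest first.

    Hypothetical helper: it relies on the modification time of each file
    as a stand-in for the submission time of the corresponding job.
    """
    cutoff_ts = cutoff.timestamp()
    recent = [p for p in submission_files if p.stat().st_mtime >= cutoff_ts]
    return sorted(recent, key=lambda p: p.stat().st_mtime)


# Demo on a throwaway folder containing two fake submission files.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "111_submission.sh"  # made-up file name
    new = Path(tmp) / "222_submission.sh"  # made-up file name
    old.touch()
    new.touch()
    # Pretend the first job was submitted three days ago.
    three_days_ago = time.time() - 3 * 24 * 3600
    os.utime(old, (three_days_ago, three_days_ago))
    # Keep only the jobs submitted during the last day.
    cutoff = datetime.now() - timedelta(days=1)
    recent_names = [p.name for p in submitted_since([old, new], cutoff)]

print(recent_names)
```

With real jobs, the list passed to the helper would be built from job.paths.submission_file for each reloaded job. This avoids any call to the cluster, but it only recovers the submission time, not the start or preemption times.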

jrapin commented 3 years ago

I would be careful to avoid extra calls to the cluster, unless everything goes through the watcher.

gwenzek commented 3 years ago

Adding a Python API would be as easy as reading a self.sacct_fields attribute in the SlurmInfoWatcher and using it here instead of "JobID,State,NodeList". Then one could modify the list of fields through job.watcher.sacct_fields.extend(["TresUsageInMax", "TresUsageInAve"]) and read them through job.get_info()["TresUsageInMax"].

See the following commit that added NodeList: https://github.com/facebookincubator/submitit/pull/1615/commits/19b3487384b333c0653566db6ebd3da9d9af65ec#diff-1d3775b96c8f577427238099cf12f582b460fac12b9c2bb7c7f66abdceb6db49R50
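
The configurable field list proposed above could look roughly like the sketch below. It is a standalone illustration, not submitit's actual SlurmInfoWatcher: the class name SacctFieldsSketch and its methods make_command and parse are hypothetical, and the sample output is hand-written rather than captured from a real cluster. It only assumes that sacct accepts -o, --parsable2 and -j, and that --parsable2 prints a header line followed by "|"-separated rows.

```python
from typing import Dict, List


class SacctFieldsSketch:
    """Hypothetical stand-in for a watcher with a configurable field list."""

    def __init__(self) -> None:
        # Default fields quoted in the thread; users may extend this list.
        self.sacct_fields: List[str] = ["JobID", "State", "NodeList"]

    def make_command(self, job_ids: List[str]) -> List[str]:
        """Build a single sacct call covering all the given job ids."""
        return [
            "sacct",
            "-o", ",".join(self.sacct_fields),
            "--parsable2",
            "-j", ",".join(job_ids),
        ]

    def parse(self, output: str) -> Dict[str, Dict[str, str]]:
        """Turn '|'-separated output (header line first) into one dict per JobID."""
        lines = [line for line in output.splitlines() if line.strip()]
        if not lines:
            return {}
        header = lines[0].split("|")
        infos: Dict[str, Dict[str, str]] = {}
        for line in lines[1:]:
            row = dict(zip(header, line.split("|")))
            infos[row["JobID"]] = row
        return infos


watcher = SacctFieldsSketch()
watcher.sacct_fields.extend(["TresUsageInMax", "TresUsageInAve"])
command = watcher.make_command(["1234"])

# Hand-written sample in the expected layout, not real sacct output.
sample = (
    "JobID|State|NodeList|TresUsageInMax|TresUsageInAve\n"
    "1234|COMPLETED|node1|cpu=00:01:00|cpu=00:00:30\n"
)
info = watcher.parse(sample)
print(command)
print(info["1234"]["TresUsageInMax"])
```

Because the extra fields ride along in the same sacct call, this design adds no additional calls to the cluster, which addresses the concern raised earlier in the thread.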