PySlurm / pyslurm

Python Interface to Slurm
https://pyslurm.github.io
GNU General Public License v2.0

A job is found with JobFilter if it is running on start_time #319

Closed: steenlysgaard closed this issue 10 months ago

steenlysgaard commented 10 months ago

Issue

I am considering using pyslurm to gather statistics for our cluster; however, I have found a non-ideal behaviour that makes collecting statistics a little cumbersome. I would like to get the cluster usage per month, so I fetch the jobs started in that month, do some sums, then move on to the next month, and so on. However, I found that a job that is still running when one month rolls over into the next is counted in both months.

This small example, which can be run on a test cluster, demonstrates it:

import pyslurm
import time
from datetime import datetime, timedelta

# Set up a job - only for show
sjob = pyslurm.JobSubmitDescription(script='#!/bin/bash\nsleep 4\n')
job_id = sjob.submit()

# Wait for the job to finish
time.sleep(6.1)
job = pyslurm.db.Job.load(job_id=job_id)

# Establish some times
start_time = datetime.fromtimestamp(job.start_time)
mid_time = start_time + timedelta(seconds=2)
before_start_time = start_time - timedelta(seconds=6)
after_end_time = start_time + timedelta(seconds=6)

# Find jobs starting after "before_start_time"
job_filter = pyslurm.db.JobFilter(start_time=before_start_time, end_time=mid_time)
jobs = pyslurm.db.Jobs()
db_jobs = jobs.load(db_filter=job_filter)
print(db_jobs)
print(db_jobs[job_id].stats.elapsed_cpu_time)

# Find jobs starting after "mid_time" - should be empty
job_filter = pyslurm.db.JobFilter(start_time=mid_time, end_time=after_end_time)
db_jobs = jobs.load(db_filter=job_filter)
print(db_jobs)
print(db_jobs[job_id].stats.elapsed_cpu_time)

Note that the issue also occurs on our cluster, where the overlaps span hours and days.

Also note that the old API (slurmdb_jobs) returns the job in both time intervals as well; however, there the elapsed time is set to the amount of time the job ran within each interval. I don't know if this is the correct way of handling it, but at least it makes the elapsed-time statistics correct (though not the job counts).
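
For reference, the old-API query looked roughly like this. A minimal sketch, assuming the old slurmdb_jobs interface takes UTF-8-encoded timestamp strings and returns a plain dict keyed by job ID (the 'elapsed' field name is from memory and may differ):

import pyslurm

# Sketch: query the old accounting API for the same interval.
start = "2023-08-31T13:01:03".encode("utf-8")
end = "2023-08-31T13:01:07".encode("utf-8")
old_jobs = pyslurm.slurmdb_jobs().get(starttime=start, endtime=end)

# A job spanning the interval boundary is returned here too, but its
# elapsed time is clipped to the portion inside [start, end].
for job_id, info in old_jobs.items():
    print(job_id, info.get("elapsed"))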

tazend commented 10 months ago

Hi @steenlysgaard

Could you perhaps check how things look when you apply the same logic with sacct? In sacct, CPUTimeRaw is the field that corresponds to elapsed_cpu_time in pyslurm.

I'll also take your example and check the issue on my cluster to see what's going on.

steenlysgaard commented 10 months ago

Hi @tazend

Thanks, I can confirm that sacct behaves exactly like the new API:

> sacct --starttime=2023-08-31T13:01:03 --endtime=2023-08-31T13:01:07 -o JobID,Start,End,CPUTimeRAW
JobID                      Start                 End CPUTimeRAW 
------------ ------------------- ------------------- ---------- 
9            2023-08-31T13:01:05 2023-08-31T13:01:09          4 
9.batch      2023-08-31T13:01:05 2023-08-31T13:01:09          4 
> sacct --starttime=2023-08-31T13:01:07 --endtime=2023-08-31T13:01:11 -o JobID,Start,End,CPUTimeRAW
JobID                      Start                 End CPUTimeRAW 
------------ ------------------- ------------------- ---------- 
9            2023-08-31T13:01:05 2023-08-31T13:01:09          4 
9.batch      2023-08-31T13:01:05 2023-08-31T13:01:09          4 
> sacct --starttime=2023-08-31T13:01:11 --endtime=2023-08-31T13:01:15 -o JobID,Start,End,CPUTimeRAW
JobID                      Start                 End CPUTimeRAW 
------------ ------------------- ------------------- ---------- 

I noticed that in sacct --help it says:

-S, --starttime:                                                       
                   Select jobs eligible after this time. ...

By eligible, they apparently mean any job that was queued or running at that time.

tazend commented 10 months ago

Hi,

I see. I think what is needed is the -T / --truncate flag in sacct:

-T, --truncate
                 Truncate time.  So if a job started before --starttime the start time would be truncated to --starttime. 
                 The same for end time and --endtime.

The job would still be found in both months; however, the value for CPUTimeRaw is correctly adjusted to the actual time span the user requested, e.g.:

> sacct -T -S 2023-09-1T12:51:30 -E 2023-09-1T12:55:13 -o JobID,Start,End,CPUTimeRaw,TotalCPU
JobID                      Start                 End CPUTimeRAW   TotalCPU 
------------ ------------------- ------------------- ---------- ---------- 
277383       2023-09-01T12:51:35 2023-09-01T12:55:13        218  10:18.002 
277383.batch 2023-09-01T12:51:35 2023-09-01T12:55:13        218  10:18.002 

> sacct -T -S 2023-09-1T12:51:30 -E 2023-09-1T12:54:13 -o JobID,Start,End,CPUTimeRaw,TotalCPU
JobID                      Start                 End CPUTimeRAW   TotalCPU 
------------ ------------------- ------------------- ---------- ---------- 
277383       2023-09-01T12:51:35 2023-09-01T12:54:13        158  10:18.002 
277383.batch 2023-09-01T12:51:35 2023-09-01T12:54:13        158  10:18.002 

(The actual CPU efficiency, i.e. TotalCPU, can't be truncated to the time interval, though.)
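
To spell out the arithmetic: sacct computes CPUTimeRaw as elapsed time multiplied by the number of allocated CPUs, so assuming the job above was allocated a single CPU, the truncated second query yields 12:54:13 - 12:51:35 = 158 seconds × 1 CPU = 158, matching the output.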

Can you confirm that this is how you would like it to be? If so, I'd go ahead and implement this option.

steenlysgaard commented 10 months ago

Yes, you are right, the truncate option would work for my application.

Thanks!

tazend commented 10 months ago

Hi,

I just added a truncate_time option on this branch (for Slurm 23.02) in the pyslurm.db.JobFilter class:

import pyslurm
job_filter = pyslurm.db.JobFilter(truncate_time=True)
...

Additionally, I added some new convenience attributes to the pyslurm.db.Jobs class, so that certain statistics about the jobs in the collection are automatically summed up after retrieval, for example:

import pyslurm
db_jobs = pyslurm.db.Jobs.load()

print(db_jobs.elapsed_cpu_time)
print(db_jobs.cpus)
print(db_jobs.memory)
...
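
Putting the two together, here is a minimal sketch of the original monthly-statistics use case (the month-window arithmetic is mine, and I'm assuming the collection-level elapsed_cpu_time sums the truncated per-job values):

import pyslurm
from datetime import datetime

def monthly_elapsed_cpu_time(year, month):
    # Month window: first day of the month up to the first day of
    # the following month (rolling over the year at December).
    start = datetime(year, month, 1)
    end = datetime(year + (month == 12), month % 12 + 1, 1)

    # truncate_time clips jobs that span a month boundary, so each
    # job only contributes the part that falls inside the window.
    job_filter = pyslurm.db.JobFilter(
        start_time=start, end_time=end, truncate_time=True)
    db_jobs = pyslurm.db.Jobs.load(db_filter=job_filter)
    return db_jobs.elapsed_cpu_time

print(monthly_elapsed_cpu_time(2023, 8))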

Feel free to test out the branch; I'll push it to master soon, after adding documentation etc.

steenlysgaard commented 10 months ago

I just tried out the branch and it works as expected. Furthermore, the new attributes make gathering the statistics a little simpler.

Thanks!