Closed: steenlysgaard closed this 10 months ago
Hi @steenlysgaard,
could you perhaps check how things look when you apply the same logic using sacct? In sacct, it is CPUTimeRaw that corresponds to elapsed_cpu_time in pyslurm.
I'll also take your example and check out the issue on my cluster to see what's going on.
Hi @tazend,
Thanks, I can confirm that sacct behaves exactly like the new API:
> sacct --starttime=2023-08-31T13:01:03 --endtime=2023-08-31T13:01:07 -o JobID,Start,End,CPUTimeRAW
JobID Start End CPUTimeRAW
------------ ------------------- ------------------- ----------
9 2023-08-31T13:01:05 2023-08-31T13:01:09 4
9.batch 2023-08-31T13:01:05 2023-08-31T13:01:09 4
> sacct --starttime=2023-08-31T13:01:07 --endtime=2023-08-31T13:01:11 -o JobID,Start,End,CPUTimeRAW
JobID Start End CPUTimeRAW
------------ ------------------- ------------------- ----------
9 2023-08-31T13:01:05 2023-08-31T13:01:09 4
9.batch 2023-08-31T13:01:05 2023-08-31T13:01:09 4
> sacct --starttime=2023-08-31T13:01:11 --endtime=2023-08-31T13:01:15 -o JobID,Start,End,CPUTimeRAW
JobID Start End CPUTimeRAW
------------ ------------------- ------------------- ----------
I noticed that sacct --help says:
-S, --starttime:
Select jobs eligible after this time. ...
By "eligible", they apparently mean any job that is queued or running at that time.
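One way to picture this selection rule is as a simple interval-overlap test: a job is returned whenever its run interval overlaps the query window at all. This is only a sketch of the apparent behaviour, not sacct's actual implementation, and the exact boundary semantics (strict vs. inclusive comparison) may differ:

```python
from datetime import datetime

def is_selected(job_start, job_end, window_start, window_end):
    """A job is picked up if its run interval overlaps the query
    window at all, i.e. it was running at some point inside it."""
    return job_start < window_end and job_end > window_start

# Job 9 from the output above ran 13:01:05 - 13:01:09.
start = datetime(2023, 8, 31, 13, 1, 5)
end = datetime(2023, 8, 31, 13, 1, 9)

# The first two query windows overlap the run interval, the third does
# not, matching the three sacct calls above.
print(is_selected(start, end, datetime(2023, 8, 31, 13, 1, 3), datetime(2023, 8, 31, 13, 1, 7)))   # True
print(is_selected(start, end, datetime(2023, 8, 31, 13, 1, 7), datetime(2023, 8, 31, 13, 1, 11)))  # True
print(is_selected(start, end, datetime(2023, 8, 31, 13, 1, 11), datetime(2023, 8, 31, 13, 1, 15))) # False
```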
Hi,
I see. I think what is needed is the -T / --truncate flag in sacct:
-T, --truncate
Truncate time. So if a job started before --starttime the start time would be truncated to --starttime.
The same for end time and --endtime.
The job would still be found in both months; however, the value for CPUTimeRaw is correctly adjusted to the time span the user actually requested, e.g.:
> sacct -T -S 2023-09-1T12:51:30 -E 2023-09-1T12:55:13 -o JobID,Start,End,CPUTimeRaw,TotalCPU
JobID Start End CPUTimeRAW TotalCPU
------------ ------------------- ------------------- ---------- ----------
277383 2023-09-01T12:51:35 2023-09-01T12:55:13 218 10:18.002
277383.batch 2023-09-01T12:51:35 2023-09-01T12:55:13 218 10:18.002
> sacct -T -S 2023-09-1T12:51:30 -E 2023-09-1T12:54:13 -o JobID,Start,End,CPUTimeRaw,TotalCPU
JobID Start End CPUTimeRAW TotalCPU
------------ ------------------- ------------------- ---------- ----------
277383 2023-09-01T12:51:35 2023-09-01T12:54:13 158 10:18.002
277383.batch 2023-09-01T12:51:35 2023-09-01T12:54:13 158 10:18.002
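The adjustment that -T performs can be reproduced with a few lines of plain Python. This is a sketch of the arithmetic only, not sacct's actual code, and it assumes the job above had 1 CPU allocated (CPUTimeRaw is elapsed seconds times allocated CPUs):

```python
from datetime import datetime

def truncated_cputime_raw(job_start, job_end, win_start, win_end, ncpus=1):
    """Clamp the job's interval to the query window, then multiply the
    remaining elapsed seconds by the number of allocated CPUs."""
    start = max(job_start, win_start)
    end = min(job_end, win_end)
    elapsed = max(0, int((end - start).total_seconds()))
    return elapsed * ncpus

# Job 277383 above: started 12:51:35, ended 12:55:13.
job_start = datetime(2023, 9, 1, 12, 51, 35)
job_end = datetime(2023, 9, 1, 12, 55, 13)

# Full window: 12:55:13 - 12:51:35 = 218 s, matching the first sacct call.
print(truncated_cputime_raw(job_start, job_end,
                            datetime(2023, 9, 1, 12, 51, 30),
                            datetime(2023, 9, 1, 12, 55, 13)))  # 218

# Shorter window ending 12:54:13: 158 s, matching the second call.
print(truncated_cputime_raw(job_start, job_end,
                            datetime(2023, 9, 1, 12, 51, 30),
                            datetime(2023, 9, 1, 12, 54, 13)))  # 158
```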
(The actual CPU efficiency, i.e. TotalCPU, can't be truncated to the time interval, though.)
Can you confirm that this is how you would like it to be? Then I'd go ahead and implement this option.
Yes, you are right, the truncate option would work for my application.
Thanks!
Hi,
just added a truncate_time option on this branch (for Slurm 23.02) in the pyslurm.db.JobFilter class:
import pyslurm
job_filter = pyslurm.db.JobFilter(truncate_time=True)
...
Additionally, I also added some new attributes to the pyslurm.db.Jobs class for convenience, so some statistics about the jobs in that collection are automatically summed up after retrieval, for example:
import pyslurm
db_jobs = pyslurm.db.Jobs.load()
print(db_jobs.elapsed_cpu_time)
print(db_jobs.cpus)
print(db_jobs.memory)
...
Feel free to test out the branch; I'll push it to master soon, after adding documentation etc.
I just tried out the branch and it works as expected. Furthermore, the new attributes make gathering the statistics a little simpler.
Thanks!
Details
Issue
I am considering using pyslurm to gather statistics for our cluster; however, I have found some non-ideal behaviour that makes collecting statistics a little cumbersome. I would like to get the cluster usage per month, so I fetch the jobs started in that month, do some sums, then move on to the next month, and so on. However, I found that a job that is running when one month turns into the next is counted in both months.
This small example shows it. It can run on a test cluster:
Note that the issue also occurs on our cluster where the overlap times are hours and days.
Also note that the old API (slurmdb_jobs) likewise returns the job in both time intervals; however, there the elapsed time is set to the amount of time the job ran within each interval. I don't know if this is the correct way of handling it, but at least it makes the elapsed-time statistics correct (though not the number of jobs).