NordicHPC / slurm2sql

Dump slurm accounting database to sqlite3 database for easy analysis
MIT License

Errors after running history-resume (Heterogeneous jobs) #15

Open mrhawkin opened 1 year ago

mrhawkin commented 1 year ago

I got this error the first time I ran history-resume (it did not show up right away, but only after running for some time):

Traceback (most recent call last):
  File "/cluster/home/haagen/slurm/slurm2sql.py", line 798, in <module>
    exit(main(sys.argv[1:]))
  File "/cluster/home/haagen/slurm/slurm2sql.py", line 564, in main
    errors = get_history(db, sacct_filter=sacct_filter,
  File "/cluster/home/haagen/slurm/slurm2sql.py", line 635, in get_history
    errors += slurm2sql(db, sacct_filter=new_filter, update=True, jobs_only=jobs_only,
  File "/cluster/home/haagen/slurm/slurm2sql.py", line 746, in slurm2sql
    processed_line = {k.strip('_'): (columns[k](line[k])

When I tried again I got this:

Traceback (most recent call last):
  File "/cluster/home/haagen/slurm/slurm2sql.py", line 798, in <module>
    exit(main(sys.argv[1:]))
  File "/cluster/home/haagen/slurm/slurm2sql.py", line 564, in main
    errors = get_history(db, sacct_filter=sacct_filter,
  File "/cluster/home/haagen/slurm/slurm2sql.py", line 635, in get_history
    errors += slurm2sql(db, sacct_filter=new_filter, update=True, jobs_only=jobs_only,
  File "/cluster/home/haagen/slurm/slurm2sql.py", line 746, in slurm2sql
    processed_line = {k.strip('_'): (columns[k](line[k])
  File "/cluster/home/haagen/slurm/slurm2sql.py", line 749, in <dictcomp>
    else columns[k].calc(line))
  File "/cluster/home/haagen/slurm/slurm2sql.py", line 317, in calc
    return int(row['JobID'].split('_')[0].split('.')[0])
ValueError: invalid literal for int() with base 10: '18058278+0'
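For reference, the failure shown at slurm2sql.py line 317 can be reproduced in isolation. This is only a minimal sketch of what the traceback reports, not code taken from the repository:

    # Hypothetical reproduction: the parser strips array ('_') and step ('.')
    # suffixes from the JobID, but not the '+N' component suffix used by
    # heterogeneous jobs, so int() receives a non-numeric string.
    jobid = '18058278+0'
    int(jobid.split('_')[0].split('.')[0])
    # ValueError: invalid literal for int() with base 10: '18058278+0'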
rkdarst commented 2 months ago

I think this is about "heterogeneous jobs". I recently (months ago?) saw this in some of our history: a + can appear in JobIDs to distinguish the different components of a heterogeneous job. As an immediate workaround I made it ignore these, so the results will be wrong or contain duplicates, but at least it succeeds.

Our clusters' users don't use them often, so I could ignore them for our statistics. But it might be good to add some handling someday... if anyone needs it, please ask.
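If handling is added someday, one option would be to also strip the heterogeneous-job component suffix when extracting the numeric base JobID. A minimal sketch of that idea follows; the function name and behaviour are assumptions for illustration, not slurm2sql's current API or workaround:

    def parse_jobid(jobid_str):
        """Return the numeric base JobID, stripping array task ('_'),
        job step ('.'), and heterogeneous-job component ('+') suffixes.
        Hypothetical helper, not part of slurm2sql."""
        base = jobid_str.split('_')[0].split('.')[0].split('+')[0]
        return int(base)

    # parse_jobid('18058278+0')       -> 18058278
    # parse_jobid('18058278_5.batch') -> 18058278

Note that collapsing '+0', '+1', ... components onto the same base JobID would still merge the components of a heterogeneous job into one record, so proper support would also need to keep them distinguishable somehow.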