TUM-DAML / seml

SEML: Slurm Experiment Management Library

Error is thrown when throttled SLURM job arrays that are not related to SEML are present #47

Closed: dobraczka closed this issue 3 years ago

dobraczka commented 3 years ago

Expected Behavior

Experiments are running with SEML while other, separate throttled SLURM job arrays are running. seml [db_name] status should return the status.

Actual Behavior

A ValueError is thrown instead.

Steps to Reproduce the Problem

  1. Have throttled SLURM job arrays (e.g. submitted with --array=0-4%1) running or pending
  2. Execute seml [db_name] status

Error message:

Traceback (most recent call last):
  File "[...]./local/bin/seml", line 10, in <module>
    sys.exit(main())
  File "[...].local/lib/python3.7/site-packages/seml/main.py", line 231, in main
    f(**args.__dict__)
  File "[...]/.local/lib/python3.7/site-packages/seml/manage.py", line 17, in report_status
    detect_killed(db_collection_name, print_detected=False)
  File "[...]/.local/lib/python3.7/site-packages/seml/manage.py", line 263, in detect_killed
    running_jobs = get_slurm_arrays_tasks()
  File "[...]/.local/lib/python3.7/site-packages/seml/manage.py", line 317, in get_slurm_arrays_tasks
    job_dict[array_id][0].append(range(int(lower), int(upper) + 1))
ValueError: invalid literal for int() with base 10: b'4%1'
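
The failing conversion can be reproduced in isolation. This is a minimal sketch, assuming the array-index field looks like `0-4%1`; only the `4%1` part is taken from the traceback, and the lower bound `0` and the variable names are hypothetical.

```python
# Hypothetical array-index field of a throttled job array.
# Only the b'4%1' part appears in the traceback; the lower bound is assumed.
spec = b"0-4%1"

# Splitting on "-" leaves the throttle suffix attached to the upper bound.
lower, upper = spec.split(b"-")

error = None
try:
    # Same pattern as the failing line: int(upper) receives b'4%1'.
    indices = range(int(lower), int(upper) + 1)
except ValueError as exc:
    error = str(exc)

print(error)
```

The `%1` suffix is the limit on simultaneously running tasks, so it has to be separated from the upper bound before the conversion to `int`.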

Specifications

Details

- Version: 0.3.4
- Python version: Python 3.7.1
- Platform: CentOS 7
- Anaconda environment (`conda list`): -

dobraczka commented 3 years ago

I have opened a pull request with a simple fix that I have been using for this problem.

gasteigerjo commented 3 years ago

Thank you for highlighting this!

I actually wanted to support this functionality a while ago, but then forgot about it. I've just fixed it all in b70fc9f. You can now restrict the number of simultaneous jobs per job array with the max_simultaneous_jobs config option. And we now correctly handle job arrays with %X.
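
For illustration only, handling the `%X` suffix could look roughly like the sketch below. It is not the code from b70fc9f: the function name `parse_array_spec` and its return format are hypothetical, and it assumes the index field has the form `lower-upper`, a single index, or a comma-separated list of those, optionally followed by `%X`. Step sizes such as `0-15:4` are not handled.

```python
def parse_array_spec(spec):
    """Split an array-index field into index ranges and an optional throttle limit.

    Hypothetical helper: accepts e.g. b"0-4%1", "7" or "1,3-5%2".
    """
    if isinstance(spec, bytes):
        spec = spec.decode()

    # Everything after "%" is the limit on simultaneously running tasks.
    spec, _, limit_str = spec.partition("%")
    limit = int(limit_str) if limit_str else None

    ranges = []
    for part in spec.split(","):
        lower, _, upper = part.partition("-")
        upper = upper or lower  # a single index has no "-upper" part
        ranges.append(range(int(lower), int(upper) + 1))
    return ranges, limit


print(parse_array_spec(b"0-4%1"))
```

With this sketch, `parse_array_spec(b"0-4%1")` returns `([range(0, 5)], 1)`, and a field without a `%` suffix yields a limit of `None`.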

Hrovatin commented 1 year ago

I also got this error, and I am using a newer version of seml (see my versions in https://github.com/TUM-DAML/seml/issues/107).

gasteigerjo commented 1 year ago

It would be great if you could open a new issue for this and provide more context on your error message. Your error can't be due to the same underlying bug since it has long been fixed.