TUM-DAML / seml

SEML: Slurm Experiment Management Library

seml silently kills experiments when observing them with the status command while they are pending in Slurm #90

Closed heborras closed 2 years ago

heborras commented 2 years ago

Hi all,

it seems like something strange is going on with the status command and pending jobs on our Slurm cluster. I'm not entirely sure why this happens, but experiments appear to be killed silently when seml tries to determine whether they have been killed externally, while they are in fact still pending in the Slurm queue. It might be an issue with how Slurm displays the jobs on our cluster, or it might be something else that I'm not quite seeing.

Expected Behavior

Observing experiments using seml [db_collection_name] status during execution should show how the jobs move from staged to pending to running to completed for the example experiment.

Actual Behavior

Running seml [db_collection_name] status kills pending jobs silently. Without running the command, the jobs run as expected. The issue likely happens when the status command tries to detect killed experiments in this line: https://github.com/TUM-DAML/seml/blob/7d9352e51c9a83b77aa30617e8926863f0f48797/seml/manage.py#L22

Steps to Reproduce the Problem

  1. Install version 0.3.6 of seml and clone the git repository
  2. Start a second terminal window running watch seml seml_example status.
  3. In the first terminal window, add the example jobs to seml with seml seml_example add example_config.yaml; the jobs now appear in the staged section in the second terminal window.
  4. In the first terminal window, start the jobs with seml seml_example start; the jobs now appear in squeue, but they also immediately show up in the killed section in the second terminal window.
  5. No or very few experiments complete.

To get the Slurm jobs to start on our cluster, I had to change the partition to exercise in example_config.yaml and reduce the maximum Slurm time to 2 hours.

Error message:

Most logs look something like this:

(seml) [hborras@ceg-octane logs]$ cat example_experiment_68391_10.out
Starting job 68402
SLURM assigned me the node(s): ceg-brook01
WARNING: Experiment with ID 11 does not have status PENDING and will not be run.
Experiments are running under the following process IDs:

Here ceg-brook01 is one of our GPU nodes.

The squeue output looks something like this during execution:

[hborras@ceg-octane ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           68391_4  exercise example_  hborras CG       0:05      1 ceg-brook01
           68391_6  exercise example_  hborras CG       0:05      1 ceg-brook01
     68391_[10-71]  exercise example_  hborras PD       0:00      1 (Resources)
           68391_9  exercise example_  hborras  R       0:00      1 ceg-brook02
           68391_7  exercise example_  hborras  R       0:01      1 ceg-brook02
           68391_8  exercise example_  hborras  R       0:01      1 ceg-brook02
           68391_5  exercise example_  hborras  R       0:05      1 ceg-brook01

Specifications

Details

- Version: 0.3.6
- Python version: 3.7
- Platform:
```
(seml) [hborras@ceg-octane examples]$ uname -a
Linux ceg-octane 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
```
- Anaconda environment (`conda list`):
```
(seml) [hborras@ceg-octane examples]$ conda list
# packages in environment at /home/hborras/.conda/envs/seml:
#
# Name                    Version          Build            Channel
_libgcc_mutex             0.1              main
_openmp_mutex             4.5              1_gnu
ca-certificates           2021.10.26       h06a4308_2
certifi                   2021.10.8        py37h06a4308_2
colorama                  0.4.4            pypi_0           pypi
debugpy                   1.5.1            pypi_0           pypi
docopt                    0.6.2            pypi_0           pypi
gitdb                     4.0.9            pypi_0           pypi
gitpython                 3.1.27           pypi_0           pypi
importlib-metadata        4.11.2           pypi_0           pypi
jsonpickle                1.5.2            pypi_0           pypi
libedit                   3.1.20210910     h7f8727e_0
libffi                    3.2.1            hf484d3e_1007
libgcc-ng                 9.3.0            h5101ec6_17
libgomp                   9.3.0            h5101ec6_17
libstdcxx-ng              9.3.0            hd4cf53a_17
munch                     2.5.0            pypi_0           pypi
ncurses                   6.3              h7f8727e_2
numpy                     1.21.5           pypi_0           pypi
openssl                   1.0.2u           h7b6447c_0
packaging                 21.3             pypi_0           pypi
pandas                    1.1.5            pypi_0           pypi
pip                       21.2.2           py37h06a4308_0
py-cpuinfo                8.0.0            pypi_0           pypi
pymongo                   4.0.1            pypi_0           pypi
pyparsing                 3.0.7            pypi_0           pypi
python                    3.7.0            h6e4f718_3
python-dateutil           2.8.2            pypi_0           pypi
pytz                      2021.3           pypi_0           pypi
pyyaml                    6.0              pypi_0           pypi
readline                  7.0              h7b6447c_5
sacred                    0.8.2            pypi_0           pypi
seml                      0.3.6            pypi_0           pypi
setuptools                58.0.4           py37h06a4308_0
six                       1.16.0           pypi_0           pypi
smmap                     5.0.0            pypi_0           pypi
sqlite                    3.33.0           h62c20be_0
tk                        8.6.11           h1ccaba5_0
tqdm                      4.63.0           pypi_0           pypi
typing-extensions         4.1.1            pypi_0           pypi
wheel                     0.37.1           pyhd3eb1b0_0
wrapt                     1.13.3           pypi_0           pypi
xz                        5.2.5            h7b6447c_0
zipp                      3.7.0            pypi_0           pypi
zlib                      1.2.11           h7f8727e_4
```
gasteigerjo commented 2 years ago

Thank you so much for spending the time on this and posting such a great issue description!

So the experiments are only set to KILLED in the MongoDB, while they actually continue running on the cluster?

SEML itself should not kill anything; it only detects whether experiments were killed externally. As you correctly found, this happens in the detect_killed function. In particular, it sounds like get_slurm_arrays_tasks does not actually get all Slurm jobs on your system.

This is rather hard to debug without access to your system and a debugger... Could you maybe print what the running_jobs variable contains after this line? If that is missing the entries you'd expect, we'll have to look into what is going wrong in get_slurm_arrays_tasks.
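
For context, the rough idea of that function is something like the following (just a sketch, not seml's exact code or return format): list the active job IDs via squeue and group the array tasks by their parent array ID. If the squeue call comes back empty, every experiment the database still marks as pending or running looks as if it had been killed externally.

```python
import subprocess
from collections import defaultdict

def active_array_tasks(states="PENDING,RUNNING"):
    # Ask squeue for the IDs of all active jobs; array tasks show up as
    # "68391_5" or, while still pending, as a range like "68391_[10-71]".
    out = subprocess.run(
        f"SLURM_BITSTR_LEN=256 squeue -a -t {states} -h -o %i",
        shell=True, capture_output=True, text=True, check=True,
    ).stdout
    tasks = defaultdict(list)
    for job_id in out.split():
        if "_" in job_id:
            array_id, task_spec = job_id.split("_", 1)
            tasks[array_id].append(task_spec)
    return dict(tasks)
```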

heborras commented 2 years ago

Thank you for the reply!

So the experiments are only set to KILLED in the MongoDB, while they actually continue running on the cluster?

Yes, that is exactly what I'm observing.

This is rather hard to debug without access to your system and a debugger... Could you maybe print what the running_jobs variable contains after this line? If that is missing the entries you'd expect, we'll have to look into what is going wrong in get_slurm_arrays_tasks.

So while jobs are active the function appears to return an empty dictionary, like this:

(seml) [hborras@ceg-octane ~]$ python
Python 3.7.0 (default, Oct  9 2018, 10:31:47)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from seml.manage import get_slurm_arrays_tasks
>>> get_slurm_arrays_tasks()
{}

So I had a look at the actual squeue command from this line:

>>> from seml.settings import SETTINGS
>>> f"SLURM_BITSTR_LEN=256 squeue -a -t {','.join(SETTINGS.SLURM_STATES.ACTIVE)} -h -o %i"
'SLURM_BITSTR_LEN=256 squeue -a -t PENDING,CONFIGURING,REQUEUE_FED,REQUEUE_HOLD,REQUEUED,RESIZING,RUNNING,SIGNALING,STOPPED,SUSPENDED,SPECIAL_EXIT -h -o %i'

And executing this gave me:

[hborras@ceg-octane ~]$ SLURM_BITSTR_LEN=256 squeue -a -t PENDING,CONFIGURING,REQUEUE_FED,REQUEUE_HOLD,REQUEUED,RESIZING,RUNNING,SIGNALING,STOPPED,SUSPENDED,SPECIAL_EXIT -h -o %i
squeue: error: Invalid job state specified: SIGNALING
squeue: error: Valid job states include: PENDING,RUNNING,SUSPENDED,COMPLETED,CANCELLED,FAILED,TIMEOUT,NODE_FAIL,PREEMPTED,BOOT_FAIL,DEADLINE,OUT_OF_MEMORY,COMPLETING,CONFIGURING,RESIZING,REVOKED,SPECIAL_EXIT

Removing the SIGNALING state then gives me:

[hborras@ceg-octane ~]$ SLURM_BITSTR_LEN=256 squeue -a -t PENDING,CONFIGURING,REQUEUE_FED,REQUEUE_HOLD,REQUEUED,RESIZING,RUNNING,STOPPED,SUSPENDED,SPECIAL_EXIT -h -o %i
69047
69052
69066
69110_0
69110_1
69110_2
69110_3
69110_4
69110_5
69110_6
69110_7
69110_8
69110_9
69110_10
69110_11

This now appears to work as expected. Does this output look okay to you as well? I'm not entirely sure what exactly the Python code expects here.

What's also a bit strange is that squeue doesn't appear to return a non-zero exit code, even though it clearly errored out; otherwise the check=True argument should have caught that in the run command. Maybe some additional error checking would be useful?
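
Something along these lines is what I had in mind (just a sketch, not seml's actual code):

```python
import subprocess

def run_squeue(cmd):
    # check=True only catches a non-zero exit code; our squeue apparently
    # prints "squeue: error: ..." but still exits with 0, so additionally
    # treat any stderr output as a failure.
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, check=True)
    if result.stderr.strip():
        raise RuntimeError(f"squeue reported an error: {result.stderr.strip()}")
    return result.stdout.split()
```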

gasteigerjo commented 2 years ago

That's an easy fix then, since the Slurm states are defined in the settings. You just need to adjust the list of active states, either in the repo's settings here or in the user settings (which are expected in ~/.config/seml/settings.py by default).
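
For example, a user settings file along these lines should do it (a sketch only; how exactly the user settings are merged over the defaults may depend on the seml version, so please double-check):

```python
# ~/.config/seml/settings.py  (sketch: assumes seml merges a SETTINGS object
# defined here over its built-in defaults)
from munch import munchify

SETTINGS = munchify({
    'SLURM_STATES': {
        'ACTIVE': [
            # seml's default list minus SIGNALING, which this Slurm version rejects
            'PENDING', 'CONFIGURING', 'REQUEUE_FED', 'REQUEUE_HOLD',
            'REQUEUED', 'RESIZING', 'RUNNING', 'STOPPED',
            'SUSPENDED', 'SPECIAL_EXIT',
        ],
    },
})
```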

This is most likely due to the Slurm version on your cluster. Can you send me the output of sinfo --version?

But I agree that a more instructive error message would be good here if this is expected to fail depending on the Slurm version.

heborras commented 2 years ago

That's an easy fix then, since the Slurm states are defined in the settings. You just need to adjust the list of active states, either in the repo's settings here or in the user settings (which are expected in ~/.config/seml/settings.py by default).

Nice, I'll edit the config file then, since I'm using it for the Neptune token anyways :)

This is most likely due to the Slurm version on your cluster. Can you send me the output of sinfo --version?

The slurm version is:

[hborras@ceg-octane ~]$ sinfo --version
slurm 17.11.2

But I agree that a more instructive error message would be good here if this is expected to fail depending on the Slurm version.

Yeah, that would be really nice.

heborras commented 2 years ago

We recently updated Slurm to version 21.08.6 on our cluster, and now everything appears to work fine, even without the workaround. So, if you like, you can close the issue.