dask / dask-jobqueue

Deploy Dask on job schedulers like PBS, SLURM, and SGE
https://jobqueue.dask.org
BSD 3-Clause "New" or "Revised" License

Fix command template for empty `env_extra` in HTCondor #570

Closed jolange closed 2 years ago

jolange commented 2 years ago

Because the config is read with `default=[]`, `env_extra` does not stay `None` but becomes an empty list. This resulted in a command template starting with a stray semicolon.

Merging `env_extra` and `_command_template` into a single list first avoids this.
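A minimal sketch of the issue (simplified stand-in names, not the actual dask-jobqueue internals): joining the extra environment commands and then unconditionally prepending them produces a leading semicolon when `env_extra` is the empty-list default, whereas merging everything into one list before joining does not.

```python
# env_extra comes from the config with default=[], so it is an empty
# list rather than None.
env_extra = []
command_template = "dask-worker --nthreads 1"

# Buggy approach: join env_extra separately, then glue it in front of
# the command. With an empty list this yields "; dask-worker ...".
buggy = "; ".join(env_extra) + "; " + command_template

# Fix in the spirit of this PR: merge into a single list first, then
# join once. An empty env_extra contributes nothing.
fixed = "; ".join(env_extra + [command_template])

print(buggy)  # starts with a stray "; "
print(fixed)  # clean: "dask-worker --nthreads 1"
```

With a non-empty `env_extra` (e.g. `["export FOO=bar"]`) both variants produce the same string; the difference only shows up for the empty default.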

Possibly related to #568, cc @guillaumeeb

jolange commented 2 years ago

Now I am confused:

jolange commented 2 years ago

This is really shaky: after adding debug output (f03f25768918528517701d0fb8fe38803e386c4d), the CI for HTCondor was successful, but it failed again later.

Now, this is an example stderr of a worker (from this run):

2022-08-08 13:40:37,987 - distributed.nanny - INFO -         Start Nanny at: 'tcp://172.18.0.3:44283'
2022-08-08 13:40:39,274 - distributed.worker - INFO -       Start worker at:     tcp://172.18.0.3:42825
2022-08-08 13:40:39,275 - distributed.worker - INFO -          Listening to:     tcp://172.18.0.3:42825
2022-08-08 13:40:39,275 - distributed.worker - INFO -          dashboard at:           172.18.0.3:38721
2022-08-08 13:40:39,276 - distributed.worker - INFO - Waiting to connect to:     tcp://172.18.0.5:36311
2022-08-08 13:40:39,276 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,277 - distributed.worker - INFO -               Threads:                          1
2022-08-08 13:40:39,277 - distributed.worker - INFO -                Memory:                 100.00 MiB
2022-08-08 13:40:39,278 - distributed.worker - INFO -       Local Directory: /var/lib/condor/execute/dir_85/dask-worker-space/worker-wfg9pxc7
2022-08-08 13:40:39,278 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,285 - distributed.worker - INFO -         Registered to:     tcp://172.18.0.5:36311
2022-08-08 13:40:39,285 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,286 - distributed.core - INFO - Starting established connection
2022-08-08 13:40:39,422 - distributed.worker_memory - WARNING - Worker tcp://172.18.0.3:42825 (pid=93) exceeded 95% memory budget. Restarting...
2022-08-08 13:40:39,430 - distributed.nanny - INFO - Worker process 93 was killed by signal 15
2022-08-08 13:40:39,435 - distributed.nanny - WARNING - Restarting worker
2022-08-08 13:40:40,714 - distributed.worker - INFO -       Start worker at:     tcp://172.18.0.3:45541
2022-08-08 13:40:40,714 - distributed.worker - INFO -          Listening to:     tcp://172.18.0.3:45541
2022-08-08 13:40:40,714 - distributed.worker - INFO -          dashboard at:           172.18.0.3:33933
2022-08-08 13:40:40,714 - distributed.worker - INFO - Waiting to connect to:     tcp://172.18.0.5:36311
2022-08-08 13:40:40,714 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:40,714 - distributed.worker - INFO -               Threads:                          1
2022-08-08 13:40:40,714 - distributed.worker - INFO -                Memory:                 100.00 MiB
2022-08-08 13:40:40,714 - distributed.worker - INFO -       Local Directory: /var/lib/condor/execute/dir_85/dask-worker-space/worker-mgz6j_u2
2022-08-08 13:40:40,714 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:40,723 - distributed.worker_memory - WARNING - Worker tcp://172.18.0.3:42825 (pid=102) exceeded 95% memory budget. Restarting...
2022-08-08 13:40:40,737 - distributed.nanny - INFO - Worker process 102 was killed by signal 15
2022-08-08 13:40:40,740 - distributed.nanny - WARNING - Restarting worker
2022-08-08 13:40:40,771 - distributed._signals - INFO - Received signal SIGTERM (15)
2022-08-08 13:40:40,772 - distributed.nanny - INFO - Closing Nanny at 'tcp://172.18.0.3:44283'.
2022-08-08 13:40:40,772 - distributed.nanny - INFO - Nanny asking worker to close

@guillaumeeb Is the problem simply `WARNING - Worker tcp://172.18.0.3:42825 (pid=93) exceeded 95% memory budget. Restarting...`? The Slurm, SGE, PBS, ... tests use 2GB for this. I'll try that now.

guillaumeeb commented 2 years ago

@jolange I think you're onto something!! It looks like 100MiB is not enough to run a Dask worker!

However, I believe the default Condor setup only has 1GB available on each condor worker node, so you should use a number lower than that. Maybe try 500MiB to be safe?
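For reference, the worker memory for these jobs can be set through the usual dask-jobqueue config; a hypothetical snippet reflecting the values discussed here (100MiB was too small for a worker, 2GiB exceeded what the condor node offers, ~500MiB is the middle ground) might look like:

```yaml
# Sketch of a dask-jobqueue config for the HTCondor tests; the disk
# value is an illustrative assumption, not taken from the CI setup.
jobqueue:
  htcondor:
    cores: 1
    memory: "500MiB"
    disk: "100MiB"
```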

jolange commented 2 years ago

Ah, thanks, I was just trying to find out how much memory is available. With 2GiB the job did not start to run, so that seemed to be too much ;-) I'm trying 500MiB now.

jolange commented 2 years ago

With 500MiB it worked without the warning in stderr, and I also had a successful CI run for HTCondor. Still, the last run resulted in a timeout again, but that also happens for "CI / build (none)" with the LocalCluster from time to time.

guillaumeeb commented 2 years ago

Just tried a complementary fix on your branch, I hope that's okay. The second test was probably fragile too, because it also used only 100MiB for worker jobs. If that test fails and workers are not cleaned up, other tests will fail.

guillaumeeb commented 2 years ago

Okay, HTCondor CI is green, nice :clap:. Thanks a lot @jolange!

I will just make another commit here to re-add some of the debug tricks you used; it could be useful later on to have worker logs again!

jolange commented 2 years ago

Nice, thanks!