Now I am confused:
`<Client: No scheduler connected>`
sounds suspicious. This is really shaky: after adding debug output (f03f25768918528517701d0fb8fe38803e386c4d), the CI for HTCondor was successful, but failed again later.
Now, this is an example stderr of a worker (from this run):
2022-08-08 13:40:37,987 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.0.3:44283'
2022-08-08 13:40:39,274 - distributed.worker - INFO - Start worker at: tcp://172.18.0.3:42825
2022-08-08 13:40:39,275 - distributed.worker - INFO - Listening to: tcp://172.18.0.3:42825
2022-08-08 13:40:39,275 - distributed.worker - INFO - dashboard at: 172.18.0.3:38721
2022-08-08 13:40:39,276 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.0.5:36311
2022-08-08 13:40:39,276 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,277 - distributed.worker - INFO - Threads: 1
2022-08-08 13:40:39,277 - distributed.worker - INFO - Memory: 100.00 MiB
2022-08-08 13:40:39,278 - distributed.worker - INFO - Local Directory: /var/lib/condor/execute/dir_85/dask-worker-space/worker-wfg9pxc7
2022-08-08 13:40:39,278 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,285 - distributed.worker - INFO - Registered to: tcp://172.18.0.5:36311
2022-08-08 13:40:39,285 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,286 - distributed.core - INFO - Starting established connection
2022-08-08 13:40:39,422 - distributed.worker_memory - WARNING - Worker tcp://172.18.0.3:42825 (pid=93) exceeded 95% memory budget. Restarting...
2022-08-08 13:40:39,430 - distributed.nanny - INFO - Worker process 93 was killed by signal 15
2022-08-08 13:40:39,435 - distributed.nanny - WARNING - Restarting worker
2022-08-08 13:40:40,714 - distributed.worker - INFO - Start worker at: tcp://172.18.0.3:45541
2022-08-08 13:40:40,714 - distributed.worker - INFO - Listening to: tcp://172.18.0.3:45541
2022-08-08 13:40:40,714 - distributed.worker - INFO - dashboard at: 172.18.0.3:33933
2022-08-08 13:40:40,714 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.0.5:36311
2022-08-08 13:40:40,714 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:40,714 - distributed.worker - INFO - Threads: 1
2022-08-08 13:40:40,714 - distributed.worker - INFO - Memory: 100.00 MiB
2022-08-08 13:40:40,714 - distributed.worker - INFO - Local Directory: /var/lib/condor/execute/dir_85/dask-worker-space/worker-mgz6j_u2
2022-08-08 13:40:40,714 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:40,723 - distributed.worker_memory - WARNING - Worker tcp://172.18.0.3:42825 (pid=102) exceeded 95% memory budget. Restarting...
2022-08-08 13:40:40,737 - distributed.nanny - INFO - Worker process 102 was killed by signal 15
2022-08-08 13:40:40,740 - distributed.nanny - WARNING - Restarting worker
2022-08-08 13:40:40,771 - distributed._signals - INFO - Received signal SIGTERM (15)
2022-08-08 13:40:40,772 - distributed.nanny - INFO - Closing Nanny at 'tcp://172.18.0.3:44283'.
2022-08-08 13:40:40,772 - distributed.nanny - INFO - Nanny asking worker to close
@guillaumeeb Is the problem simply `WARNING - Worker tcp://172.18.0.3:42825 (pid=93) exceeded 95% memory budget. Restarting...`? But Slurm, SGE, PBS, ... use 2GB for this test. I'll try that now.
@jolange I think you're onto something!! It looks like 100MiB is not enough for running a Dask Worker!
However, I believe the default Condor setup is only 1GB available on each condor worker node, so you should use a number lower than that. Maybe try with 500MiB to be safe?
Ah, thanks, I was just trying to find out what the available memory could be. With 2GiB the job did not start to run, so that seemed too much ;-) I'm trying with 500MiB now.
With 500MiB it worked without the warning in stderr and I also had a successful CI run for HTCondor. Still, the last run resulted in a timeout again, but that also happens for "CI / build (none)" with the LocalCluster from time to time.
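For reference, a minimal sketch of what the change amounts to (assuming the test creates the cluster via `dask_jobqueue.HTCondorCluster`; the actual CI code may differ):

```python
from dask_jobqueue import HTCondorCluster

# Sketch: request enough memory per worker job so the Dask worker itself
# does not immediately trip the 95% memory budget (100 MiB was too small,
# 2 GiB exceeded what the condor worker node offers, 500 MiB works).
cluster = HTCondorCluster(cores=1, memory="500 MiB", disk="100 MB")
cluster.scale(jobs=1)
```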
Just tried a complementary fix on your branch, I hope that's okay. The second test was probably fragile too, because it also used only 100MiB for worker jobs. If that test fails and workers are not cleaned up, other tests will fail.
Okay, HTCondor CI is green, nice :clap:. Thanks a lot @jolange!
I will just make another commit here to re-add some of the debug tricks you used, it could be nice later on to have worker logs again!
Nice, thanks!
Due to the config read with `default=[]`, `env_extra` will not stay `None` but become an empty list. This resulted in a command template starting with a semicolon. By first merging `env_extra` and `_command_template` into a single list, this is avoided. Possibly related to #568, cc @guillaumeeb
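A minimal sketch of the pattern behind the fix (illustrative only, not the actual dask-jobqueue code; names are simplified):

```python
env_extra = []  # config value read with default=[], so None becomes an empty list

command_template = "python -m distributed.cli.dask_worker tcp://scheduler:8786"

# Old pattern: joining env_extra on its own yields "" for an empty list,
# so the concatenation leaves the command template starting with "; "
old = "; ".join(env_extra) + "; " + command_template
# -> "; python -m distributed.cli.dask_worker tcp://scheduler:8786"

# Fixed pattern: merge env_extra and the command template into one list first,
# so an empty env_extra simply contributes nothing
new = "; ".join(env_extra + [command_template])
# -> "python -m distributed.cli.dask_worker tcp://scheduler:8786"
```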