jolange closed this 2 years ago
On a related note, it would be nice to also have the name in the worker's output (to stderr), so you can identify it in the log files:
2022-08-08 13:40:37,987 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.0.3:44283'
2022-08-08 13:40:39,274 - distributed.worker - INFO - Start worker at: tcp://172.18.0.3:42825
2022-08-08 13:40:39,275 - distributed.worker - INFO - Listening to: tcp://172.18.0.3:42825
2022-08-08 13:40:39,275 - distributed.worker - INFO - dashboard at: 172.18.0.3:38721
2022-08-08 13:40:39,276 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.0.5:36311
2022-08-08 13:40:39,276 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,277 - distributed.worker - INFO - Threads: 1
2022-08-08 13:40:39,277 - distributed.worker - INFO - Memory: 100.00 MiB
2022-08-08 13:40:39,278 - distributed.worker - INFO - Local Directory: /var/lib/condor/execute/dir_85/dask-worker-space/worker-wfg9pxc7
2022-08-08 13:40:39,278 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,285 - distributed.worker - INFO - Registered to: tcp://172.18.0.5:36311
2022-08-08 13:40:39,285 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,286 - distributed.core - INFO - Starting established connection
I think I might propose to add it to distributed here.
2022-08-08 13:40:37,987 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.0.3:44283'
2022-08-08 13:40:39,274 - distributed.worker - INFO - Start worker at: tcp://172.18.0.3:42825
2022-08-08 13:40:39,274 - distributed.worker - INFO - Worker name: HTCondorCluster-0
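For context, here is a minimal sketch (not dask-jobqueue's actual code; the scheduler address and worker name are copied from the log excerpt above purely for illustration) of starting a worker with an explicit name on the distributed side. The proposed log line would simply echo this name; dask-jobqueue normally passes it for you via the generated worker command.

```python
import asyncio
from distributed import Worker

async def main():
    # name= is an existing Worker keyword argument; the address and the name
    # below are taken from the log excerpt above, purely for illustration
    async with Worker(
        "tcp://172.18.0.5:36311",
        name="HTCondorCluster-0",
        nthreads=1,
    ):
        await asyncio.sleep(3600)  # keep the worker alive for a while

asyncio.run(main())
```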
Sure, I just added this!
One of the security tests is still a bit random; anyway, it's safe to merge, thanks @jolange!
Thanks for merging!
@guillaumeeb actually, I think the "CI / build (none) (pull_request)" part failed for every single push that I did here. It only succeeded when you re-triggered it. This looks almost like a systematic problem, although I did not look into what happens exactly (because I saw "timeout" and also thought it was a random effect).
Yep, I see this failure about 3 times out of 4... It needs a fix.
This makes it easier to identify the workers in the output of condor_q, condor_history, ... for debugging.
Before, the batch name was automatically set by HTCondor to the job id (or rather the ClusterId in HTCondor language):
I think the name of the Dask worker is really useful to have here:
So if you identify a worker (e.g. with problems) with cluster.workers or using the dashboard, it is much easier to find the right job in the batch system.
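On the Python side, a rough sketch of how this helps (the resource values are placeholders and the exact worker-name format may vary between dask-jobqueue versions):

```python
from dask_jobqueue import HTCondorCluster

# placeholder resources, just enough to start a couple of workers
cluster = HTCondorCluster(cores=1, memory="100 MiB", disk="100 MiB")
cluster.scale(2)

# the keys of cluster.workers identify the individual worker jobs;
# with this change the worker name is also used as the HTCondor batch name,
# so the matching job can be found with e.g. `condor_q -batch`
print(list(cluster.workers))
```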