dask / dask-jobqueue

Deploy Dask on job schedulers like PBS, SLURM, and SGE
https://jobqueue.dask.org
BSD 3-Clause "New" or "Revised" License

htcondor: add batch_name to match the name of the Dask worker #571

Closed: jolange closed this 2 years ago

jolange commented 2 years ago

This makes it easier to identify the workers in the output of condor_q, condor_history, ... for debugging.
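For example, a minimal cluster like the sketch below (the resource values are just placeholders) submits one job per worker, and those worker names are what this change surfaces in condor_q:

from dask_jobqueue import HTCondorCluster

# Minimal illustrative setup; the resource values are placeholders.
cluster = HTCondorCluster(cores=1, memory="100 MiB", disk="100 MiB")
cluster.scale(2)  # submits jobs for workers named HTCondorCluster-0, HTCondorCluster-1, ...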

Before, HTCondor automatically set the batch name to the job id (or rather the ClusterId, in HTCondor terms):

OWNER   BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
jolange ID: 21671522   8/9  09:40      _      1      _      _      1 21671522.0

I think the name of the Dask worker is really useful to have here:

OWNER   BATCH_NAME           SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
jolange HTCondorCluster-1   8/9  10:17      _      1      _      _      1 21671526.0
jolange HTCondorCluster-0   8/9  10:17      _      1      _      _      1 21671527.0
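For context, this corresponds to HTCondor's batch_name submit command. A hand-written sketch of a submit description (not the actual script generated by dask-jobqueue; the executable name is hypothetical) would look like:

# sketch of a submit description with an explicit batch name
executable = run_dask_worker.sh
batch_name = HTCondorCluster-0
queue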

So if you identify a worker (e.g. one with problems) via cluster.workers or the dashboard, it is much easier to find the corresponding job in the batch system, as sketched below.
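A hedged sketch of that lookup, assuming cluster.workers maps worker names to Job objects that carry the batch system's job id as job_id:

# Map Dask worker names to HTCondor job ids for cross-referencing with condor_q.
for name, job in cluster.workers.items():
    print(name, job.job_id)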

jolange commented 2 years ago

On a related note, it would be nice to also have the name in the worker's output (to stderr), so you can identify it in the log files:

2022-08-08 13:40:37,987 - distributed.nanny - INFO -         Start Nanny at: 'tcp://172.18.0.3:44283'
2022-08-08 13:40:39,274 - distributed.worker - INFO -       Start worker at:     tcp://172.18.0.3:42825
2022-08-08 13:40:39,275 - distributed.worker - INFO -          Listening to:     tcp://172.18.0.3:42825
2022-08-08 13:40:39,275 - distributed.worker - INFO -          dashboard at:           172.18.0.3:38721
2022-08-08 13:40:39,276 - distributed.worker - INFO - Waiting to connect to:     tcp://172.18.0.5:36311
2022-08-08 13:40:39,276 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,277 - distributed.worker - INFO -               Threads:                          1
2022-08-08 13:40:39,277 - distributed.worker - INFO -                Memory:                 100.00 MiB
2022-08-08 13:40:39,278 - distributed.worker - INFO -       Local Directory: /var/lib/condor/execute/dir_85/dask-worker-space/worker-wfg9pxc7
2022-08-08 13:40:39,278 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,285 - distributed.worker - INFO -         Registered to:     tcp://172.18.0.5:36311
2022-08-08 13:40:39,285 - distributed.worker - INFO - -------------------------------------------------
2022-08-08 13:40:39,286 - distributed.core - INFO - Starting established connection

I think I might propose adding it to distributed, so the startup output would include the worker name:

2022-08-08 13:40:37,987 - distributed.nanny - INFO -         Start Nanny at: 'tcp://172.18.0.3:44283'
2022-08-08 13:40:39,274 - distributed.worker - INFO -       Start worker at:     tcp://172.18.0.3:42825
2022-08-08 13:40:39,274 - distributed.worker - INFO -           Worker name:          HTCondorCluster-0
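A rough sketch of what such an addition to the worker startup banner in distributed could look like (hypothetical, not the actual upstream code; the format width is assumed to match the existing banner lines):

# In distributed's worker startup logging, alongside the existing banner lines:
logger.info("          Worker name: %26s", self.name)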
jolange commented 2 years ago

Sure, I just added this!

guillaumeeb commented 2 years ago

One of the security tests is still a bit flaky, but anyway, it's safe to merge, thanks @jolange!

jolange commented 2 years ago

Thanks for merging!

@guillaumeeb actually, I think the "CI / build (none) (pull_request)" check failed for every single push I made here; it only succeeded when you re-triggered it. That looks almost like a systematic problem, although I did not look into what exactly happens (I saw "timeout" and also assumed a random effect).

guillaumeeb commented 2 years ago

Yep, I see this failure about 3 times out of 4... It needs a fix.