I noticed that my ad-hoc cluster of 2 nodes, which I configured with 2 workers, had 4 workers connecting according to the debug_log in runinfo.
Duplicate connections from the same IP are being made (in the excerpt below, three from 67.58.56.46 and one from 67.58.56.48):
2021/10/21 20:58:43.91 work_queue_python[47302] tcp: got connection from 67.58.56.46 port 32904
2021/10/21 20:58:43.91 work_queue_python[47302] wq: worker 67.58.56.46:32904 connected
2021/10/21 20:58:43.91 work_queue_python[47302] tcp: got connection from 67.58.56.48 port 35604
2021/10/21 20:58:43.91 work_queue_python[47302] wq: worker 67.58.56.48:35604 connected
2021/10/21 20:58:43.91 work_queue_python[47302] tcp: got connection from 67.58.56.46 port 32906
2021/10/21 20:58:43.91 work_queue_python[47302] wq: worker 67.58.56.46:32906 connected
2021/10/21 20:58:43.91 work_queue_python[47302] tcp: got connection from 67.58.56.46 port 32908
2021/10/21 20:58:43.91 work_queue_python[47302] wq: worker 67.58.56.46:32908 connected
2021/10/21 20:58:43.91 work_queue_python[47302] wq: rx from unknown (67.58.56.46:32904): workqueue 9 discworld.crbs.ucsd.edu Linux x86_64 7.3.3
2021/10/21 20:58:43.91 work_queue_python[47302] wq: 1 workers are connected in total now
2021/10/21 20:58:43.91 work_queue_python[47302] wq: discworld.crbs.ucsd.edu (67.58.56.46:32904) running CCTools version 7.3.3 on Linux (operating system) with architecture x86_64 is ready
2021/10/21 20:58:43.91 work_queue_python[47302] wq: rx from unknown (67.58.56.48:35604): workqueue 9 ridcully.crbs.ucsd.edu Linux x86_64 7.3.3
2021/10/21 20:58:43.91 work_queue_python[47302] wq: 2 workers are connected in total now
2021/10/21 20:58:43.91 work_queue_python[47302] wq: ridcully.crbs.ucsd.edu (67.58.56.48:35604) running CCTools version 7.3.3 on Linux (operating system) with architecture x86_64 is ready
2021/10/21 20:58:43.91 work_queue_python[47302] wq: rx from unknown (67.58.56.46:32908): workqueue 9 discworld.crbs.ucsd.edu Linux x86_64 7.3.3
2021/10/21 20:58:43.91 work_queue_python[47302] wq: 3 workers are connected in total now
2021/10/21 20:58:43.91 work_queue_python[47302] wq: discworld.crbs.ucsd.edu (67.58.56.46:32908) running CCTools version 7.3.3 on Linux (operating system) with architecture x86_64 is ready
2021/10/21 20:58:43.91 work_queue_python[47302] wq: rx from unknown (67.58.56.46:32906): workqueue 9 discworld.crbs.ucsd.edu Linux x86_64 7.3.3
2021/10/21 20:58:43.91 work_queue_python[47302] wq: 4 workers are connected in total now
I believe this is caused by old configurations of my WorkQueueExecutor not properly terminating the worker instances on the execution side, leaving those workers to linger and reconnect to a new instance.
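To check for leftover workers, the connections in the debug log can be tallied per IP. The following is a minimal sketch, not part of Parsl or CCTools: the helper name `count_worker_connections` is hypothetical, and it assumes the `wq: worker IP:PORT connected` line format shown in the log above. The sample lines are copied from that log.

```python
import re
from collections import Counter

# Matches lines such as "... wq: worker 67.58.56.46:32904 connected"
# (format assumed from the debug log excerpt in this report).
WORKER_CONNECTED = re.compile(r"wq: worker (\d+\.\d+\.\d+\.\d+):(\d+) connected")


def count_worker_connections(log_lines):
    """Return a Counter mapping worker IP to its number of 'connected' lines."""
    counts = Counter()
    for line in log_lines:
        match = WORKER_CONNECTED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines taken from the log excerpt above.
sample = [
    "2021/10/21 20:58:43.91 work_queue_python[47302] wq: worker 67.58.56.46:32904 connected",
    "2021/10/21 20:58:43.91 work_queue_python[47302] wq: worker 67.58.56.48:35604 connected",
    "2021/10/21 20:58:43.91 work_queue_python[47302] wq: worker 67.58.56.46:32906 connected",
    "2021/10/21 20:58:43.91 work_queue_python[47302] wq: worker 67.58.56.46:32908 connected",
    "2021/10/21 20:58:43.91 work_queue_python[47302] wq: 4 workers are connected in total now",
]

counts = count_worker_connections(sample)
for ip, n in counts.most_common():
    print(f"{ip}\t{n}")
```

Any IP with more connections than the number of workers configured for that node is a candidate for a stale worker process.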
As a temporary fix, I wrote a small bash utility that can be run on the ad-hoc nodes to kill any process that connected to the WorkQueue master, by scanning the Parsl script dir:
grep 'WORKER_IP:' *.sh.out | sed 's/^.*: //' | grep -o '.....$' | xargs -L1 echo | awk '{print $1"/tcp"}' | xargs -L1 sudo fuser -k
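Before killing anything, it may help to preview which ports the pipeline above would pass to `fuser`. The following is a rough dry-run sketch of the same text steps. The function name `fuser_targets` is hypothetical, and so is the sample input: the real format of the `WORKER_IP:` lines in the `*.sh.out` files is not shown here, so the sketch only assumes that each such line ends in the port number. Unlike the `grep -o '.....$'` step, it takes the trailing digits rather than exactly five characters.

```python
import re

# Trailing run of up to five digits, assumed to be the worker's port.
TRAILING_PORT = re.compile(r"(\d{1,5})\s*$")


def fuser_targets(lines):
    """Return 'PORT/tcp' strings for every line that mentions WORKER_IP:."""
    targets = []
    for line in lines:
        if "WORKER_IP:" not in line:
            continue  # same filter as the grep 'WORKER_IP:' step
        match = TRAILING_PORT.search(line)
        if match:
            targets.append(f"{match.group(1)}/tcp")
    return targets


# Hypothetical lines for illustration only; not taken from a real *.sh.out file.
sample_lines = [
    "WORKER_IP: 67.58.56.46:32904",
    "some unrelated output 12345",
    "WORKER_IP: 67.58.56.48:35604",
]

targets = fuser_targets(sample_lines)
print("\n".join(targets))
```

The printed list can be reviewed by hand before running `sudo fuser -k` on each entry.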