Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

Workers not connecting to interface when running on HTCondor #2141

Open Dominic-Stafford opened 2 years ago

Dominic-Stafford commented 2 years ago

I use the parsl HighThroughputExecutor to run jobs using HTCondor, which has worked fine for a couple of years. For the past few weeks, however, I've only been able to run jobs with a very small number of workers (<50); if I try to run with more, I seem to get network errors between the interface and the workers. The first indication is the following warning in parsl.log:

parsl.serialize.facade:108 [WARNING]  Serialized object exceeds buffer threshold of 1048576 bytes, this could cause overflows

Then either my jobs run OK for a time but all simultaneously stop, with no further output in parsl.log, or, if the issue occurs before any jobs start, I get the following message in parsl.log:

parsl.executors.high_throughput.zmq_pipes:124 [DEBUG]  Not sending due to full zmq pipe, timeout: 362 ms

The workers then stay in the running state according to condor_q for 300s, then time out due to missed heartbeats.

I think the problem is at least partly due to some change or issue with the cluster I am running on, as it only started recently and without any change to my parsl code, but I'm not sure how to debug this, and our cluster admins are also unsure what the issue is. Could you please point me in the right direction, and let me know if you need any more information about my set-up?

benclifford commented 2 years ago

The workers running inside HTCondor make a couple of network connections per node back to the submitting process (the Python process where you ran parsl.load()).

I have seen some network environments where those connections are treated strangely: for example, on one system I have worked with, this kind of connection would be allowed for only a short period and then silently dropped by the system's network configuration, when connecting to a particular IP address of the submitting system. (That system allowed unlimited connections to a different IP address, and that is how we resolved the issue there.)

If you look inside your HTCondor job definitions, you should see the port numbers and candidate IP addresses on the command line for process_worker_pool, matching the --task_port, --result_port and --addresses parameters; see below for the template used to generate that command line:

            self.launch_cmd = ("process_worker_pool.py {debug} {max_workers} "
                               "-a {addresses} "
                               "-p {prefetch_capacity} "
                               "-c {cores_per_worker} "
                               "-m {mem_per_worker} "
                               "--poll {poll_period} "
                               "--task_port={task_port} "
                               "--result_port={result_port} "
                               "--logdir={logdir} "
                               "--block_id={{block_id}} "
                               "--hb_period={heartbeat_period} "
                               "{address_probe_timeout_string} "
                               "--hb_threshold={heartbeat_threshold} "
                               "--cpu-affinity {cpu_affinity} ")

On the network side, I would check that all the addresses specified in --addresses are valid and that nothing odd is going on there: they should all be addresses of your submitting system that can be reached from all of your worker nodes. If something looks wrong and you need to force a specific address, you can add an address= parameter to the HighThroughputExecutor with the address you want, as in the sketch below.
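For concreteness, here is a minimal sketch (not taken from this issue, with placeholder hostnames, ports and provider settings) of forcing the address on a HighThroughputExecutor; the helpers in parsl.addresses, such as address_by_interface, can also pick an address for you:

    import parsl
    from parsl.config import Config
    from parsl.executors import HighThroughputExecutor
    from parsl.providers import CondorProvider

    # Placeholder values throughout: substitute an address of your submit
    # host that is reachable from the worker nodes, and your own Condor
    # provider settings.
    config = Config(
        executors=[
            HighThroughputExecutor(
                label="htex_condor",
                address="submit-node.example.org",  # address workers connect back to
                worker_port_range=(50000, 51000),   # optional: keep the two ports in a known range
                provider=CondorProvider(
                    nodes_per_block=1,
                    init_blocks=1,
                    max_blocks=10,
                ),
            )
        ]
    )

    parsl.load(config)

Whatever address ends up there should then show up in the -a/--addresses argument in your HTCondor job definitions.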

I would also check, on both the submit and worker side (if possible), with netstat whether there is anything suspicious about the above connections: for example, does one side show a large send buffer? It might also help to understand what is happening on the two TCP connections around the time the connection fails, using tcpdump, although I wouldn't particularly expect you to learn tcpdump just for this.

Dominic-Stafford commented 2 years ago

Thank you for the quick response, and sorry for the delay in replying. I wasn't able to learn a lot from netstat, as everything looked fine on the submit side, though the workers did show a reasonably large send buffer. However, we recently realised that a change to one of the packages we use had caused a fairly large (~4 MB) dictionary to be passed along with every job (hence the "Serialized object exceeds buffer threshold" warning), and removing this seems to have resolved the issue.
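For anyone hitting the same symptom, a rough sketch of the pattern that avoids shipping a large object with every task: load it inside the app, on the worker, so only small arguments are serialized per task. The app name, paths and computation below are invented for illustration.

    from parsl import python_app

    @python_app
    def analyze(input_file, lookup_path="/shared/area/lookup.json"):
        # Imports go inside the app so they run on the worker process.
        import json
        # The large dictionary is read from shared storage on the worker,
        # instead of being serialized and sent through the zmq pipes for
        # every single task.
        with open(lookup_path) as f:
            lookup = json.load(f)
        with open(input_file) as f:
            records = json.load(f)
        # Placeholder computation: count records whose key appears in the lookup.
        return sum(1 for r in records if r.get("key") in lookup)

A quick submit-side check for oversized arguments is len(pickle.dumps(arg)); Parsl's serializer is not exactly pickle, but anything approaching the 1048576-byte threshold mentioned in the warning is a likely culprit.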

benclifford commented 2 years ago

re-opening this: I think that serialisation warning should at least be reported as an error, not a warning, if it really is something fatal like this.