Open cdunn2001 opened 7 years ago
I found the reason for the slow-down of fs_based
on my Mac. This line is slow:
hostname = socket.getfqdn()
In fs_based
, that is purely informational, and it takes a few seconds on my Mac. I'll remove it...
Actually, I was able to make that fast by running scutil --set HostName $(scutil --get LocalHostName)
, as mentioned here, but that did not solve the problem with getaddrinfo()
I think the slow-down in network_based
is mostly from waiting on read/write buffers for logging, where we send stderr/stdout over the network.
If that's the reason, then it's a fair price. Flowers found a worse slow-down when many remote jobs write a lot of debug output into the filesystem simultaneously. (Still, I would rather solve that be logging less.)
But if getaddrinfo()
is called repeatedly, that could be the cause instead. We need to try logging direct to disk, just so we know for sure.
This made network_based
take the same amount of time as the other pwatchers:
diff --git pwatcher/mains/network_heartbeat.py pwatcher/mains/network_heartbeat.py
index 40bee48..821f010 100644
--- pwatcher/mains/network_heartbeat.py
+++ pwatcher/mains/network_heartbeat.py
@@ -140,18 +140,19 @@ sleep_s={sleep_s!r}""".format(
call = ' '.join(args.command)
log(heartbeat_server, jobid, 'In cwd: {}, Blocking call: {!r}'.format(
os.getcwd(), call))
- sp = subprocess.Popen(shlex.split(call), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+ #sp = subprocess.Popen(shlex.split(call), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+ sp = subprocess.Popen(shlex.split(call))
# forward all output to server until job ends, then get exit value
- with sp.stdout as f:
- for line in iter(f.readline, b''):
- # can't use log() for this because it appends a \n
It could be the repeated socket connections, or it could be the log-sending.
The slow-down is only 20% with this:
diff --git pwatcher/mains/network_heartbeat.py pwatcher/mains/network_heartbeat.py
index 40bee48..0eab693 100644
--- pwatcher/mains/network_heartbeat.py
+++ pwatcher/mains/network_heartbeat.py
@@ -141,17 +141,17 @@ sleep_s={sleep_s!r}""".format(
log(heartbeat_server, jobid, 'In cwd: {}, Blocking call: {!r}'.format(
os.getcwd(), call))
sp = subprocess.Popen(shlex.split(call), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+ hsocket = socket.socket()
+ hsocket.connect(heartbeat_server)
# forward all output to server until job ends, then get exit value
with sp.stdout as f:
for line in iter(f.readline, b''):
# can't use log() for this because it appends a \n
- hsocket = socket.socket()
socket_send(hsocket, 's {} {}'.format(jobid, line))
- hsocket.close()
except IOError: # better to miss a line than terminate
pass
+ hsocket.close()
rc = sp.wait()
But that change causes only a few lines of logging to be sent. Much of stdout is lost. What is my mistake there? And could some stdout be dropped even without it?
On synth5k, the
blocking
pwatcher withjob_type=local
andjob_queue=bash -C ${CMD}
takes about 20secs;fs_based
takes 1.5mins;network_based
takes 3mins.