PacificBiosciences / pypeFLOW

a simple lightweight workflow engine for data analysis scripting
BSD 3-Clause Clear License

network_based slow for job_type=local #38

Open cdunn2001 opened 7 years ago

cdunn2001 commented 7 years ago

On synth5k, the blocking pwatcher with job_type=local and job_queue=bash -C ${CMD} takes about 20 seconds; fs_based takes about 1.5 minutes; network_based takes about 3 minutes.

cdunn2001 commented 7 years ago

I found the reason for the slow-down of fs_based on my Mac. This line is slow:

hostname = socket.getfqdn()

In fs_based, that is purely informational, and it takes a few seconds on my Mac. I'll remove it...

Actually, I was able to make that call fast by running scutil --set HostName $(scutil --get LocalHostName), as mentioned here, but that did not solve the problem with getaddrinfo().
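
For anyone else debugging this, a quick timing check (just a diagnostic sketch, not pwatcher code) makes the stall obvious:

import socket
import time

# getfqdn() can trigger a reverse-DNS lookup via getaddrinfo(); on a
# misconfigured host that lookup blocks until the resolver times out.
t0 = time.time()
name = socket.getfqdn()
print('getfqdn() -> {!r} in {:.2f}s'.format(name, time.time() - t0))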

cdunn2001 commented 7 years ago

I think the slow-down in network_based is mostly from waiting on read/write buffers for logging, where we send stderr/stdout over the network.

If that's the reason, then it's a fair price. Flowers found a worse slow-down when many remote jobs write a lot of debug output into the filesystem simultaneously. (Still, I would rather solve that by logging less.)

But if getaddrinfo() is called repeatedly, that could be the cause instead. We need to try logging directly to disk, just so we know for sure.
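
A simple way to test that hypothesis (a hypothetical sketch; the filename and command here are placeholders) is to hand the child a local file for stdout/stderr and skip the network path entirely:

import shlex
import subprocess

call = 'ls -l'  # placeholder for the real job command
# Redirect the child's output straight to a local file, so no bytes ever
# cross the socket; if this is fast, the network logging is the bottleneck.
with open('job.stdout', 'wb') as out:
    sp = subprocess.Popen(shlex.split(call), stdout=out, stderr=subprocess.STDOUT)
    rc = sp.wait()
print('exit: {}'.format(rc))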

cdunn2001 commented 7 years ago

This made network_based take the same amount of time as the other pwatchers:

diff --git pwatcher/mains/network_heartbeat.py pwatcher/mains/network_heartbeat.py
index 40bee48..821f010 100644
--- pwatcher/mains/network_heartbeat.py
+++ pwatcher/mains/network_heartbeat.py
@@ -140,18 +140,19 @@ sleep_s={sleep_s!r}""".format(
     call = ' '.join(args.command)
     log(heartbeat_server, jobid, 'In cwd: {}, Blocking call: {!r}'.format(
         os.getcwd(), call))
-    sp = subprocess.Popen(shlex.split(call), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+    #sp = subprocess.Popen(shlex.split(call), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+    sp = subprocess.Popen(shlex.split(call))
     # forward all output to server until job ends, then get exit value
-    with sp.stdout as f:
-        for line in iter(f.readline, b''):
-            # can't use log() for this because it appends a \n
-            hsocket = socket.socket()
-            try:
-                hsocket.connect(heartbeat_server)
-                socket_send(hsocket, 's {} {}'.format(jobid, line))
-                hsocket.close()
-            except IOError:            # better to miss a line than terminate
-                pass
     rc = sp.wait()

It could be the repeated socket connections, or it could be the log-sending.
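
To separate those two suspects, one could keep the pipe and the line-by-line loop but write each line to a local file instead of opening a socket (a sketch under that assumption; 'local.log' and the command are placeholders). If this variant is still slow, the per-line pipe handling is the cost; if it is fast, the repeated connections are:

import shlex
import subprocess

call = 'ls -l'  # placeholder for the real job command
sp = subprocess.Popen(shlex.split(call), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
with open('local.log', 'wb') as log_f:
    with sp.stdout as f:
        for line in iter(f.readline, b''):
            log_f.write(line)  # same loop shape, but no socket per line
rc = sp.wait()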

cdunn2001 commented 7 years ago

The slow-down is only 20% with this:

diff --git pwatcher/mains/network_heartbeat.py pwatcher/mains/network_heartbeat.py
index 40bee48..0eab693 100644
--- pwatcher/mains/network_heartbeat.py
+++ pwatcher/mains/network_heartbeat.py
@@ -141,17 +141,17 @@ sleep_s={sleep_s!r}""".format(
     log(heartbeat_server, jobid, 'In cwd: {}, Blocking call: {!r}'.format(
         os.getcwd(), call))
     sp = subprocess.Popen(shlex.split(call), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+    hsocket = socket.socket()
+    hsocket.connect(heartbeat_server)
     # forward all output to server until job ends, then get exit value
     with sp.stdout as f:
         for line in iter(f.readline, b''):
             # can't use log() for this because it appends a \n
-            hsocket = socket.socket()
             try:
-                hsocket.connect(heartbeat_server)
                 socket_send(hsocket, 's {} {}'.format(jobid, line))
-                hsocket.close()
             except IOError:            # better to miss a line than terminate
                 pass
+    hsocket.close()
     rc = sp.wait()

But that change causes only a few lines of logging to be sent. Much of stdout is lost. What is my mistake there? And could some stdout be dropped even without it?
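
One possible explanation: if the heartbeat server treats each accepted connection as one message (reading until the client closes), then after the first send on a long-lived connection the server may close its end, every later socket_send() raises IOError, and the except clause silently drops the line. That would also mean yes, lines can be dropped even without this change, any time a send fails. Here is a hedged sketch of one way to keep a single connection, assuming the server were changed to split the stream on newlines (this is not pwatcher's actual protocol):

import socket

def send_record(sock, jobid, line):
    # Hypothetical framing: terminate each record with b'\n' so the server
    # can split the byte stream back into messages. sendall() also retries
    # partial writes, which a bare send() would not.
    payload = 's {} {}'.format(jobid, line.rstrip('\n'))
    sock.sendall(payload.encode('ascii') + b'\n')

# Usage, assuming heartbeat_server is a (host, port) tuple as in pwatcher:
# hsocket = socket.socket()
# hsocket.connect(heartbeat_server)
# send_record(hsocket, jobid, line)
# hsocket.close()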