gdevenyi closed 7 years ago
Hrm, I'm not getting the failure mode here. By eye the outputs look the same. Plus now I'm using set equality, though I suspect the set isn't getting constructed correctly by breaking on \n
Blah, splitting didn't fix it :(
Still works locally as well
For whatever reason, one command output is being dropped, e.g:
AssertionError: Chunk 3: Expected echo 20 20
echo 21 21
echo 22 22
echo 23 23
echo 24 24
echo 25 25
echo 26 26
echo 27 27 <<---- this guy is missing
echo 28 28
echo 29 29
but got echo 20 20
echo 21 21
echo 22 22
echo 23 23
echo 24 24
echo 25 25
echo 26 26
echo 28 28
echo 29 29
Hrm, good catch, too bad I have no idea why it misses it :-/
Yes, and why it doesn't miss it locally.
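For spotting a dropped line without reading the two outputs by eye, a set difference reports it directly. This is a minimal sketch, not the test code from this PR: `diff_output_lines` is a hypothetical helper, and the sample strings are rebuilt from the assertion message above. It uses `splitlines()` rather than splitting on `\n`, so a trailing newline does not add an empty string to the set.

```python
def diff_output_lines(expected, actual):
    """Return (missing, unexpected) sets of lines, ignoring line order."""
    # splitlines() drops the line terminators and does not produce a
    # trailing empty element the way split("\n") does.
    expected_lines = set(expected.splitlines())
    actual_lines = set(actual.splitlines())
    return expected_lines - actual_lines, actual_lines - expected_lines


# Sample data modelled on the failing chunk: lines 20..29, with 27 dropped.
expected = "".join("echo {0} {0}\n".format(i) for i in range(20, 30))
actual = expected.replace("echo 27 27\n", "")

missing, unexpected = diff_output_lines(expected, actual)
print("missing:", sorted(missing))
print("unexpected:", sorted(unexpected))
```

Note that sets collapse duplicate lines, so this comparison would not notice a line that was printed twice.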
One thing we can do to simplify the code and work around the string decoding issues is to default subprocess to decoding, via:
def command_pipe(command):
- return Popen(shlex.split(command), stdin=PIPE, stdout=PIPE, stderr=PIPE)
+ return Popen(shlex.split(command), stdin=PIPE, stdout=PIPE, stderr=PIPE,
+ universal_newlines=True)
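To show what that change buys us, here is a small self-contained sketch. `run_and_capture` is a hypothetical wrapper, not a function in qbatch, and the child process is just a Python one-liner standing in for the real command. With `universal_newlines=True`, `communicate()` accepts and returns text, so the tests would not need to decode anything themselves.

```python
import sys
from subprocess import PIPE, Popen


def run_and_capture(argv, stdin_text=""):
    """Run argv, feed it stdin_text, and return (stdout, stderr, returncode) as text."""
    proc = Popen(argv, stdin=PIPE, stdout=PIPE, stderr=PIPE,
                 universal_newlines=True)
    # In text mode communicate() takes a str and returns str objects.
    out, err = proc.communicate(stdin_text)
    return out, err, proc.returncode


# Stand-in child process: upper-cases whatever arrives on stdin.
child = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
out, err, code = run_and_capture(child, "echo 1 1\n")
print(repr(out), code)
```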
Well, I've been rerunning the travis build on my own branch of this PR (https://github.com/pipitone/qbatch/tree/set-comparison-test). It seems to fail less than 25% of the time, and I haven't been able to replicate it locally, but when I remove --line-buffer from the parallel command, I don't seem to get failures...
I wonder if it's a bug in parallel.
Perhaps we should pull down the latest version to check? It's just an untar to install.
Yeah... I reproduced the travis environment: ubuntu 14.04 with parallel 20130922, and I get the same failure when using --line-buffer, and also when I upgrade to parallel 20140122 (same version shipped with 16.04), although it maybe happens less frequently? Interestingly, I'm not seeing the failure using 20160822 on 14.04.
blah.
Scratch that. Just got a failure on 14.04 with 20160822.
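Since the failure is intermittent, a single passing run says little. Something like the following could put a rough number on it. This is only a sketch: `count_failing_runs` is a hypothetical helper, and the command below is a harmless Python stand-in, so the real parallel invocation would have to be substituted to test the actual bug.

```python
import sys
from subprocess import PIPE, Popen


def count_failing_runs(argv, expected_lines, runs=20):
    """Run argv `runs` times; count the runs whose set of stdout lines differs."""
    failures = 0
    for _ in range(runs):
        proc = Popen(argv, stdout=PIPE, stderr=PIPE, universal_newlines=True)
        out, _err = proc.communicate()
        if set(out.splitlines()) != set(expected_lines):
            failures += 1
    return failures


# Stand-in command (assumption): replace with the real parallel command line.
stand_in = [sys.executable, "-c", "for i in range(3): print('echo', i, i)"]
expected_lines = ["echo 0 0", "echo 1 1", "echo 2 2"]
print(count_failing_runs(stand_in, expected_lines, runs=5), "of 5 runs failed")
```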
I'm starting to wonder if this is an interaction problem between .communicate and parallel.
It seems people sometimes have issues with Popen/communicate and missing lines...
Do you have some pointers to where this problem is discussed?
Hmm.. interesting. I'll take a closer look. I wonder what changed in 16.04... ;-)
How badly do you want --line-buffer? It still seems to me that if you want "live" output from your commands, you can just handle redirection to a log file yourself (not friendly, but also not difficult).
I feel pretty strongly on this.
Right now, common usage of qbatch the way it's intended results in empty log files if the job runs into batch system issues. This is very user-unfriendly.
Fair point. I'm not sure what to do really. I'll keep investigating.
As a stop-gap, we could use --files or --results to get parallel to dump to files for us rather than a single shared log file. Thoughts?
@gdevenyi what do you think about using --files or --results with parallel so that it dumps individual log files? I'd vote for doing that, or just leaving it up to the users to write their own output redirection if they really want realtime output.
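For the "write your own redirection" option, this is roughly what a user would have to do to their command list before handing it to qbatch. It is a sketch under assumptions: `add_log_redirection` and the `command-N.log` naming are made up for illustration and are not part of qbatch or parallel. `shlex.quote` needs Python 3.3 or later.

```python
import posixpath
import shlex


def add_log_redirection(commands, log_dir):
    """Append a per-command stdout/stderr redirect to each shell command line."""
    wrapped = []
    for index, command in enumerate(commands):
        # Hypothetical naming scheme: one log file per command, by position.
        log_path = posixpath.join(log_dir, "command-{0}.log".format(index))
        wrapped.append("{0} > {1} 2>&1".format(command, shlex.quote(log_path)))
    return wrapped


for line in add_log_redirection(["echo 1 1", "echo 2 2"], "logs"):
    print(line)
```

The redirect is appended verbatim, so for a command that already contains a pipe it only captures the last stage of the pipeline.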
After much testing I have determined that the problem is that GNU parallel sometimes loses lines when run with --line-buffer.
Tested in the latest version and it looks like it's still there. Going to attempt to report a bug.
Convert tests to use sets.
Will manually re-run travis a few times to see if this fixes the random errors.