brettc / partitionfinder

PartitionFinder discovers optimal partitioning schemes for DNA sequences.

buffer size issue #124

Open roblanf opened 7 years ago

roblanf commented 7 years ago

Hello

We have installed PartitionFinder on our cluster, and we notice strange behavior when we increase the number of threads (not MPI) in combination with the --raxml option while processing a huge dataset:

The whole PartitionFinder process stays frozen, waiting for raxml.linux subprocesses that are often marked as zombies.

With the example nucleotide dataset we noticed the same behavior, even with -p 8.

With the debugging option and --save-phylofiles we checked whether something was wrong with RAxML: launched sequentially on their own, outside PartitionFinder, on the same data, all RAxML processes run without any problem.

We suspected a problem of buffer size (not set in the subprocess.Popen call). We set a comfortably large one and could get further into the data processing, but the main process still ends up blocked in the same way.

I've changed run_program in partfinder/util.py to the following, replacing subprocess.Popen with a plain old os.system call, and now everything is OK:

    def run_program(binary, command):
        unique_filename = uuid.uuid4()
        command = "\"%s\" %s 2> %s.err > %s.out" % (binary, command, unique_filename, unique_filename)
        log.debug("Running '%s'", command)
        returncode = os.system(command)
        if returncode != 0:
            raise ExternalProgramError(
                "Exit %s: %s" % (returncode, command),
                "see %s.err %s.out files in project folder" % (unique_filename, unique_filename))
        else:
            os.remove("%s.err" % (unique_filename))
            os.remove("%s.out" % (unique_filename))

I've not tested another "old" solution found here https://bugs.python.org/issue12739

The tests were performed with:

partitionfinder 2.1.1
Python 2.7.13
RAxML 8.2.9, compiled with gcc 4.4.7 (tests also made with gcc 6.1.0)
CentOS release 6.5 (Final)

on a Dell PowerEdge C6220 (2 x Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 10 cores with Hyper-Threading, 256 GB RAM)

The initial command line was: python PartitionFinder.py -p 20 --raxml --no-ml-tree examples/nucleotide/

We also noticed that -p 2 in fact gave roughly the same processing time as -p 20 ...

Yours faithfully

Patrice Déhais

roblanf commented 7 years ago

I don't pretend to understand in detail what is happening here, but the link to the 'old' solution above contains a workaround that involves locking threads.

We lock threads on line 50 of threadpool here: https://github.com/brettc/partitionfinder/blob/9ffa4272a48fc864a06faf588dd3dce641ef3aa8/partfinder/threadpool.py
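
To make that concrete, here is a minimal sketch of that kind of workaround (illustrative only, not our threadpool code): serialize the fork/exec step behind a lock, but wait on the child outside the lock so workers still run in parallel.

    import subprocess
    import threading

    # Illustrative only: one lock serializes process creation, while
    # communicate() happens outside the lock so workers still overlap
    # while their subprocesses run.
    _popen_lock = threading.Lock()

    def run_locked(binary, command):
        with _popen_lock:
            p = subprocess.Popen('"%s" %s' % (binary, command), shell=True,
                                 stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, stderr = p.communicate()   # also reaps the child process
        return p.returncode, stdout, stderr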

So other things that may be going on are:

  1. Our thread locking is not working for some reason
  2. Our thread locking is working but is somehow not solving the issue reported here

As an empiricist, I will try the solution suggested by Patrice, and see what happens...

roblanf commented 7 years ago

One (minor) issue with the os.system solution above is that it puts a lot of files in the directory that has the alignment / .cfg file.

@brettc, assuming this solution is quicker (I'm still figuring that out) wouldn't it be better to move these to a temporary directory?

Another issue is whether that os.system call will work on Windows, i.e. whether the shell redirection of stdout and stderr works there. It works on Mac (I checked) and on Linux (of course), but I don't know about Windows yet.
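
If we keep the os.system approach, one way to keep that directory clean would be something like this (just a sketch; the function name and temp-dir handling are illustrative):

    import os
    import shutil
    import tempfile
    import uuid

    def run_program_tmpdir(binary, command):
        # Put the redirect files somewhere out of the way instead of the
        # alignment/.cfg directory.
        tmpdir = tempfile.mkdtemp(prefix="partfinder_")
        base = os.path.join(tmpdir, str(uuid.uuid4()))
        full = '"%s" %s 2> %s.err > %s.out' % (binary, command, base, base)
        returncode = os.system(full)
        if returncode != 0:
            # Keep the .err/.out files around for inspection on failure.
            raise RuntimeError("Exit %s: %s (see %s.err / %s.out)"
                               % (returncode, full, base, base))
        shutil.rmtree(tmpdir)  # success: throw the redirect files away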

roblanf commented 7 years ago

I used the first 20 data blocks of this dataset for a comparison (it's the dataset we used in the PF2 paper, but in that paper we used all data blocks, which is closer to 200).

The expectation here is that doubling p should roughly halve the processing time. There are limits because we have single-processor bottlenecks when selecting the best scheme etc. So halving is best-case.

Timings on my desktop Mac (timing is from 'time'). 'New' is the solution proposed above. 'Old' is the current master branch of PF.

p    ver    time(s)
1    new    253
2    new    237
8    new    236
1    old    254
2    old    138
8    old    88

So, at least on my Mac desktop, the current master branch seems to perform as expected (more processors, faster analysis, up to the limit imposed by single-processor bottlenecks). The solution proposed above does not operate as expected.

Next I'll try it out on Linux. Perhaps there is some funky stuff going on with the OS?

roblanf commented 7 years ago

I should probably open a separate issue about the 'zombie' threads. On re-reading Patrice's email, that might be the main issue here...

but until I do, @brettc do you think we can use any of the solutions here for the zombie processes (esp. process.wait()?): http://stackoverflow.com/questions/2760652/how-to-kill-or-avoid-zombie-processes-with-subprocess-module
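
For reference, the gist of that thread is that a child process stays a zombie until its parent reaps it with something like wait(); a minimal sketch of that pattern (not our current util.py code, and the file handling here is illustrative):

    import subprocess

    def run_and_reap(binary, command, out_path, err_path):
        # Send output to files rather than pipes, so the child can never block
        # on a full pipe buffer; wait() then reaps it, so no zombie is left.
        with open(out_path, "w") as out, open(err_path, "w") as err:
            p = subprocess.Popen('"%s" %s' % (binary, command), shell=True,
                                 stdout=out, stderr=err)
            return p.wait()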

roblanf commented 7 years ago

One more thought. The Python 2.7.x docs say this on buffering:

bufsize, if given, has the same meaning as the corresponding argument to the built-in open() function: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size. A negative bufsize means to use the system default, which usually means fully buffered. The default value for bufsize is 0 (unbuffered).

Note If you experience performance issues, it is recommended that you try to enable buffering by setting bufsize to either -1 or a large enough positive value (such as 4096).

So, in addition to testing the solution above (and specifically checking for zombie processes), we should try setting bufsize to -1 when calling Popen.
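
Concretely, that would mean something like this (a sketch, not the actual util.py code):

    import subprocess

    def run_buffered(command):
        # bufsize=-1 requests the system default buffer size (fully buffered),
        # as suggested by the note in the 2.7 docs.
        p = subprocess.Popen(command, shell=True, bufsize=-1,
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, stderr = p.communicate()
        return p.returncode, stdout, stderr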

roblanf commented 7 years ago

Timings from the server (processors are slower than my desktop, hence the slower rate on 1 processor etc). Versions are as above. Version 'buf' is with bufsize=-1 when calling Popen().

p    ver    time(s)
1    new    406
2    new    226
8    new    122
32   new    59
1    old    399
2    old    206
8    old    78
32   old    65
1    buf    392
2    buf    206
8    buf    78
32   buf    64

Two things of note:

  1. Setting the buffer size to -1 (i.e. using the system default) does nothing bad here.
  2. The solution proposed above works really nicely for Linux - it's a little slower on up to 8 processors, but it's QUICKER than our current Popen() solution for 32 processors.

So, that new solution probably is getting to the heart of a key problem with our Popen() solution - I suspect the zombie processes might be an issue here but I haven't checked.

roblanf commented 7 years ago

I wonder whether the way we split the output of stderr and stdout could be causing the zombies here, @brettc?

Here's an interesting thread on stack overflow which seems to suggest something a bit like this:

http://stackoverflow.com/questions/1180606/using-subprocess-popen-for-process-with-large-output
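
The failure mode described there is roughly this: if we give the child pipes but don't drain them while it runs, the child can block forever once a pipe's OS buffer fills up. A sketch of the deadlock-prone pattern versus the drained one (illustrative only):

    import subprocess

    def deadlock_prone(command):
        p = subprocess.Popen(command, shell=True,
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        p.wait()  # can hang: the child blocks writing to a pipe we never drain
        return p.stdout.read(), p.stderr.read()

    def drained(command):
        p = subprocess.Popen(command, shell=True,
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, stderr = p.communicate()  # drains both pipes, then reaps the child
        return stdout, stderr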

roblanf commented 7 years ago

Here's another test. This time on the Fong dataset with 168 data blocks, but with only the first 10 taxa to speed things up a little.

Old: current master branch
New: solution above
err: new stdin and stderr handling to try and avoid zombies, and to avoid reading with p.communicate() except if there's an error

All analyses on the server.

p    ver    time(s)
32   old    3022
32   new    2971
32   err    2962

So, on that basis I see no practical difference between the three approaches. The 'err' approach might be worth considering. I'll publish the branch and we can see.
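
For reference, a minimal sketch of what I mean by the 'err' approach: redirect stdout/stderr to files so nothing goes through pipes, wait() so the child is reaped immediately, and only read the files back if the run failed. This is the idea, not necessarily exactly what ends up in the branch:

    import os
    import subprocess
    import uuid

    def run_program_err(binary, command):
        base = str(uuid.uuid4())
        out_path, err_path = base + ".out", base + ".err"
        with open(out_path, "w") as out, open(err_path, "w") as err:
            p = subprocess.Popen('"%s" %s' % (binary, command), shell=True,
                                 stdout=out, stderr=err)
            returncode = p.wait()              # reap the child straight away
        if returncode != 0:
            with open(err_path) as errfile:    # only read output back on failure
                raise RuntimeError("Exit %s: %s\n%s"
                                   % (returncode, command, errfile.read()))
        os.remove(out_path)
        os.remove(err_path)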

roblanf commented 7 years ago

'err' approach is published in this branch: https://github.com/brettc/partitionfinder/tree/buffer2

roblanf commented 7 years ago

Results of the user running the tests: still a plateau after 2 processors:


I've made tests with 1, 2, 8 and 20 processors on the same machine on our cluster (CentOS 6.5 + GPFS shared file system), with the Fong dataset, using:
PartitionFinder v2.1.1 original code
PartitionFinder v2.1.1 with my os.system modification
PartitionFinder v2.1.1 branch buffer2
You will find the timing results in the attached document, and still a plateau in wall-clock time.
With the buffer2 branch, wall-clock time is lower because the raxml.linux call is less verbose with your modifications.
However, 2 CPUs provide a real gain, but more CPUs do not really help (just more system CPU time and memory).
Your tests on Ubuntu did not show the same plateau.
I checked whether this difference was due to the version of time, so I recompiled GNU time, but on CentOS 6.5 the plateau is still there.
I checked whether this was due to the old CentOS 6.5, so I tested on a modern workstation with a virtual machine (4 CPUs) installed with Ubuntu, then Fedora (last sheet of the attached document), but the plateau is still there.
And finally, I checked the file system: on the cluster we use GPFS as scratch and on Fedora it was LVM partitioning, so I also tested with a local file system and the plateau is still there (1, 2, 4 CPUs, so a small plateau).

Cores   C          S          U          M              F
1       6,795.03   1,338.58   5,364.40   3579376        30
2       4,884.91   2,219.58   5,907.06   3602224        0
8       4,804.30   2,485.61   5,928.67   3679488        0
20      4,718.63   2,754.75   5,905.99   3867104        0

Use local storage instead of GPFS mount:

Cores   C          S          U          M              F
1       6,141.06   745.83     5,384.98   3,566,272.00   0
2       4,635.20   1,712.61   6,058.35   3,594,672.00   0
8       4,523.88   2,073.23   6,039.71   3,691,024.00   0
20      4,529.80   2,357.81   5,980.06   3,867,840.00   0
1 6,141.06 745.83 5,384.98 3,566,272.00 0 2 4,635.20 1,712.61 6,058.35 3,594,672.00 0 8 4,523.88 2,073.23 6,039.71 3,691,024.00 0 20 4,529.80 2,357.81 5,980.06 3,867,840.00 0