Open roblanf opened 7 years ago
I don't pretend to understand in detail what is happening here, but the link to the 'old' solution above contains a workaround that involves locking threads.
We lock threads on line 50 of threadpool here: https://github.com/brettc/partitionfinder/blob/9ffa4272a48fc864a06faf588dd3dce641ef3aa8/partfinder/threadpool.py
So other things that may be going on are:
As an empiricist, I will try the solution suggested by Patrice, and see what happens...
One (minor) issue with the os.system solution above is that it puts a lot of files in the directory that has the alignment / .cfg file.
@brettc, assuming this solution is quicker (I'm still figuring that out) wouldn't it be better to move these to a temporary directory?
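To make the temporary-directory idea concrete, here's a minimal sketch (the function name, error type, and cleanup strategy are my own, not PartitionFinder's run_program):

```python
import os
import shutil
import tempfile
import uuid

def run_program_tmpdir(binary, command):
    # Hypothetical variant of the os.system solution: redirect stdout and
    # stderr into a private temporary directory instead of the
    # alignment/.cfg folder, and remove the whole directory afterwards.
    tmpdir = tempfile.mkdtemp(prefix="partfinder_")
    stem = os.path.join(tmpdir, str(uuid.uuid4()))
    full = "\"%s\" %s 2> %s.err > %s.out" % (binary, command, stem, stem)
    try:
        returncode = os.system(full)
        if returncode != 0:
            with open("%s.err" % stem) as f:
                raise RuntimeError("Exit %s: %s\n%s" % (returncode, full, f.read()))
        return returncode
    finally:
        # The .err/.out files never touch the working directory and are
        # always cleaned up, even on failure.
        shutil.rmtree(tmpdir, ignore_errors=True)
```

That would keep the user-visible directory clean at the cost of one mkdtemp/rmtree per call, which should be negligible next to a RAxML run.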
Another issue is whether the os.system command will work on Windows. I.e. will the redirections of stdout and stderr work? They work on Mac (I checked) and on Linux (of course) but I don't know about Windows yet.
I used the first 20 datablocks of this dataset for a comparison (it's the dataset we used in the PF2 paper, but in that paper we used all datablocks which is closer to 200 datablocks):
The expectation here is that doubling p should roughly halve the processing time. There are limits because we have single-processor bottlenecks when selecting the best scheme etc. So halving is best-case.
Timings on my desktop mac (timing is from 'time'). 'New' is the solution proposed above. 'Old' is the current master branch of PF.
p ver time(s)
1 new 253
2 new 237
8 new 236
1 old 254
2 old 138
8 old 88
So, at least on my Mac desktop, the current master branch seems to perform as expected (more processors, faster analysis, up to the limit imposed by single-processor bottlenecks). The solution proposed above does not operate as expected.
Next I'll try it out on Linux. Perhaps there is some funky stuff going on with the OS?
I should probably open a separate issue about the 'zombie' threads. On re-reading Patrice's email, that might be the main issue here...
but until I do, @brettc do you think we can use any of the solutions here for the zombie processes (esp. process.wait()?): http://stackoverflow.com/questions/2760652/how-to-kill-or-avoid-zombie-processes-with-subprocess-module
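For reference, the core of the advice in that Stack Overflow thread is just that the parent must reap each child; a minimal sketch (my example command, not our code):

```python
import subprocess

# A child stays 'defunct' (a zombie) only until its parent reaps it.
# communicate() drains both pipes and then waits on the child, so using
# it (or calling p.wait() explicitly) is enough to avoid zombies.
p = subprocess.Popen(["echo", "done"],
                     stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE,
                     universal_newlines=True)
out, err = p.communicate()  # reads stdout/stderr AND reaps the child
```

If we ever abandon a Popen object without calling wait() or communicate() on it, the child lingers as a zombie until the parent process exits.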
One more thought. The Python 2.7.x docs say this on buffering:
bufsize, if given, has the same meaning as the corresponding argument to the built-in open() function: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size. A negative bufsize means to use the system default, which usually means fully buffered. The default value for bufsize is 0 (unbuffered).
Note: If you experience performance issues, it is recommended that you try to enable buffering by setting bufsize to either -1 or a large enough positive value (such as 4096).
So, in addition to testing the solution above (and specifically checking for zombie processes), we should try setting bufsize to -1 when calling Popen.
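Concretely, the change would look something like this (a sketch with a stand-in command, not the actual util.py call):

```python
import subprocess

# bufsize=-1 asks for fully buffered pipe I/O; the Python 2.7 default is
# bufsize=0 (unbuffered), which can be slow for programs that write a lot
# of output. (Python 3 already defaults to -1.)
p = subprocess.Popen(["echo", "hello"],
                     bufsize=-1,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE,
                     universal_newlines=True)
out, err = p.communicate()
```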
Timings from the server (processors are slower than my desktop, hence the slower rate on 1 processor etc). Versions are as above. Version 'buf' is with bufsize=-1 when calling Popen().
p ver time(s)
1 new 406
2 new 226
8 new 122
32 new 59
1 old 399
2 old 206
8 old 78
32 old 65
1 buf 392
2 buf 206
8 buf 78
32 buf 64
Two things of note:
So, that new solution is probably getting to the heart of a key problem with our Popen() approach; I suspect the zombie processes might be an issue here, but I haven't checked.
I wonder if our splitting of stderr and stdout into separate pipes could be causing the zombies here, @brettc?
Here's an interesting thread on stack overflow which seems to suggest something a bit like this:
http://stackoverflow.com/questions/1180606/using-subprocess-popen-for-process-with-large-output
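The gist of that thread, as I read it: with both stdout and stderr attached to pipes, reading one of them to EOF while the child is still writing to the other can deadlock once the unread pipe's kernel buffer (~64 KB on Linux) fills. A sketch of the two safe patterns (my example command, not RAxML):

```python
import subprocess
import tempfile

# Pattern 1: let communicate() drain both pipes concurrently, so neither
# pipe's kernel buffer can fill while the other is being read.
p = subprocess.Popen(["echo", "hi"],
                     stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE,
                     universal_newlines=True)
out, err = p.communicate()

# Pattern 2: redirect to real files instead of pipes; no pipe buffer is
# involved, so the child can never block on a full pipe.
with tempfile.TemporaryFile(mode="w+") as fout, \
     tempfile.TemporaryFile(mode="w+") as ferr:
    rc = subprocess.call(["echo", "hi"], stdout=fout, stderr=ferr)
    fout.seek(0)
    file_out = fout.read()
```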
Here's another test. This time on the Fong dataset with 168 data blocks, but with only the first 10 taxa to speed things up a little.
Old: current master branch
New: solution above
err: new stdout and stderr handling to try to avoid zombies, and to avoid reading with p.communicate() except when there's an error
All analyses on the server.
p ver time(s)
32 old 3022
32 new 2971
32 err 2962
So, on that basis I see no practical difference between the three approaches. The 'err' approach might be worth considering. I'll publish the branch and we can see.
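For anyone following along, here's roughly the shape of the 'err' approach (the names here are mine, a sketch rather than the branch's actual code):

```python
import os
import subprocess
import tempfile
import uuid

def run_quiet(args):
    # The 'err' idea: send stdout/stderr to files rather than pipes, and
    # read them back only if the program fails. On the success path the
    # program's (chatty) output never passes through Python, and there
    # are no pipes to deadlock on; subprocess.call() also reaps the
    # child, so no zombies.
    stem = os.path.join(tempfile.gettempdir(), str(uuid.uuid4()))
    with open(stem + ".out", "w") as fout, open(stem + ".err", "w") as ferr:
        rc = subprocess.call(args, stdout=fout, stderr=ferr)
    if rc != 0:
        with open(stem + ".err") as f:
            raise RuntimeError("exit %s: %s" % (rc, f.read()))
    os.remove(stem + ".out")
    os.remove(stem + ".err")
    return rc
```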
'err' approach is published in this branch: https://github.com/brettc/partitionfinder/tree/buffer2
Results of the user running the tests: still a plateau after 2 processors:
I've run tests with 1, 2, 8 and 20 processors on the same machine on our cluster (CentOS 6.5 + GPFS shared file system), with the Fong dataset, using:
PartitionFinder v2.1.1 original code
PartitionFinder v2.1.1 with my os.system modification
PartitionFinder v2.1.1 branch buffer2
You will find the timing results in the attached document; there is still a plateau in wallclock CPU time.
With the buffer2 branch, wallclock CPU time is lower because the raxml.linux call is less verbose with your modifications.
However, while 2 CPUs provide a real gain, more CPUs do not (just more system CPU and memory).
Your tests on Ubuntu did not show the same plateau.
I checked whether this difference was due to the version of time, so I recompiled GNU time, but on CentOS 6.5 the plateau is still the same.
I checked whether this was due to the old CentOS 6.5, so I tried a modern workstation with a virtual machine (4 CPUs) running Ubuntu, then Fedora (last sheet of the attached document), but the plateau is still there.
And finally, I checked the file system: on the cluster we use GPFS as scratch, and on Fedora it was an LVM partition, so I also tested with a local file system, and the plateau is still there (1, 2, 4 CPUs, so only a small plateau).
1   6,795.03  1,338.58  5,364.40  3579376  30
2   4,884.91  2,219.58  5,907.06  3602224   0
8   4,804.30  2,485.61  5,928.67  3679488   0
20  4,718.63  2,754.75  5,905.99  3867104   0
Use local storage instead of GPFS mount
1   6,141.06    745.83  5,384.98  3,566,272.00  0
2   4,635.20  1,712.61  6,058.35  3,594,672.00  0
8   4,523.88  2,073.23  6,039.71  3,691,024.00  0
20  4,529.80  2,357.81  5,980.06  3,867,840.00  0
Hello
We have installed PartitionFinder on our cluster, and we notice strange behavior when we increase the number of threads (not MPI) combined with the --raxml option to process a huge dataset:
The whole PartitionFinder process stays frozen waiting for raxml.linux sub processes, often marked as zombies.
With the example nucleotide dataset we noticed the same behavior, even with -p 8.
With the debugging option and --save-phylofiles we checked whether there was something wrong with RAxML ... launched sequentially on their own outside PartitionFinder on the same data, all the RAxML processes ran without any problem.
We suspected a problem with the buffer size (not set in the subprocess.Popen call) ... we set a comfortable one and could get further through the data processing, but the main process still blocks in the same way.
I've changed the code of run_program in partfinder/util.py to the following, replacing subprocess.Popen with a plain old os.system, and now everything is OK:
    def run_program(binary, command):
        unique_filename = uuid.uuid4()
        command = "\"%s\" %s 2> %s.err > %s.out" % (binary, command, unique_filename, unique_filename)
        log.debug("Running '%s'", command)
        returncode = os.system(command)
        if returncode != 0:
            raise ExternalProgramError(
                "Exit %s: %s" % (returncode, command),
                "see %s.err %s.out files in project folder" % (unique_filename, unique_filename))
        else:
            os.remove("%s.err" % (unique_filename))
            os.remove("%s.out" % (unique_filename))
I've not tested another "old" solution found here: https://bugs.python.org/issue12739
The tests were performed with :
partitionfinder-2.1.1
python 2.7.13
RAxML 8.2.9 compiled with gcc 4.4.7 (tests also made with gcc 6.1.0)
CentOS release 6.5 (Final)
on Dell PowerEdge C6220 (2 x Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz 10 cores with Hyper-Threading, 256G RAM)
The initial command line was : python PartitionFinder.py -p 20 --raxml --no-ml-tree examples/nucleotide/
We also noticed that -p 2 gave roughly the same processing time as -p 20 ...
Yours faithfully
Patrice Déhais