brettc / partitionfinder

PartitionFinder discovers optimal partitioning schemes for DNA sequences.

Big greedy runs fail at ~50% #9

Closed: roblanf closed this issue 10 years ago

roblanf commented 10 years ago

A very helpful user of the develop branch said this:

"That exception [to otherwise good performance] is a PF-develop run (search=greedy, MrBayes-specific models) I attempted for comparison with a PF-1.1.1 run of the same parameters. As I have mentioned previously, the PF-1.1.1 search=greedy runs were going very slowly, so I was hoping the PF-develop run would be fast or even finish before the previously-started PF-1.1.1 run.

In the end, the PF-1.1.1 run took >25 days (28/Dec – 25/Jan) to finish. The PF-develop run started very quickly and progressed to ~50% in 4-5 days, after which progress pretty much stopped. I let it run for a few more days, thought it had locked up, and so restarted it. After the restart I let it run for another 5-6 days with little progress, after which I needed the computer for other analyses and killed the job. Interestingly, the computer had written a >40 GB swap file trying to deal with this analysis."

Need to figure this out and fix it. My suspicion is that the current method of loading ALL the schemes at once is no good (this is the big change to the greedy algorithm since 1.1.1). But it could also be something to do with the databasing: the greedy algorithm creates a lot of subsets, and I wonder if DB.py is getting overloaded (if so, what do we do?). A third option: it's because we abandoned the weakref dictionary, and we're just keeping too many subsets around in memory.

3 things to try:

  1. Revert to the old greedy algorithm and run big analyses. See if we still get this error (assuming we can replicate it first).
  2. If (1) fixes the error, we can stick with the greedy algorithm that yields schemes, but yield only ~1000 schemes at a time (see the sketch after this list). That would keep most of the performance benefits and may work around the issue.
  3. Go back to flushing useless subsets out of memory somehow. That should be simple enough to do, and we could even move to the numpy solution I use in the relaxed clustering algorithm. [NOTE TO SELF: the newest formulation of greedy is equivalent to relaxed clustering with the percentage set to 100, so TRY THAT FIRST.]
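For what it's worth, option 2 boils down to draining the scheme generator in fixed-size chunks instead of materialising every scheme up front, so only one chunk is ever held in memory. A minimal sketch of the idea (the `generate_schemes` and `analyse_batch` names are hypothetical placeholders, not PartitionFinder's actual API):

```python
import itertools

def generate_schemes(n):
    # Placeholder: stands in for the generator that yields
    # candidate partitioning schemes one at a time.
    for i in range(n):
        yield f"scheme_{i}"

def analyse_batch(batch):
    # Placeholder: stands in for whatever analysis is done per scheme.
    pass

def analyse_in_batches(scheme_iter, batch_size=1000):
    """Drain a scheme generator in fixed-size batches so that only
    one batch is materialised in memory at a time."""
    while True:
        batch = list(itertools.islice(scheme_iter, batch_size))
        if not batch:
            break
        analyse_batch(batch)
        # The batch goes out of scope here and can be garbage-collected
        # before the next one is pulled from the generator.

analyse_in_batches(generate_schemes(25_000), batch_size=1000)
```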
brettc commented 10 years ago

My guess is that it's the new algorithm loading all schemes. I doubt this is the database (though I suppose it's possible), and if the new algorithm needs all schemes, then we need all subsets too, which means the weakref dictionary won't do us any good.
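For context on why the weakref dictionary is moot here: a `weakref.WeakValueDictionary` only evicts an entry once every strong reference to the value is gone, so a cache built on it cannot shrink while the algorithm itself keeps all the subsets alive. A minimal sketch (the `Subset` class and key are hypothetical, and the immediate eviction assumes CPython's reference counting):

```python
import weakref

class Subset:
    """Stand-in for a PartitionFinder subset object."""
    pass

cache = weakref.WeakValueDictionary()

s = Subset()
cache["gene1"] = s
print("gene1" in cache)  # True: the strong reference (s) keeps the entry alive

del s
print("gene1" in cache)  # False: last strong reference gone, entry evicted
```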

roblanf commented 10 years ago

This issue is now assumed fixed in the following commit (we don't have the original dataset, so we can't actually reproduce the error to check): https://github.com/brettc/partitionfinder/commit/4d94e4088e0d06ccfe08e5b135fe5ff0ea921a94

In short, I went with solution 2: we now analyse a maximum of 10,000 schemes at once.

If this doesn't fix the issue, then we will have to look into the database more.