brettc / partitionfinder

PartitionFinder discovers optimal partitioning schemes for DNA sequences.

LG4X is a pain #109

Closed roblanf closed 8 years ago

roblanf commented 8 years ago

While running PF analyses again, I stumbled over the following issue:

I am using 24 cores, and I looked in the phylofiles folder to see what PF was working on. The file names show which model RAxML is currently running. My observation: while the starting scheme was being analysed, I almost always saw only LG4X models running. In very rare cases one or two threads worked on other models, but soon all were working on LG4X models again. After thinking about this briefly, I concluded that the LG4X model must take more time than all other models together. I would even say: (time for all other models) * N < (time for LG4X), where N is probably >= 24, since otherwise I would expect to see more threads working on other models.

I remembered that I had thought about this before, and that there will be a problem towards the end of the analysis, when the number of subsets PF is working on gets small: PF will often use only one thread even though many threads are available.

Well, a quick look at the code showed me that you must have had similar ideas before: you sort subsets by length and you assign a difficulty to each model. Long subsets are then analysed before shorter ones, and more difficult models before less difficult ones. So maybe everything is already OK, I thought. Irrespective of the order, if the LG4X model is much slower, all threads can end up stuck analysing LG4X models. But then I found that LG4X is always analysed as the last model, which is suboptimal if it takes the most time. Looking even further, I found that I can see the final model difficulty by increasing the verbosity.

Doing this I obtain as part of the debug information:

DEBUG | 2016-06-26 03:05:05,417 | raxml_mode | Model: DCMUT+G Difficulty: 2002
DEBUG | 2016-06-26 03:05:05,425 | raxml_mode | Model: WAG+G+F Difficulty: 5022
DEBUG | 2016-06-26 03:05:05,430 | raxml_mode | Model: LG+G Difficulty: 2002
DEBUG | 2016-06-26 03:05:05,436 | raxml_mode | Model: JTT+G+F Difficulty: 5022
DEBUG | 2016-06-26 03:05:05,441 | raxml_mode | Model: LG4X Difficulty: 6

So it appears that the model difficulty is very low for LG4X (6, versus 2002 to 5022 for the others) even though it takes more time than all other models together. The analysis could be made more efficient if the difficulty of the LG4X model were raised above that of the +G+F models, so that it is started first.
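To illustrate why the ordering matters, here is a minimal sketch in Python. It is not PartitionFinder's code: the run times are made-up numbers, the "corrected" LG4X difficulty of 6000 is a hypothetical value, and `makespan` and `schedule_order` are helper names invented for this example. Only the logged difficulties are taken from the debug output above. It hands tasks to whichever worker frees up first and compares the total wall time for the two orderings.

```python
import heapq

def makespan(durations, n_workers):
    """Greedy list scheduling: each task goes to the worker that frees up first."""
    workers = [0] * n_workers          # time at which each worker becomes free
    heapq.heapify(workers)
    for d in durations:
        free_at = heapq.heappop(workers)
        heapq.heappush(workers, free_at + d)
    return max(workers)                # wall time until the last task finishes

def schedule_order(difficulty):
    """Most difficult model first (stable for ties)."""
    return sorted(difficulty, key=lambda m: -difficulty[m])

# Hypothetical run times in arbitrary units; LG4X is assumed to be far slower.
run_time = {"DCMUT+G": 1, "WAG+G+F": 2, "LG+G": 1, "JTT+G+F": 2, "LG4X": 10}

# Difficulties as reported in the debug log.
logged = {"DCMUT+G": 2002, "WAG+G+F": 5022, "LG+G": 2002, "JTT+G+F": 5022, "LG4X": 6}

# Same values, but with LG4X raised above the +G+F models (6000 is an assumption).
corrected = dict(logged, LG4X=6000)

def makespan_for(difficulty, n_workers=2):
    order = schedule_order(difficulty)
    return makespan([run_time[m] for m in order], n_workers)

print(makespan_for(logged))     # 13: LG4X starts last and runs alone at the end
print(makespan_for(corrected))  # 10: LG4X starts first, the rest fit alongside it
```

With these assumed numbers, starting the slow model last leaves one worker idle while the other finishes LG4X; starting it first hides all the other models behind it.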

I have another idea, which unfortunately is probably much more difficult to implement:

While PF is waiting for the last slow model to complete, it could use its idle threads to work on subsets that are not needed for this step but have a good chance of being needed in future steps. Even at higher step numbers, PF often needs to analyse pairs or triples of data blocks. So even if rcluster-max is set to, say, 1000, PF could use the "idle time" to analyse the next most promising candidates beyond the 1000 limit. Their results would be kept "on hold" and not used until that combination of data blocks has to be computed. This procedure might analyse some combinations that are never needed, but I would guess that many of them are indeed needed in future steps. I agree that implementing this is far from trivial.
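A rough sketch of the "on hold" idea in Python, using only the standard library. Everything here is hypothetical: `analyse` stands in for an expensive model fit, and `LookAheadCache` is a name invented for this example, not something in PartitionFinder. Required subsets are submitted first; speculative ones are queued behind them, so they only run once a worker has nothing required left to pick up. Results are stored under the set of data blocks and reused when a later step asks for the same combination.

```python
from concurrent.futures import ThreadPoolExecutor

def analyse(blocks):
    """Stand-in for an expensive model fit; returns a fake score."""
    return sum(len(name) for name in blocks)

class LookAheadCache:
    """Keeps results of finished or running analyses, keyed by data-block set."""

    def __init__(self, n_workers):
        self.pool = ThreadPoolExecutor(max_workers=n_workers)
        self.pending = {}              # frozenset of block names -> Future

    def submit(self, blocks):
        key = frozenset(blocks)
        if key not in self.pending:    # never analyse the same combination twice
            self.pending[key] = self.pool.submit(analyse, key)
        return self.pending[key]

    def result(self, blocks):
        """Block until the analysis of this combination is available."""
        return self.submit(blocks).result()

cache = LookAheadCache(n_workers=4)

required = [("a", "b"), ("a", "c")]    # needed for the current step
speculative = [("b", "c")]             # beyond the limit, but likely needed later

for blocks in required:                # the pool's queue is FIFO: required first
    cache.submit(blocks)
for blocks in speculative:             # these fill workers that would sit idle
    cache.submit(blocks)

current = [cache.result(blocks) for blocks in required]

# A later step asks for the speculative combination: it is served from the cache.
later = cache.result(("c", "b"))
cache.pool.shutdown()
```

A real implementation would also need to rank the speculative candidates and discard results that become irrelevant once subsets are merged, which is where most of the difficulty lies.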

roblanf commented 8 years ago

Hi Christoph,

Thanks for spotting the difficulty stuff. I thought I'd done that before, but obviously not!! I'll fix that.

On the intelligent look-ahead stuff. I agree that it's desirable, but you're right that it's hard to implement, and for that reason I'm not going to do it. Of course, you are welcome to do it and submit a pull request :).

The same problem (waiting for a slow model to finish) is the whole inspiration behind the rclusterf algorithm, which seems to solve it very well and (in my experience with amino acid datasets) returns very good results for large datasets like the Misof et al. dataset.

roblanf commented 8 years ago

Done. LG4 models are now ranked just behind GTR (protein) models in difficulty.

https://github.com/brettc/partitionfinder/commit/91bcfced76259816c7804b396e5f059fdb45a569

roblanf commented 8 years ago

Well, I almost did it last time. This time I really did it.

https://github.com/brettc/partitionfinder/commit/e6df54c11a48f7527a32081e9feb4ea0113a7359