brettc / partitionfinder

PartitionFinder discovers optimal partitioning schemes for DNA sequences.

Order of clustering steps can differ on different machines, even in reruns #13

Open cmayer opened 10 years ago

cmayer commented 10 years ago

Running PF in rcluster mode can result in different orders of the clustering steps.

I started PF on one machine, but it eventually turned out that its RAM was insufficient. To save time, I copied the analysis folder to a machine with more RAM and continued the analysis there. When restarting PF on this data set, I expected it to work its way to the point where it had stopped before calling raxml again, since it can read all previous results from its database. However, after a few hundred clustering steps it performed a step that it had not done in the first run, so raxml was called for all subsequent clustering steps to evaluate a small number of subsets.

Potential cause: rounding errors, arising either from differences between machines or from writing values to the database and reading them back, could lead to this effect.
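To make the write-and-read-back possibility concrete, here is a minimal Python sketch. It does not use PF's actual storage code, and the fixed six-decimal format is purely an assumption for illustration. It shows that a float written as shortened text comes back as a different number, while a full `repr` round-trips exactly.

```python
# Hypothetical illustration: this is NOT how PF is known to store values.
value = 0.1 + 0.2                        # 0.30000000000000004 in IEEE-754 doubles

# Writing with a fixed number of decimals loses the trailing bits.
short_text = "%.6f" % value              # "0.300000"
read_back_short = float(short_text)

# repr() emits enough digits for an exact round trip.
read_back_exact = float(repr(value))

print(read_back_short == value)          # False: the stored value changed
print(read_back_exact == value)          # True: identical after the round trip
```

If stored scores were shortened like this, two subsets that differed only in the last few digits could compare differently after a restart.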

It's not a critical issue, and it is potentially unavoidable.

roblanf commented 10 years ago

There are a couple of places this could happen:

  1. In the calculation of subset similarity. This uses a bunch of numpy routines to figure out the Manhattan distance between subsets. There's a lot of opportunity for very slight differences to be important here, because typically in the first run of the algorithm we calculate distances for tens of millions of subset pairs, then rank them and choose the top ~1000. So, if there is ANY difference in the collection of subset pairs that make it into the top 1000, then you will see raxml firing up on the new computer in the very first step of the algorithm. This is my best guess as to the root of this issue.
  2. In the calculation of scheme AICc / BIC scores. What we store is a large collection of likelihoods: the likelihood of every subset we found, analysed under all models in the list. When these are read in at the start of the run, we use those likelihoods to calculate the AICc / BIC / AIC scores of partitioning schemes. This involves a lot of calculations, and my guess is that if we have two partitioning schemes with very similar AICc / BIC scores, there is an opportunity for them to change order in the rankings, with similar effects to the above.
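The first point can be sketched in a few lines of plain Python. This is not PF's code: the function names, the `features` dictionary, and the `digits` parameter are all hypothetical, and it uses pure Python rather than numpy to stay self-contained. The idea is to rank subset pairs by Manhattan distance, rounding each distance and breaking ties on the pair names, so that differences in the last bits of a float do not decide which pairs make the cut.

```python
from itertools import combinations

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def closest_pairs(features, k, digits=9):
    """Return the k closest subset pairs as (distance, name_a, name_b).

    Distances are rounded to `digits` decimals and ties are broken by
    subset name, so the ranking does not hinge on last-bit differences.
    """
    scored = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(features.items()), 2):
        d = round(manhattan(vec_a, vec_b), digits)
        scored.append((d, name_a, name_b))
    scored.sort()  # sorts by distance first, then by the two names
    return scored[:k]

# Hypothetical per-subset feature vectors (e.g. rates or base frequencies).
features = {
    "s1": (0.1, 0.2),
    "s2": (0.4, 0.2),
    "s3": (0.1, 0.5),
    "s4": (0.9, 0.9),
}
top = closest_pairs(features, k=2)
print(top)
```

Here the s1/s2 and s1/s3 distances are both about 0.3 and may differ in their last bits before rounding; after rounding they tie, and the names decide the order. Rounding only reduces the problem: two distances that straddle a rounding boundary can still end up on different sides of the cutoff.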

In the end I don't think this will matter much. It may lead to very minor differences in the estimated partitioning schemes on different computers, but it's going to be practically impossible to avoid that anyway, because we can't control for floating point differences in e.g. RAxML and PhyML. In most cases, though, this issue won't affect analyses at all, apart from changing the order in which things are done and creating issues like the one Christoph is seeing.
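For the scoring side, a small Python sketch shows how little it takes for two near-identical totals to differ. It assumes that a scheme's log-likelihood is built by adding per-subset log-likelihoods; the values and the helper names `naive_sum` and `aicc` are made up for illustration and are not PF's functions. The AICc formula is the standard one, AICc = 2k - 2 lnL + 2k(k+1)/(n-k-1).

```python
import math

def naive_sum(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total

def aicc(lnl, k, n):
    """Standard AICc for log-likelihood lnl, k parameters, n sites."""
    aic = 2 * k - 2 * lnl
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# Made-up per-subset log-likelihoods, summed in two different orders.
subset_lnls = [-0.1, -0.2, -0.3]
forward = naive_sum(subset_lnls)
backward = naive_sum(reversed(subset_lnls))
print(forward == backward)                # False: the order changes the last bits

# math.fsum tracks partial sums exactly, so the order no longer matters.
stable_forward = math.fsum(subset_lnls)
stable_backward = math.fsum(reversed(subset_lnls))
print(stable_forward == stable_backward)  # True

print(aicc(stable_forward, k=2, n=10))
```

An order-independent sum removes only one source of variation; likelihoods that already differ when they come out of RAxML or PhyML on another machine would still produce different scores.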