Closed roblanf closed 9 years ago
Good idea.
There are several individual runs of k-means throughout the algorithm, shall we run them all with the same random number seed? Or design some method of storing a bunch of random number seeds, e.g. with the subset object.
We should be able to deterministically generate all random number seeds from a single input seed.
On 4 February 2015 at 07:01, Paul Frandsen notifications@github.com wrote:
Good idea.
There are several individual runs of k-means throughout the algorithm, shall we run them all with the same random number seed? Or design some method of storing a bunch of random number seeds, e.g. with the subset object.
— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-72701124 .
Rob Lanfear School of Biological Sciences, Macquarie University, Sydney
phone: +61 (0)2 9850 8204
www.robertlanfear.com
Using "one seed to rule them all" should work, but repeatability gets a little harder with threading, as the order of drawing random numbers might change over runs. I suspect we might need to allocate separate random number generators (not just seeds) for each process.
Eeek.
Let's keep the threading part of this low down the list - mostly the seed option would be so that we could nail down bugs if we really want to.
There are a lot of other issues that are more important than this.
On 4 February 2015 at 09:26, Brett Calcott notifications@github.com wrote:
Using "one seed to rule them all" should work, but repeatability gets a little harder with threading, as things the order of drawing random numbers might change over runs. I suspect we might need to allocate separate random number generators (not just seeds) for each process.
— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-72727399 .
Rob Lanfear School of Biological Sciences, Macquarie University, Sydney
phone: +61 (0)2 9850 8204
www.robertlanfear.com
As far as I understood Paul, the k-means clustering is rather fast.
A problem is that threading could be used rather straight forward by taking care of different "focal subsets" in different threads. But then we will run into problems with the seed. As Brett suggested we can use seed for each thread, chosen in some simple way. For example: take smallest site index of subset and multiply it with the number of sites in the subset. That could be sufficiently good as a seed for this subset. Of course the random number generator must be local to that thread.
Best Christoph
Am 03.02.2015 um 21:44 schrieb roblanf:
Eeek.
Let's keep the threading part of this low down the list - mostly the seed option would be so that we could nail down bugs if we really want to.
There are a lot of other issues that are more important than this.
On 4 February 2015 at 09:26, Brett Calcott notifications@github.com wrote:
Using "one seed to rule them all" should work, but repeatability gets a little harder with threading, as things the order of drawing random numbers might change over runs. I suspect we might need to allocate separate random number generators (not just seeds) for each process.
— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-72727399 .
Rob Lanfear School of Biological Sciences, Macquarie University, Sydney
phone: +61 (0)2 9850 8204
www.robertlanfear.com — Reply to this email directly or view it on GitHub.
Dr. Christoph Mayer Email: c.mayer.zfmk@uni-bonn.de Tel.: +49 (0)228 9122 403
Zoologisches Forschungsmuseum Alexander Koenig
Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele Sitz: Bonn
Hi,
I forgot something in my suggestion: The seed per thread must differ according to the seed specified by the user. So one could take: index *size + user_seed
Then different runs can have different seeds or the same, depending on the user_seed.
Best Christoph
Am 03.02.2015 um 22:06 schrieb Christoph Mayer:
As far as I understood Paul, the k-means clustering is rather fast.
A problem is that threading could be used rather straight forward by taking care of different "focal subsets" in different threads. But then we will run into problems with the seed. As Brett suggested we can use seed for each thread, chosen in some simple way. For example: take smallest site index of subset and multiply it with the number of sites in the subset. That could be sufficiently good as a seed for this subset. Of course the random number generator must be local to that thread.
Best Christoph
Am 03.02.2015 um 21:44 schrieb roblanf:
Eeek.
Let's keep the threading part of this low down the list - mostly the seed option would be so that we could nail down bugs if we really want to.
There are a lot of other issues that are more important than this.
On 4 February 2015 at 09:26, Brett Calcott notifications@github.com wrote:
Using "one seed to rule them all" should work, but repeatability gets a little harder with threading, as things the order of drawing random numbers might change over runs. I suspect we might need to allocate separate random number generators (not just seeds) for each process.
— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-72727399 .
Rob Lanfear School of Biological Sciences, Macquarie University, Sydney
phone: +61 (0)2 9850 8204
www.robertlanfear.com — Reply to this email directly or view it on GitHub.
Dr. Christoph Mayer Email: c.mayer.zfmk@uni-bonn.de Tel.: +49 (0)228 9122 403
Zoologisches Forschungsmuseum Alexander Koenig
- Leibniz Institut für Biodiversität der Tiere - Adenauerallee 160 53113 Bonn, Germany www.zfmk.de
Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele Sitz: Bonn
Dr. Christoph Mayer Email: c.mayer.zfmk@uni-bonn.de Tel.: +49 (0)228 9122 403
Zoologisches Forschungsmuseum Alexander Koenig
Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele Sitz: Bonn
hey all,
is there any reason anyone can see not to just fix the random number seed for all kmeans analyses to a single arbitrary value?
e.g.
kmeans_out = KMeans(init='k-means++', n_clusters=number_of_ks,
n_init=100, n_jobs=n_jobs, random_state = 2147483647)
As far as I can tell, this would solve all of our problems in the simplest possible way. Kmeans would work the same every time, nobody would have to try setting a seed, we don't need to worry about threading, etc. etc.
Tell me ASAP if you can see any problems here...
cos it feels icky?
I guess it would work fine. I don't think initialising a seed and storing it would be that hard either.
Also: the kmeans step always happens in sequential order right? So no problems with threads taking different times, and changing the order that kmeans is called?
On Fri, Oct 2, 2015 at 3:03 PM, roblanf notifications@github.com wrote:
hey all,
is there any reason anyone can see not to just fix the random number seed for all kmeans analyses to a single arbitrary value?
e.g.
kmeans_out = KMeans(init='k-means++', n_clusters=number_of_ks, n_init=100, n_jobs=n_jobs, random_state = 2147483647)
As far as I can tell, this would solve all of our problems in the simplest possible way. Kmeans would work the same every time, nobody would have to try setting a seed, we don't need to worry about threading, etc. etc.
Tell me ASAP if you can see any problems here...
— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-144898430 .
I don't think it's a bad thing to do. It's totally pragmatic. Here's the thing:
we have absolutely no way to know what a 'good' kmeans seed is going to be, and most of the time it makes zero difference. I just tested it on 10 datasets and I couldn't find one where NOT setting the seed led to different kmeans results.
So really, by fixing it we are just addressing the small (but annoying for users) number of cases where kmeans results depend on the seed.
Also, adding YET ANOTHER commandline option is something I'm keen to avoid. I think one of the good things about PF is that we only present options to users that are actually useful. Whether this option is useful to users (as opposed to just hard-coding it) is highly questionable.
BTW, I know we don't have to make it an option, and that we can do something more fancy w.r.t. generating new seeds. I just don't think we need to. Too much complexity. More potential issues. Etc.
Yep, I agree it is totally pragmatically justifiable. Just do it.
On Fri, Oct 2, 2015 at 3:31 PM, roblanf notifications@github.com wrote:
BTW, I know we don't have to make it an option, and that we can do something more fancy w.r.t. generating new seeds. I just don't think we need to. Too much complexity. More potential issues. Etc.
— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-144901091 .
I agree. This sounds fine. I worked a but on it last night and it does get a bit confusing to keep track of all of the seeds if we make it an option.
On Thursday, October 1, 2015, Brett Calcott notifications@github.com wrote:
Yep, I agree it is totally pragmatically justifiable. Just do it.
On Fri, Oct 2, 2015 at 3:31 PM, roblanf <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:
BTW, I know we don't have to make it an option, and that we can do something more fancy w.r.t. generating new seeds. I just don't think we need to. Too much complexity. More potential issues. Etc.
— Reply to this email directly or view it on GitHub < https://github.com/brettc/partitionfinder/issues/36#issuecomment-144901091
.
— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-144901977 .
Sent from my iPad
Hi Rob,
interesting simple approach.
I already suggested in an earlier mail to use the same seed for all subsets, since I think the circumstances in which they are used are sufficiently different so taking the same seed does not matter.
I am not worried about the optimality of the kmeans clustering. I think it should be fine. I never had the idea to repeat a kmeans run in the hope to get a better AICc due to a different clustering. But this is the only reason I could think of where someone might want to specify the seed.
For my data sets I hope that they take less than 60 days, and I wont start it with different seed. My feeling is the difference should be smaller for larger data sets.
By the way: here is another idea to simplify the code. https://xkcd.com/221/
:)
Best Christoph
Am 02.10.2015 um 04:03 schrieb roblanf notifications@github.com:
hey all,
is there any reason anyone can see not to just fix the random number seed for all kmeans analyses to a single arbitrary value?
e.g.
kmeans_out = KMeans(init='k-means++', n_clusters=number_of_ks, n_init=100, n_jobs=n_jobs, random_state = 2147483647)
As far as I can tell, this would solve all of our problems in the simplest possible way. Kmeans would work the same every time, nobody would have to try setting a seed, we don't need to worry about threading, etc. etc.
Tell me ASAP if you can see any problems here...
— Reply to this email directly or view it on GitHub.
Dr. Christoph Mayer Email: c.mayer.zfmk@uni-bonn.de Tel.: +49 (0)228 9122 403
Zoologisches Forschungsmuseum Alexander Koenig
Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele Sitz: Bonn
Hi Brett,
Am 02.10.2015 um 04:26 schrieb Brett Calcott notifications@github.com:
cos it feels icky?
A bit. :)
I guess it would work fine. I don't think initialising a seed and storing it would be that hard either.
I would say that one seed for all subsets should be fine. For 99% of the people Robs seed should be fine. I hesitate to say, we need an option, even if it would be a bit more flexible to have one.
Best Christoph
Also: the kmeans step always happens in sequential order right? So no problems with threads taking different times, and changing the order that kmeans is called?
On Fri, Oct 2, 2015 at 3:03 PM, roblanf notifications@github.com wrote:
hey all,
is there any reason anyone can see not to just fix the random number seed for all kmeans analyses to a single arbitrary value?
e.g.
kmeans_out = KMeans(init='k-means++', n_clusters=number_of_ks, n_init=100, n_jobs=n_jobs, random_state = 2147483647)
As far as I can tell, this would solve all of our problems in the simplest possible way. Kmeans would work the same every time, nobody would have to try setting a seed, we don't need to worry about threading, etc. etc.
Tell me ASAP if you can see any problems here...
— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-144898430 .
— Reply to this email directly or view it on GitHub.
Dr. Christoph Mayer Email: c.mayer.zfmk@uni-bonn.de Tel.: +49 (0)228 9122 403
Zoologisches Forschungsmuseum Alexander Koenig
Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele Sitz: Bonn
Hi Paul,
I think that one seed per run should be fine. The different kmeans runs in one analysis should be too different so that this is no problem.
Best Christoph
Am 02.10.2015 um 11:43 schrieb Paul Frandsen notifications@github.com:
I agree. This sounds fine. I worked a but on it last night and it does get a bit confusing to keep track of all of the seeds if we make it an option.
On Thursday, October 1, 2015, Brett Calcott notifications@github.com wrote:
Yep, I agree it is totally pragmatically justifiable. Just do it.
On Fri, Oct 2, 2015 at 3:31 PM, roblanf <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:
BTW, I know we don't have to make it an option, and that we can do something more fancy w.r.t. generating new seeds. I just don't think we need to. Too much complexity. More potential issues. Etc.
— Reply to this email directly or view it on GitHub < https://github.com/brettc/partitionfinder/issues/36#issuecomment-144901091
.
— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-144901977 .
Sent from my iPad — Reply to this email directly or view it on GitHub.
Dr. Christoph Mayer Email: c.mayer.zfmk@uni-bonn.de Tel.: +49 (0)228 9122 403
Zoologisches Forschungsmuseum Alexander Koenig
Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele Sitz: Bonn
K-means is stochastic. This is fine, but to make sure we can replicate things, we should include a commandline option to set the random number seed, e.g.
--seed
which feeds through to the k-means output.
A sub-issue, is that we should record the seed (user specified or not) in the saved .cfg file, so that we can re-start and checkpoint easily (see issue 35 too: https://github.com/brettc/partitionfinder/issues/35)