brettc / partitionfinder

PartitionFinder discovers optimal partitioning schemes for DNA sequences.
Other
61 stars 44 forks source link

Include a --seed option #36

Closed roblanf closed 9 years ago

roblanf commented 9 years ago

K-means is stochastic. This is fine, but to make sure we can replicate things, we should include a commandline option to set the random number seed, e.g.

--seed

which feeds through to the k-means output.

A sub-issue, is that we should record the seed (user specified or not) in the saved .cfg file, so that we can re-start and checkpoint easily (see issue 35 too: https://github.com/brettc/partitionfinder/issues/35)

pbfrandsen commented 9 years ago

Good idea.

There are several individual runs of k-means throughout the algorithm, shall we run them all with the same random number seed? Or design some method of storing a bunch of random number seeds, e.g. with the subset object.

roblanf commented 9 years ago

We should be able to deterministically generate all random number seeds from a single input seed.

On 4 February 2015 at 07:01, Paul Frandsen notifications@github.com wrote:

Good idea.

There are several individual runs of k-means throughout the algorithm, shall we run them all with the same random number seed? Or design some method of storing a bunch of random number seeds, e.g. with the subset object.

— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-72701124 .

Rob Lanfear School of Biological Sciences, Macquarie University, Sydney

phone: +61 (0)2 9850 8204

www.robertlanfear.com

brettc commented 9 years ago

Using "one seed to rule them all" should work, but repeatability gets a little harder with threading, as the order of drawing random numbers might change over runs. I suspect we might need to allocate separate random number generators (not just seeds) for each process.

roblanf commented 9 years ago

Eeek.

Let's keep the threading part of this low down the list - mostly the seed option would be so that we could nail down bugs if we really want to.

There are a lot of other issues that are more important than this.

On 4 February 2015 at 09:26, Brett Calcott notifications@github.com wrote:

Using "one seed to rule them all" should work, but repeatability gets a little harder with threading, as things the order of drawing random numbers might change over runs. I suspect we might need to allocate separate random number generators (not just seeds) for each process.

— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-72727399 .

Rob Lanfear School of Biological Sciences, Macquarie University, Sydney

phone: +61 (0)2 9850 8204

www.robertlanfear.com

cmayer commented 9 years ago

As far as I understood Paul, the k-means clustering is rather fast.

A problem is that threading could be used rather straight forward by taking care of different "focal subsets" in different threads. But then we will run into problems with the seed. As Brett suggested we can use seed for each thread, chosen in some simple way. For example: take smallest site index of subset and multiply it with the number of sites in the subset. That could be sufficiently good as a seed for this subset. Of course the random number generator must be local to that thread.

Best Christoph

Am 03.02.2015 um 21:44 schrieb roblanf:

Eeek.

Let's keep the threading part of this low down the list - mostly the seed option would be so that we could nail down bugs if we really want to.

There are a lot of other issues that are more important than this.

On 4 February 2015 at 09:26, Brett Calcott notifications@github.com wrote:

Using "one seed to rule them all" should work, but repeatability gets a little harder with threading, as things the order of drawing random numbers might change over runs. I suspect we might need to allocate separate random number generators (not just seeds) for each process.

— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-72727399 .

Rob Lanfear School of Biological Sciences, Macquarie University, Sydney

phone: +61 (0)2 9850 8204

www.robertlanfear.com — Reply to this email directly or view it on GitHub.


Dr. Christoph Mayer Email: c.mayer.zfmk@uni-bonn.de Tel.: +49 (0)228 9122 403

Zoologisches Forschungsmuseum Alexander Koenig

Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele Sitz: Bonn


cmayer commented 9 years ago

Hi,

I forgot something in my suggestion: The seed per thread must differ according to the seed specified by the user. So one could take: index *size + user_seed

Then different runs can have different seeds or the same, depending on the user_seed.

Best Christoph

Am 03.02.2015 um 22:06 schrieb Christoph Mayer:

As far as I understood Paul, the k-means clustering is rather fast.

A problem is that threading could be used rather straight forward by taking care of different "focal subsets" in different threads. But then we will run into problems with the seed. As Brett suggested we can use seed for each thread, chosen in some simple way. For example: take smallest site index of subset and multiply it with the number of sites in the subset. That could be sufficiently good as a seed for this subset. Of course the random number generator must be local to that thread.

Best Christoph

Am 03.02.2015 um 21:44 schrieb roblanf:

Eeek.

Let's keep the threading part of this low down the list - mostly the seed option would be so that we could nail down bugs if we really want to.

There are a lot of other issues that are more important than this.

On 4 February 2015 at 09:26, Brett Calcott notifications@github.com wrote:

Using "one seed to rule them all" should work, but repeatability gets a little harder with threading, as things the order of drawing random numbers might change over runs. I suspect we might need to allocate separate random number generators (not just seeds) for each process.

— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-72727399 .

Rob Lanfear School of Biological Sciences, Macquarie University, Sydney

phone: +61 (0)2 9850 8204

www.robertlanfear.com — Reply to this email directly or view it on GitHub.


Dr. Christoph Mayer Email: c.mayer.zfmk@uni-bonn.de Tel.: +49 (0)228 9122 403

Zoologisches Forschungsmuseum Alexander Koenig

  • Leibniz Institut für Biodiversität der Tiere - Adenauerallee 160 53113 Bonn, Germany www.zfmk.de

Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele Sitz: Bonn



Dr. Christoph Mayer Email: c.mayer.zfmk@uni-bonn.de Tel.: +49 (0)228 9122 403

Zoologisches Forschungsmuseum Alexander Koenig

Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele Sitz: Bonn


roblanf commented 9 years ago

hey all,

is there any reason anyone can see not to just fix the random number seed for all kmeans analyses to a single arbitrary value?

e.g.

    kmeans_out = KMeans(init='k-means++', n_clusters=number_of_ks,
            n_init=100, n_jobs=n_jobs, random_state = 2147483647)

As far as I can tell, this would solve all of our problems in the simplest possible way. Kmeans would work the same every time, nobody would have to try setting a seed, we don't need to worry about threading, etc. etc.

Tell me ASAP if you can see any problems here...

brettc commented 9 years ago

cos it feels icky?

I guess it would work fine. I don't think initialising a seed and storing it would be that hard either.

Also: the kmeans step always happens in sequential order right? So no problems with threads taking different times, and changing the order that kmeans is called?

On Fri, Oct 2, 2015 at 3:03 PM, roblanf notifications@github.com wrote:

hey all,

is there any reason anyone can see not to just fix the random number seed for all kmeans analyses to a single arbitrary value?

e.g.

kmeans_out = KMeans(init='k-means++', n_clusters=number_of_ks,
        n_init=100, n_jobs=n_jobs, random_state = 2147483647)

As far as I can tell, this would solve all of our problems in the simplest possible way. Kmeans would work the same every time, nobody would have to try setting a seed, we don't need to worry about threading, etc. etc.

Tell me ASAP if you can see any problems here...

— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-144898430 .

roblanf commented 9 years ago

I don't think it's a bad thing to do. It's totally pragmatic. Here's the thing:

we have absolutely no way to know what a 'good' kmeans seed is going to be, and most of the time it makes zero difference. I just tested it on 10 datasets and I couldn't find one where NOT setting the seed led to different kmeans results.

So really, by fixing it we are just addressing the small (but annoying for users) number of cases where kmeans results depend on the seed.

Also, adding YET ANOTHER commandline option is something I'm keen to avoid. I think one of the good things about PF is that we only present options to users that are actually useful. Whether this option is useful to users (as opposed to just hard-coding it) is highly questionable.

roblanf commented 9 years ago

BTW, I know we don't have to make it an option, and that we can do something more fancy w.r.t. generating new seeds. I just don't think we need to. Too much complexity. More potential issues. Etc.

brettc commented 9 years ago

Yep, I agree it is totally pragmatically justifiable. Just do it.

On Fri, Oct 2, 2015 at 3:31 PM, roblanf notifications@github.com wrote:

BTW, I know we don't have to make it an option, and that we can do something more fancy w.r.t. generating new seeds. I just don't think we need to. Too much complexity. More potential issues. Etc.

— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-144901091 .

pbfrandsen commented 9 years ago

I agree. This sounds fine. I worked a but on it last night and it does get a bit confusing to keep track of all of the seeds if we make it an option.

On Thursday, October 1, 2015, Brett Calcott notifications@github.com wrote:

Yep, I agree it is totally pragmatically justifiable. Just do it.

On Fri, Oct 2, 2015 at 3:31 PM, roblanf <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

BTW, I know we don't have to make it an option, and that we can do something more fancy w.r.t. generating new seeds. I just don't think we need to. Too much complexity. More potential issues. Etc.

— Reply to this email directly or view it on GitHub < https://github.com/brettc/partitionfinder/issues/36#issuecomment-144901091

.

— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-144901977 .

Sent from my iPad

roblanf commented 9 years ago

Done.

https://github.com/brettc/partitionfinder/commit/ff9dadf02178f677d14a279e7524d1f4eb22c496

cmayer commented 9 years ago

Hi Rob,

interesting simple approach.

I already suggested in an earlier mail to use the same seed for all subsets, since I think the circumstances in which they are used are sufficiently different so taking the same seed does not matter.

I am not worried about the optimality of the kmeans clustering. I think it should be fine. I never had the idea to repeat a kmeans run in the hope to get a better AICc due to a different clustering. But this is the only reason I could think of where someone might want to specify the seed.

For my data sets I hope that they take less than 60 days, and I wont start it with different seed. My feeling is the difference should be smaller for larger data sets.

By the way: here is another idea to simplify the code. https://xkcd.com/221/

:)

Best Christoph

Am 02.10.2015 um 04:03 schrieb roblanf notifications@github.com:

hey all,

is there any reason anyone can see not to just fix the random number seed for all kmeans analyses to a single arbitrary value?

e.g.

kmeans_out = KMeans(init='k-means++', n_clusters=number_of_ks,
        n_init=100, n_jobs=n_jobs, random_state = 2147483647)

As far as I can tell, this would solve all of our problems in the simplest possible way. Kmeans would work the same every time, nobody would have to try setting a seed, we don't need to worry about threading, etc. etc.

Tell me ASAP if you can see any problems here...

— Reply to this email directly or view it on GitHub.


Dr. Christoph Mayer Email: c.mayer.zfmk@uni-bonn.de Tel.: +49 (0)228 9122 403

Zoologisches Forschungsmuseum Alexander Koenig

Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele Sitz: Bonn


cmayer commented 9 years ago

Hi Brett,

Am 02.10.2015 um 04:26 schrieb Brett Calcott notifications@github.com:

cos it feels icky?

A bit. :)

I guess it would work fine. I don't think initialising a seed and storing it would be that hard either.

I would say that one seed for all subsets should be fine. For 99% of the people Robs seed should be fine. I hesitate to say, we need an option, even if it would be a bit more flexible to have one.

Best Christoph

Also: the kmeans step always happens in sequential order right? So no problems with threads taking different times, and changing the order that kmeans is called?

On Fri, Oct 2, 2015 at 3:03 PM, roblanf notifications@github.com wrote:

hey all,

is there any reason anyone can see not to just fix the random number seed for all kmeans analyses to a single arbitrary value?

e.g.

kmeans_out = KMeans(init='k-means++', n_clusters=number_of_ks, n_init=100, n_jobs=n_jobs, random_state = 2147483647)

As far as I can tell, this would solve all of our problems in the simplest possible way. Kmeans would work the same every time, nobody would have to try setting a seed, we don't need to worry about threading, etc. etc.

Tell me ASAP if you can see any problems here...

— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-144898430 .

— Reply to this email directly or view it on GitHub.


Dr. Christoph Mayer Email: c.mayer.zfmk@uni-bonn.de Tel.: +49 (0)228 9122 403

Zoologisches Forschungsmuseum Alexander Koenig

Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele Sitz: Bonn


cmayer commented 9 years ago

Hi Paul,

I think that one seed per run should be fine. The different kmeans runs in one analysis should be too different so that this is no problem.

Best Christoph

Am 02.10.2015 um 11:43 schrieb Paul Frandsen notifications@github.com:

I agree. This sounds fine. I worked a but on it last night and it does get a bit confusing to keep track of all of the seeds if we make it an option.

On Thursday, October 1, 2015, Brett Calcott notifications@github.com wrote:

Yep, I agree it is totally pragmatically justifiable. Just do it.

On Fri, Oct 2, 2015 at 3:31 PM, roblanf <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

BTW, I know we don't have to make it an option, and that we can do something more fancy w.r.t. generating new seeds. I just don't think we need to. Too much complexity. More potential issues. Etc.

— Reply to this email directly or view it on GitHub < https://github.com/brettc/partitionfinder/issues/36#issuecomment-144901091

.

— Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/36#issuecomment-144901977 .

Sent from my iPad — Reply to this email directly or view it on GitHub.


Dr. Christoph Mayer Email: c.mayer.zfmk@uni-bonn.de Tel.: +49 (0)228 9122 403

Zoologisches Forschungsmuseum Alexander Koenig

Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele Sitz: Bonn