flashxio / FlashX

FlashX is a collection of big data analytics tools that perform data analytics in the form of graphs and matrices.
http://flashx.io/
Apache License 2.0

test kmeans (speed and accuracy) #50

Closed zheng-da closed 9 years ago

disa-mhembere commented 9 years ago

Speed

  • Matrix Dim = 124836180 x 5
  • K = 5

FG (32 threads)

fg.kmeans (|| Lloyd): 23.146s, 47 iters

R

kmeans (MacQueen): 287.347s (12 times slower), 37 iters
kmeans (Lloyd): 421.840s (18 times slower), 56 iters

jovo commented 9 years ago

great. speed sounds good. how about accuracy?


disa-mhembere commented 9 years ago

Great question jovo. I wanted to ask you, Carey and Youngser how to best tell whether the implementation is performing well with regards to the quality of clusters. Da suggested a scheme sort of like accumulating pairwise cluster edit distances between the clusters generated by R and the one we receive with FG, but we are unsure as to whether this is a reasonable measure.

Thoughts on this?


jovo commented 9 years ago

the best way, imho, is to generate data that we know comes from a mixture of k spherically symmetric gaussians, and then compute the adjusted rand index (wiki knows what it is) between the truth and the result. you can, of course, also use real data for which you know the "true" clusterings, but the problem with that is that there are many different possible "true" clusterings in real data, so different algorithms can find equally good, but different "true" clusterings, and therefore get a poor score.

i've attached a classic paper on k-means clustering, in case that helps.

to choose the parameters of this gaussian mixture model, i'd just use what they used in the k-means++ or k-means|| paper (eg, section 4.1 of the attached).

make sense?
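jovo's suggested evaluation (sample from a known mixture of k spherically symmetric gaussians, then score the recovered clustering against the true labels) can be sketched with nothing but the Python standard library; the k, d, sigma, and cluster means below are made-up illustration values, not parameters from the k-means|| paper.

```python
import random

def sample_spherical_mixture(n_per_cluster, means, sigma, seed=0):
    """Sample from a mixture of spherically symmetric gaussians.

    Cluster c has mean means[c] and covariance sigma^2 * I, so every
    coordinate is an independent gauss(mean, sigma) draw.
    Returns (points, true_labels).
    """
    rng = random.Random(seed)
    points, labels = [], []
    for c, mean in enumerate(means):
        for _ in range(n_per_cluster):
            points.append([rng.gauss(m, sigma) for m in mean])
            labels.append(c)
    return points, labels

# Illustration values: k = 3 well-separated clusters in d = 2.
means = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
points, truth = sample_spherical_mixture(100, means, sigma=1.0)
print(len(points), len(truth))  # 300 300
```

The true labels come for free from the generator, which is exactly what makes the ARI comparison against the kmeans output possible.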


disa-mhembere commented 9 years ago

How to compute the adjusted rand index (ARI) makes sense from what I read. I don't fully understand (statistically) what exactly makes it a good measure, though.

So ARI near 1 is good and 0 and below is bad, but what is considered acceptable?

I can't see the attachments. Did you forget? You can just send the title if easier.

Thanks,


jovo commented 9 years ago

performance is relative, there is no absolute threshold. but yah, 1 is good, 0 is bad.

A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

talks about different metrics for clustering, in sec 6.

perhaps that helps explain why ARI is ok (note that none are perfect).


disa-mhembere commented 9 years ago

got it thanks


disa-mhembere commented 9 years ago

I fixed the race condition in kmeans so it is deterministic now.

Let me apologize in advance for all the questions, @jovo:

I need help generating the data from k spherically symmetric gaussians. I generally just use the distributions provided by the C++ <random> header (http://www.cplusplus.com/reference/random/). I'm not really familiar with what a "spherically symmetric gaussian" really is, other than what wiki says, so I would like some guidance please.

Also, if I understand correctly, to test with the adjusted rand index (ARI), I would:

  1. Generate x vectors of length d, with each vector's values sampled from one of the k spherically symmetric gaussians. This gives me my n (= k*x) samples that I want to cluster.
  2. Run kmeans to obtain k cluster vectors (i.e., the mean of each cluster's constituent vectors).
  3. Compare the true cluster assignments with the ones I compute so I can get the ARI.

With ARI don't I still need to know which cluster I detect corresponds to which in the truth, so I can compute ARI on a per-cluster basis? Or did I misunderstand what X and Y are on the wiki http://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index?

If I didn't misunderstand, then computing the contingency table requires |X_i \cap Y_i|. In my case X_i and Y_i are floats, so what is their intersection?

I'm also available to Skype/Hangout!

Thanks and uhh happy new year :)
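The three numbered steps above can be sketched end-to-end; the naive single-threaded Lloyd iteration below is only a stand-in for fg.kmeans (it is not the FlashGraph implementation), and the toy 1-d data plus deterministic first-k initialization are illustration choices.

```python
def lloyd_kmeans(points, k, iters=50):
    """Naive Lloyd's algorithm: alternate assignment and mean-update steps."""
    # Deterministic init for the sketch: first k points become the centers
    # (random or k-means++ initialization is what you'd use in practice).
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two obviously separated 1-d clusters; step 3 would feed these labels
# plus the known truth into the ARI.
points = [[i * 0.01] for i in range(10)] + [[100.0 + i * 0.01] for i in range(10)]
labels = lloyd_kmeans(points, k=2)
print(labels)  # first 10 points share one label, last 10 the other
```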


jovo commented 9 years ago

> I need help to generate the data from k spherically symmetric gaussians. I'm not really familiar with what a "spherically symmetric gaussian" really is other than what wiki says.

spherically symmetric means the covariance matrix is a scalar multiple of the identity (all diagonal entries equal, no off-diagonal terms), that is x ~ N(mean, sigma*Identity), for some sigma > 0.

thus, the level sets are spheres.

> Also, if I understand correctly, to test with the adjusted rand index (ARI), I would generate the samples, run kmeans, and compare against the true clusters.

> With ARI don't I still need to know which cluster I detect corresponds to which in the truth? Computing the contingency table requires |X_i \cap Y_i|, and in my case X_i and Y_i are floats, so what is their intersection?

X_i is the cluster # from kmeans, Y_i is the cluster # from the truth. does that make sense?
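For concreteness, that pair-counting view can be computed directly from two integer label vectors with a contingency table (no matching of cluster ids is needed, since ARI is invariant to relabeling); the tiny label vectors below are made-up illustration data and only the standard library is used.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(pred, truth):
    """ARI from the pair-counting contingency table.

    pred[i] and truth[i] are the integer cluster ids of sample i;
    the score is invariant to any relabeling of either side.
    """
    n = len(pred)
    # n_ij = |X_i ∩ Y_j|: samples with pred id i and truth id j.
    nij = Counter(zip(pred, truth))
    a = Counter(pred)   # row sums of the contingency table
    b = Counter(truth)  # column sums

    sum_ij = sum(comb(v, 2) for v in nij.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate: everything in one cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Identical partitions up to relabeling score 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```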


zheng-da commented 9 years ago

what are you doing, man? didn't you get the flu, and shouldn't you be in bed now?


zheng-da commented 9 years ago

you guys are living in chinese time :)

jovo commented 9 years ago

the time zone that i'm in changes every few hours so i can stay up and work all night :)


disa-mhembere commented 9 years ago

Is it weird that I'm not shocked jovo is still up haha!? Thanks for the response! I couldn't read your technical response because I did get into bed and am now on my phone. I will respond in a few hours when I get up but I think Da made a great point. I should just live on Chinese time permanently! And the flu medicine is obviously doing its job well :)



disa-mhembere commented 9 years ago

Getting back to this, jovo. I wrote the ARI testing procedure in R. Would you be able to send an R snippet showing how to generate the test data? I'm still having trouble generating spherically symmetric Gaussians.

Thanks

jovo commented 9 years ago

mvrnorm in R is the function.

for Sigma, you provide k*identity matrix, where k is some scalar that characterizes the variance.

so, each cluster has a different mean, but the same k*identity for the covariance matrix.

does that make sense?

an example is in the help page for mvrnorm: http://stat.ethz.ch/R-manual/R-devel/library/MASS/html/mvrnorm.html

but the covariance matrix there is not spherically symmetric.


jovo commented 9 years ago

for class 1:

```r
library("MASS")
mu1 <- c(1, 1)
Sigma <- 2 * matrix(c(1, 0, 0, 1), 2, 2)
mvrnorm(n = 50, mu1, Sigma)
```

for class 2, just change mu. make sense? if you want, you can sample each mu:

```r
muvec <- mvrnorm(NumClusters, c(0, 0), StanDev * diag(2))
```

that gives you NumClusters means, one per cluster.

does that help more?


disa-mhembere commented 9 years ago

Awesome! Very helpful. I didn't know about mvrnorm only rnorm.

Each class in this example will correspond to a cluster? The result of mvrnorm is a matrix, so if I use each row of a class as a row within the big matrix I am trying to cluster by row, I should be able to reconstruct each class, correct?

jovo commented 9 years ago

you should generate 1 matrix PER class. because each mvrnorm will output n samples from a particular gaussian, and each gaussian is one of the classes. so, then you concatenate all the matrices, and THEN cluster by row.

make sense?

the mclust function 'sim' will generate a mixture for you. to make all the clusters have the same spherically symmetric gaussian covariances, use the model "EII"

then you just have to specify the means (as i described before) and the probability of being in each class (which you can make simply 1/NumClusters for simplicity).

make sense?
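A rough sketch of what that mclust workflow might look like. The parameter shapes here (how `pro`, `mean`, and the `mclustVariance` structure are passed to `sim`) are assumptions from mclust's documented interface, so check `?sim` and `?mclustVariance` before relying on it:

```r
# Sketch (parameter shapes assumed -- see ?sim): sample an "EII" mixture,
# i.e. every component spherical with the same shared variance.
library(mclust)

NumClusters <- 2
d <- 2
n <- 500

pars <- list(
  pro      = rep(1 / NumClusters, NumClusters),  # equal class probabilities
  mean     = matrix(c(0, 0, 5, 5), nrow = d),    # one column of means per cluster
  variance = mclustVariance("EII", d = d, G = NumClusters)
)
pars$variance$sigmasq <- 1                       # shared spherical variance

draw  <- sim(modelName = "EII", parameters = pars, n = n, seed = 1)
truth <- draw[, 1]    # first column holds the true class labels
X     <- draw[, -1]   # remaining columns are the samples
```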


disa-mhembere commented 9 years ago

Mmmh ... This might be better done in person. I understood the first part, but not the mclust part. I thought I had all the data I needed already without mclust::sim. I don't understand what its doing.

i.e., I thought that for k = 2 this was sufficient:

```r
data <- rbind(mvrnorm(n, mu1, Sigma), mvrnorm(n, mu2, Sigma))
clusters <- fg.kmeans(data, k = 2)
```

Should I just come to your office the next day you're in?
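For reference, a minimal end-to-end sketch of that test, using R's built-in kmeans as a stand-in for fg.kmeans (which needs FlashGraph installed); the ARI here comes from mclust's `adjustedRandIndex`:

```r
# Sketch: two spherical Gaussians, cluster with base-R kmeans, score with ARI.
library(MASS)
library(mclust)
set.seed(1)

n     <- 100
Sigma <- diag(2)                                 # spherical, unit-variance covariance
data  <- rbind(mvrnorm(n, mu = c(0, 0), Sigma),
               mvrnorm(n, mu = c(5, 5), Sigma))
truth <- rep(1:2, each = n)

fit <- kmeans(data, centers = 2)                 # swap in fg.kmeans to compare
ari <- adjustedRandIndex(fit$cluster, truth)     # near 1 for well-separated clusters
```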

jovo commented 9 years ago

I think that would work. mclust was just another option. sorry to confuse.


disa-mhembere commented 9 years ago

I'm getting super low ARIs even for R's kmeans, but both fg.kmeans and R's kmeans have similar values, which is good; I think the data I generate is wrong.

I assumed the data would look linearly separable, but it does not. Could you take a look at the gist https://gist.github.com/dmhembe1/da5aa83ba4409b7f909a? It will run with/without FG installed, so please run it to visualize, or see the attachments.

Thanks


jovo commented 9 years ago

start by having just 2 clusters, plot the samples in 2 different colors, and make sure that the cluster variance is smaller than the variance between means, and show me that plot?
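A sketch of that sanity check: two clusters whose means are far apart relative to the (spherical) within-cluster variance, plotted by true label before any clustering is run:

```r
# Sketch: generate 2 well-separated spherical clusters and plot the truth.
library(MASS)
set.seed(1)

n     <- 200
Sigma <- diag(2)                        # within-cluster sd = 1 in each dimension
X <- rbind(mvrnorm(n, mu = c(0, 0), Sigma),
           mvrnorm(n, mu = c(6, 6), Sigma))   # between-means distance >> within-cluster sd
truth <- rep(1:2, each = n)

plot(X, col = truth, pch = 19,
     main = "Truth: 2 spherical Gaussians", xlab = "x1", ylab = "x2")
```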


disa-mhembere commented 9 years ago

[...] and make sure that the cluster variance is smaller than the variance between means,

Is this in reference to the truth data i.e., before clustering?

jovo commented 9 years ago

yah, truth. always look at truth prior to trying to run any analysis.


disa-mhembere commented 9 years ago

Ok, that seems to have worked, but the data is not always as linearly separable as in the 2-cluster case (see attachments). I think it's reasonable to say that if our fg.kmeans is within ~5% of R's ARI on the data, then it can pass the test. @icoming does that sound reasonable now that we are getting high ARIs? In fact, on all the data I have tested, fg is better than or equal to the R implementation.

jovo commented 9 years ago

where are attachments?

disa-mhembere commented 9 years ago

I can see them in the email chain. If you use Gmail rather than GitHub you should see them.


disa-mhembere commented 9 years ago

![2clres](https://cloud.githubusercontent.com/assets/1793828/5810330/971f44da-a012-11e4-83cb-e87598cdb52a.png)
![2cltruth](https://cloud.githubusercontent.com/assets/1793828/5810332/9721159e-a012-11e4-9ab8-373c08a9708b.png)
![5clres](https://cloud.githubusercontent.com/assets/1793828/5810331/9720cad0-a012-11e4-9ecc-8298b9197a4e.png)
![5cltruth](https://cloud.githubusercontent.com/assets/1793828/5810333/97264fbe-a012-11e4-9a45-14e853b4de62.png)

Here are the images in github as well now.

jovo commented 9 years ago

I'm pretty satisfied. worth checking with a known data set too. try the "iris" data, it is built in to R. then we know what kmeans should do.
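A minimal sketch of that iris check: cluster the four measurement columns with k-means and score against the known species labels (`adjustedRandIndex` from mclust):

```r
# Sketch: k-means on the built-in iris data, scored with ARI against species.
library(mclust)
set.seed(1)

fit <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
ari <- adjustedRandIndex(fit$cluster, iris$Species)   # roughly 0.73 for base-R kmeans
```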


disa-mhembere commented 9 years ago

Ok, on "iris" the result is:

```
[1] "The ARI of R.kms = 0.73023827228347"
[1] "The ARI of fg.kms = 0.707362639117314"
```

We are a little worse but it's within reason.