GGiecold-zz / Cluster_Ensembles

A package for combining multiple partitions into a consolidated clustering. The combinatorial optimization problem of obtaining such a consensus clustering is reformulated in terms of approximation algorithms for graph or hyper-graph partitioning.
MIT License

Error No such file or directory: 'wgraph_HGPA.part.16' #8

Closed puli83 closed 7 years ago

puli83 commented 7 years ago

Hi,

I tried to run the example (also on other data), but I always get this error.

Do you have any idea how I can solve it?


```python
cluster_runs = np.random.randint(0, 50, (50, 15000))

consensus_clustering_labels = CE.cluster_ensembles(cluster_runs, verbose = True, N_clusters_max = 50)
```

```
INFO: Cluster_Ensembles: cluster_ensembles: due to a rather large number of cells in your data-set, using only 'HyperGraph Partitioning Algorithm' (HGPA) and 'Meta-CLustering Algorithm' (MCLA) as ensemble consensus functions.
INFO: Cluster_Ensembles: HGPA: consensus clustering using HGPA.
INFO: Cluster_Ensembles: wgraph: writing wgraph_HGPA.
INFO: Cluster_Ensembles: wgraph: 15000 vertices and 2500 non-zero hyper-edges.
INFO: Cluster_Ensembles: sgraph: calling shmetis for hypergraph partitioning.
Traceback (most recent call last):
  File "", line 1, in
    consensus_clustering_labels = CE.cluster_ensembles(cluster_runs, verbose = True, N_clusters_max = 50)
  File "/usr/local/lib/python2.7/dist-packages/Cluster_Ensembles/Cluster_Ensembles.py", line 300, in cluster_ensembles
    cluster_ensemble.append(consensus_functions[i](hdf5_file_name, cluster_runs, verbose, N_clusters_max))
  File "/usr/local/lib/python2.7/dist-packages/Cluster_Ensembles/Cluster_Ensembles.py", line 648, in HGPA
    return hmetis(hdf5_file_name, N_clusters_max)
  File "/usr/local/lib/python2.7/dist-packages/Cluster_Ensembles/Cluster_Ensembles.py", line 973, in hmetis
    labels = sgraph(N_clusters_max, file_name)
  File "/usr/local/lib/python2.7/dist-packages/Cluster_Ensembles/Cluster_Ensembles.py", line 1201, in sgraph
    with open(out_name, 'r') as file:
IOError: [Errno 2] No such file or directory: 'wgraph_HGPA.part.50'
```

puli83 commented 7 years ago

I solved the problem by reading the installation instructions more carefully. Sorry.

GGiecold-zz commented 7 years ago

Glad to learn you're interested in using this package and managed to install it.

With kind regards,

Gregory

puli83 commented 7 years ago

Hi,

I'm wondering whether Cluster_Ensembles can be used to find the best number of clusters k.

I ran 100 different clusterings, so I have 100 different cluster label vectors. Each cluster label vector has between k = 2 and k = 20 clusters.

I have two choices:

1) I feed the CE.cluster_ensembles() Python function with all 100 cluster label vectors, but I repeat the call 19 times, changing the parameter N_clusters_max from 2 to 20 (one call for each possible value of k).

This gives me 19 different ANMI values for MCLA (and the other algorithms), one for each call with N_clusters_max = k.

Can I use this information to pick the best value of k? In other words, should the maximum ANMI indicate the best number of clusters to use?

2) I repeat the same experiment, but instead of feeding CE.cluster_ensembles() with all 100 cluster label vectors, I select a subset according to the number of clusters each one contains. For example, for N_clusters_max = 2 I take only the cluster label vectors with k = 2, excluding those with k = 3, k = 4, etc.

I suppose option 1) is the better solution, but I wonder whether this method can really be used to find the best number of k.

Thank you for answering, if possible.


GGiecold-zz commented 7 years ago

The parameter N_clusters_max is an upper bound on the number of different clusters that might appear in the consensus clustering computed from an ensemble of partitions.

In the case at hand, you have partitions consisting of up to 20 distinct groupings. A call to cluster_ensembles applies 3 different heuristic algorithms to the graphs and hypergraphs associated with the ensemble of partitions provided as input. Those (hyper-)graph partitioning heuristics are dubbed CSPA, HGPA and MCLA. Each of them produces a score, namely the average of the mutual information between each partition from your ensemble of clusterings and the consensus clustering obtained by attempting to solve a hard combinatorial optimization problem via the said (hyper-)graph partitioning. Those 3 scores are then compared, and the solution produced by the highest-scoring of CSPA, HGPA and MCLA is chosen as the final consensus clustering. For the set of problems we were facing in my former computational biology group, MCLA was usually the highest performer.
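As a rough illustration of that scoring (not the package's internal code), the average normalized mutual information between a candidate consensus labelling and an ensemble could be computed along the following lines, with cluster_runs holding one partition per row:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def average_nmi(consensus_labels, cluster_runs):
    """Average normalized mutual information (ANMI) between one candidate
    consensus clustering and every partition of the ensemble."""
    scores = [normalized_mutual_info_score(run, consensus_labels)
              for run in cluster_runs]
    return float(np.mean(scores))
```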

All in all, if you are confident that all clusterings in an ensemble can be trusted to provide meaningful and not too noisy perspectives on your dataset, it would be advisable to pool them all into an ensemble and let Cluster_Ensembles decide on the best consensus clustering (this includes the optimal number of clusters). This is typically done by leaving the parameter N_clusters_max unspecified (it will default to the highest number of clusters found in any partition of your ensemble). Note that this recommendation applies if you have several partitions with the same number of clusters. For instance, you might have carried out several runs of k-means with k set to 3 and realized that each produces a rather different partition of your dataset. This is expected, and Cluster_Ensembles is meant to address this situation by producing a consensus clustering with improved statistical generalization. The same holds when pooling partitions with different numbers of clusters (a point unfortunately not illustrated in our publication, where we focused on pooling partitions with an identical number of clusters).
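A minimal sketch of that recommended usage, pooling every partition into one array and leaving N_clusters_max unspecified (list_of_label_vectors is a hypothetical placeholder for your collection of label vectors):

```python
import numpy as np
import Cluster_Ensembles as CE

# One partition per row; partitions with different numbers of clusters
# can be stacked together into the same ensemble.
cluster_runs = np.vstack(list_of_label_vectors)  # hypothetical list of 1-D label arrays

# With N_clusters_max left unspecified, it defaults to the largest number
# of clusters found in any partition of the ensemble.
consensus_clustering_labels = CE.cluster_ensembles(cluster_runs, verbose=True)
```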

My understanding is that, for a good signal-to-noise ratio, the strategy you describe in point 2) is likely to result in a consensus clustering similar to the one obtained via your point 1) or the equivalent approach described in the paragraph above. I would not advise it, though: it reduces the size of your ensemble, which will in turn affect MCLA by increasing the noise in the process of assigning each datum to a meta-cluster.

All the best with your research undertakings,

Gregory


puli83 commented 7 years ago

Hi,

Thank you very much for your answer. It's helpful.

There is still something that is not clear to me; let me give you more details.

I have real data from newspapers: 2500 newspaper articles described by 23000 stemmed words. 1) I run LSA with 300 components to reduce dimensionality. 2) I normalize the vectors and run spherical k-means initialized with k-means++, for k from 2 to 50, 10 times for each k. 3) I now have several partitions, 10 for each value of k, for a total of 490 different partitions.
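In code, that pipeline would look roughly like this (a scikit-learn sketch; doc_term_matrix is a placeholder name for the 2500 x 23000 document-term matrix, and k-means on L2-normalized vectors stands in for spherical k-means):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# 1) LSA: reduce the 2500 x 23000 document-term matrix to 300 dimensions.
lsa = TruncatedSVD(n_components=300)
reduced = lsa.fit_transform(doc_term_matrix)  # doc_term_matrix: placeholder

# 2) L2-normalize the rows (k-means on unit-norm vectors approximates spherical k-means).
reduced = normalize(reduced)

# 3) 10 k-means++ runs for every k from 2 to 50 -> 490 partitions.
partitions = []
for k in range(2, 51):
    for seed in range(10):
        km = KMeans(n_clusters=k, init='k-means++', n_init=1, random_state=seed)
        partitions.append(km.fit_predict(reduced))

cluster_runs = np.array(partitions)  # shape: (490, 2500), one partition per row
```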

a) According to what you explained before, I can feed strategy 2) with the 10 partitions at k = 2 for N_clusters_max = 2, the 10 partitions at k = 3 for N_clusters_max = 3, and so on. b) But I can also feed strategy 1) with all 490 partitions ("the same holds when pooling partitions with different numbers of clusters"), c) because "2) is likely to result in a similar consensus clustering as the one obtained via your point 1)".

Now, I expected one thing, but something else happened. Maybe I am doing something wrong, but if the best k is chosen by cluster_ensembles, why is the average mutual information different depending on whether I follow strategy 1) or 2), giving me a different best value of k? For example, if I set N_clusters_max = None, the largest k is always returned as the best, i.e. k = 50 in my case. But I want the BEST k to be chosen. Alternatively, if I extract the average mutual information of the MCLA consensus for each specified N_clusters_max (= 2, = 3, ..., = n), I find that k = 44 gives the MCLA with the highest average mutual information, so k = 44 should be the best k and should be chosen. As I understand it, the same should happen when N_clusters_max = None.

These are some tests:

4) I call CE.cluster_ensembles() with N_clusters_max = None, feeding it all 490 partitions. I get this kind of output at the end:

```
INFO: Cluster_Ensembles: MCLA: delivering 50 clusters.
INFO: Cluster_Ensembles: MCLA: average posterior probability is 0.0312013447061
INFO: Cluster_Ensembles: cluster_ensembles: MCLA at 0.642476878275.
```

5) I run separate CE.cluster_ensembles() calls with a specific value of N_clusters_max, from 2 to 50, feeding each call all 490 partitions.

I attach a plot of the results. Here are some of them:

```
[(b'CSPA', 49, 0.61262244071635907), (b'HGPA', 49, 0.48564481818856481), (b'MCLA', 49, 0.65322653517297224)],
[(b'CSPA', 48, 0.6108469189488881), (b'HGPA', 48, 0.50390445428995523), (b'MCLA', 48, 0.64061304679294828)],
[(b'CSPA', 46, 0.5938184999934204), (b'HGPA', 46, 0.48515758475999798), (b'MCLA', 46, 0.64731992002628413)],
[(b'CSPA', 44, 0.60673453526918586), (b'HGPA', 44, 0.48732650891378437), (b'MCLA', 44, 0.6566595470531067)],
[(b'CSPA', 34, 0.60836816873145905), (b'HGPA', 34, 0.49802519570639958), (b'MCLA', 34, 0.64419747246985004)],
```

6) I run separate CE.cluster_ensembles() calls with a specific value of N_clusters_max, from 2 to 50, feeding each call only the 10 corresponding partitions with k = N_clusters_max (so the k = 2 partitions when N_clusters_max = 2, the k = 3 partitions when N_clusters_max = 3, etc.). The results are very different, which I expected, and the best average mutual information corresponds to k = 3 in this case.

I attach the plot of the results.

In the end, it is not clear to me how to use cluster_ensembles to find the best k. I know that for each k I get a best grouping, but which strategy should I use to find the best k? I also observe, in the first plot (point 5), that the curve of the highest MCLA average mutual information is very flat after k = 20 to 25. To me that means that beyond that point the partitions are similar in terms of the optimization. But I wonder, can I use the elbow method here, picking a partition between k = 20 and k = 25?

What do you think about all this? How should I interpret these results?

Thank you very much for your attention. I hope you can answer me.

Have a nice day

GGiecold-zz commented 7 years ago

Could you please provide the plots referred to in your email?

Comparing the ANMI computed for all partitions with, e.g., k=3 to the ANMI resulting from an ensemble encompassing all partitions with, say, k=5 is unwarranted. The partitions in the first group might be more similar to each other, whereas those from the second set might exhibit more within-group variability. This is bound to guarantee a higher ANMI score for the first group but is not indicative of k=3 being a better description of your dataset than k=5.

The whole issue is tricky but I would still advise pooling all your partitions into a big ensemble, possibly after removing some partitions. For instance, if all k=10 partitions are almost identical, keep just a few.
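As an illustrative sketch of such pruning (not a feature of Cluster_Ensembles), one could keep a partition only if it is not nearly identical, in adjusted Rand index terms, to a partition already retained:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def prune_near_duplicates(partitions, threshold=0.95):
    """Drop partitions whose adjusted Rand index with an already-kept
    partition exceeds `threshold` (i.e. near-duplicates)."""
    kept = []
    for labels in partitions:
        if all(adjusted_rand_score(labels, other) < threshold for other in kept):
            kept.append(labels)
    return np.array(kept)
```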

Gregory

puli83 commented 7 years ago

[embedded image 1][embedded image 2] These are the plots, can you see them now?

I'm reading your answer now.


GGiecold-zz commented 7 years ago

I cannot see them on my smartphone.

Gregory

puli83 commented 7 years ago

OK, so I understand it's better to feed the algorithm a big ensemble of partitions rather than only the 10 partitions corresponding to a given k.

One of the plots shows that the average mutual information flattens out after a certain value of k: the differences between k = 23 and k = 44 are very small in terms of the highest average mutual information. I think this shows that, with a big ensemble of partitions (490 of them), the consensus clustering reaches a ceiling at a certain k. Maybe I can use this to choose the best number of clusters while avoiding overly complex partitions (those with a large k, e.g. k = 44 or k = 50).

I am just not yet sure whether it is normal that, with N_clusters_max = None, the highest average mutual information (MCLA at 0.642476878275) corresponds to k = 50, whereas when I extract the average mutual information for each specific N_clusters_max = n, the highest value (MCLA, 44, 0.6566595470531067) corresponds to k = 44. I mean, I would expect the highest average mutual information at k = 50, wouldn't you?

It would be nice if you could take a look at the plots.

In any case, thank you very much for your answers.


GGiecold-zz commented 7 years ago

About the plot levelling off after some value of k: yes, you could pick a value of k corresponding to the shoulder region. However, much as in a gap statistics evaluation, I would recommend performing several runs of Cluster_Ensembles on your ensemble of partitions, then plotting the average of the ANMI curves thereby obtained and selecting the smallest value of k such that

avg_ANMI[k + 1] < avg_ANMI[k] + alpha * std_ANMI[k],

where alpha is generally set to 1. I have personally resorted to this kind of analysis with Cluster_Ensembles, with alpha at times chosen to be as small as 0.2. Hastie et al. themselves mention in the gap statistics paper that the choice of alpha is a rather moot issue.
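A minimal sketch of that selection rule, assuming hypothetical dictionaries avg_anmi and std_anmi that map each candidate k to the mean and standard deviation of the ANMI over several Cluster_Ensembles runs:

```python
def select_k(avg_anmi, std_anmi, alpha=1.0):
    """Smallest k such that avg_anmi[k + 1] < avg_anmi[k] + alpha * std_anmi[k].

    avg_anmi and std_anmi are assumed to be dicts keyed by consecutive
    integer values of k (e.g. 2, 3, ..., 50)."""
    ks = sorted(avg_anmi)
    for k in ks[:-1]:
        if avg_anmi[k + 1] < avg_anmi[k] + alpha * std_anmi[k]:
            return k
    return ks[-1]  # no shoulder found: fall back to the largest k examined
```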

All the best,

Gregory

GGiecold-zz commented 7 years ago

Addendum: you'll find a reliable Python implementation of the gap statistics in the PySCUBA package; please check it out on GitHub. Some of my colleagues at Harvard and the DFCI reported much more satisfactory performance than with an R implementation they had formerly been using.

Gregory

puli83 commented 7 years ago

hi,

Thank you very much for all these information. I will try the gap statistic in python. I had problem with an R implementation of that.

I found cluster ensemble a nice method to reclustering data, and in a certain sense to find the optimal number of k. I will try to compare gap statistic with cluster ensemble selection of optimal k. Anyway, I understand that this method is better used for reclustering, for optimize partition from knowledge reuse. I found it very interesting. I will present your cluster ensemble to my team here in UQAM university, also because the concept of "consensus clustering" is interesting, and your method seems to really have good performance.

Thank you for all the time you have given me, and for all the work done on this Python package; it is very much appreciated.

All the best,


GGiecold-zz commented 7 years ago

Hello.

No problem about providing some more guidelines if need be. We found this method to have good performance on much noisier datasets than you are currently handling.

All the best,

Gregory
