WheatonCS / Lexos

Python/Flask-based website for text analysis workflow. Previous (stable) release is live at:
http://lexos.wheatoncollege.edu
MIT License

About silhouette_score #203

Closed czhang03 closed 9 years ago

czhang03 commented 9 years ago

Austin and I have spent a long time on this. The scipy documentation is terrible: it uses variables that we cannot find any reference to.

For now we can print from the back-end how the clusters are divided, but we cannot figure out the behaviour. It seems unwilling to put some of the segments into different clusters no matter what settings you give it.

I think for now we can show the user how the segments are divided into clusters, so the user knows what they are looking at.
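A minimal sketch of what "showing how the segments are divided" could look like, assuming the clusters come from scipy's `linkage` and `fcluster`. The segment names, the feature values, and the helper `cluster_members` are made up for illustration and are not taken from the Lexos code.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical data: five segments described by two made-up features.
names = ["seg1", "seg2", "seg3", "seg4", "seg5"]
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])

# Build the hierarchy (the same kind of matrix a dendrogram is drawn from).
Z = linkage(X, method="average", metric="euclidean")


def cluster_members(Z, names, k):
    """Group segment names by flat-cluster label, asking for at most k clusters."""
    labels = fcluster(Z, t=k, criterion="maxclust")
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[int(label)].append(name)
    return dict(groups)


groups = cluster_members(Z, names, 2)
for label, members in sorted(groups.items()):
    print(label, members)
```

One thing that may be worth checking (a guess, not a diagnosis): `fcluster` defaults to `criterion="inconsistent"`, and with `criterion="maxclust"` the value `t` is only an upper bound on the number of clusters, so either setting can return fewer clusters than expected.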


In the future I think we can rewrite the silhouette score algorithm ourselves, and let the user customize the clusters by choosing how deep they want to go into the dendrogram.

The silhouette_score algorithm is not hard to write anyway.
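To give a sense of the size of such a rewrite, here is a rough sketch of the standard silhouette definition in plain Python. It assumes Euclidean distance and scores singleton clusters as 0; the function name `silhouette_score_manual` is hypothetical, and this is not the scikit-learn implementation.

```python
import math


def silhouette_score_manual(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance assumed)."""
    # Map each cluster label to the indices of its members.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0 by convention
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:  # identical points would give 0/0; treat as 0
            total += (b - a) / denom
    return total / len(points)


# Made-up example: two well-separated pairs of points.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score_manual(pts, [0, 0, 1, 1])
bad = silhouette_score_manual(pts, [0, 1, 0, 1])
print(good, bad)
```

A score near 1 means the points sit much closer to their own cluster than to any other, while a negative score (as with the deliberately bad labelling above) means many points are closer to a different cluster.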

czhang03 commented 9 years ago

@mleblanc321 what do you think?

mleblanc321 commented 9 years ago

Show me the written description; that is what I want. I encourage you to use the material I sent, including the ranges of strength of cohesion.

akuisara commented 9 years ago

The scipy documentation is not really helpful, but you might get some help from the fcluster source code, which shows how the clusters are formed: https://github.com/scipy/scipy/blob/v0.15.1/scipy/cluster/hierarchy.py#L1446

czhang03 commented 9 years ago

I have seen the source code. It basically just calls other functions, and if you trace those functions, you will find that they just call other functions in turn...

So it is not that helpful either.

czhang03 commented 9 years ago

If you want me to write about silhouette_score, I can do that, but for flat clusters I am not sure we can really explain them clearly. We could copy and paste the documentation there, but that would not help the user understand it at all.

And the main problem is that silhouette_score has no meaning unless you know how the clusters are divided.
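One way to address this, sketched under assumptions: report the score next to the cluster assignment it was computed from, for each number of clusters the user might pick. The data and the helper `scores_by_k` are invented for illustration; `silhouette_score` here is the scikit-learn function, and `fcluster` with `criterion="maxclust"` stands in for whatever cut of the dendrogram is actually used.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score

# Hypothetical data: six segments with two made-up features.
X = np.array(
    [[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0], [20.0, 0.0]]
)
Z = linkage(X, method="average", metric="euclidean")


def scores_by_k(X, Z, max_k):
    """Return {k: (labels, score)} for each requested cluster count k."""
    results = {}
    for k in range(2, max_k + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        found = len(set(labels))
        # The score is only defined for 2 .. n_samples - 1 clusters.
        if found < 2 or found >= len(X):
            continue
        results[k] = (labels.tolist(), silhouette_score(X, labels, metric="euclidean"))
    return results


results = scores_by_k(X, Z, max_k=5)
for k, (labels, score) in results.items():
    # Show the division and the score together, never the score alone.
    print(k, labels, round(score, 3))
```

Printing the labels beside each score lets the user see which segments ended up together before reading anything into the number.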