Open GoogleCodeExporter opened 8 years ago
Hi,
We currently have support for hierarchical clustering (like WEKA does) within
the S-Space Package, so clustering the neighbors should be possible. However,
I don't have a clear sense of what kind of output you're looking for. Could
you describe a bit more what information you're hoping to see and if possible,
sketch a toy example of the output you'd like to see? I think we can implement
this, but I'd like to make sure we're implementing what you're asking for.
Original comment by David.Ju...@gmail.com
on 7 Mar 2011 at 6:13
The idea is to find "concepts" through clusters of words that are close in
meaning (it is not a new idea, and maybe not a very good one, but I'd like to
see the results!).
Hierarchical clustering seems to be a simple solution just because you already
have a function to get the first neighbor of a word.
Because hierarchical clustering produces a tree, the algorithm has to cut the
tree at a certain level, using a specific metric, to identify the most
interesting agglomerative clusters.
Ex (very simple):
Words: Jules, Verne, french, author, restaurant, tower, Effeil
             |------------/cut/------------|
      |--------------|              |-------------|
   |-----|        |------|       |-----|          |
french author   Jules Verne   Effeil tower   restaurant
Concepts are : "Jules Verne french author", "restaurant at Effeil tower (le
Jules Verne)"
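A minimal sketch of the cut in the example above, assuming the dendrogram is held as nested Python tuples with plain strings as leaves. The `leaves` and `cut` helpers are hypothetical names for illustration and are not part of the S-Space Package. It cuts by depth below the root rather than by a similarity metric, since the example gives no similarity values.

```python
def leaves(node):
    """Return the leaf words under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [word for child in node for word in leaves(child)]


def cut(tree, depth):
    """Return the clusters obtained by cutting `depth` levels below the root."""
    if depth == 0 or isinstance(tree, str):
        return [leaves(tree)]
    return [cluster for child in tree for cluster in cut(child, depth - 1)]


# The toy tree from the example, written as nested tuples.
TREE = (
    (("french", "author"), ("Jules", "Verne")),
    (("Eiffel", "tower"), "restaurant"),
)

for cluster in cut(TREE, 1):
    print(" ".join(cluster))
```

Cutting one level below the root gives the two "concepts" of the example; cutting two levels down gives the word pairs plus "restaurant" on its own.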
Original comment by alain.dh...@gmail.com
on 8 Mar 2011 at 9:41
Ok, I think I see what you're getting at. So you would like to have the
k-nearest-neighbors clustered and then show the dendrogram for those neighbors?
Cutting the dendrogram is a bit trickier since there's a *large* number of ways
to do this, e.g. min-similarity, number of merges, edge similarities, etc. I
think it may be easiest to just report the full dendrogram and somehow encode
the similarities in the output so the user can see what the data looks like.
So would something like this work:
> cluster-neighbors Jules 6
            /- french
     (.71) /\- author
          /
   (.60) /\- Verne
        /
Jules - \
         \/- restaurant
   (.42)  \
           \/- Eiffel
     (.83)  \- tower
I think the vertical alignment might be easier to mechanically reproduce and
also allow us to fit more neighbors on a screen (imagine doing 30 neighbors
horizontally :) )
Does this match what you had in mind?
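As a rough illustration of reporting the full dendrogram with the similarities encoded in the output, here is a sketch that prints an indented tree. It assumes each merge is a `(similarity, left, right)` tuple and each leaf a plain string. The `render` and `cluster_neighbors_text` names and the indented layout are assumptions made for this example, not the S-Space Package's actual output. The similarity values are the ones from the sketch above.

```python
def render(node, indent=0):
    """Return the lines of an indented text dendrogram for one subtree."""
    pad = "  " * indent
    if isinstance(node, str):
        return [pad + "- " + node]
    sim, left, right = node
    lines = [pad + "(%.2f)" % sim]
    lines += render(left, indent + 1)
    lines += render(right, indent + 1)
    return lines


def cluster_neighbors_text(word, subtrees):
    """Return the query word followed by its clustered neighbors."""
    lines = [word]
    for subtree in subtrees:
        lines += render(subtree, 1)
    return "\n".join(lines)


# The two neighbor clusters of "Jules" from the sketch above.
UPPER = (0.60, (0.71, "french", "author"), "Verne")
LOWER = (0.42, "restaurant", (0.83, "Eiffel", "tower"))

print(cluster_neighbors_text("Jules", [UPPER, LOWER]))
```

One line per node keeps the output vertical, so a larger number of neighbors only makes it longer, not wider.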
Original comment by David.Ju...@gmail.com
on 8 Mar 2011 at 7:16
Well if you can also draw ascii art ;-)
I'm not sure that we have the same idea. You are talking about the k-NN
algorithm and then printing the result as a tree (2 branches):
You have to specify one word and the number of clusters
>knn Jules 6
-----------------------------------------------
|                    Jules                    |
| --------------------- --------------------- |
| |       Verne       | |    restaurant     | |
| | ----------------- | | ----------------- | |
| | | french author | | | | Effeil tower  | | |
| | ----------------- | | ----------------- | |
| --------------------- --------------------- |
-----------------------------------------------
Is that correct?
What I have in mind is to build the dendrogram directly: for each word I search
for its nearest neighbor, and then I iterate on the resulting clusters.
I don't specify a word or a number of clusters
>dendogramm
1 iteration (Jules-Verne) (French author) (Effeil tower) (Restaurant)
2 iteration ((Jules-Verne)-(French author)) ((Effeil tower)-(restaurant))
3 iteration (((Jules-Verne)-(French author)) ((Effeil tower)-(restaurant)))
For the first iteration we simply use your neighbor function between two words,
but for the following iterations we have to compute the distance between two
clusters of words ...
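One common way to get a cluster-to-cluster score is average linkage, i.e. the mean pairwise similarity between the words of the two clusters. The sketch below is standard agglomerative clustering with that linkage, so it merges a single pair per step instead of pairing every word in one pass as in the listing above. The `TOY` similarity values and all function names are invented for illustration and do not come from the S-Space Package or from real data.

```python
from itertools import combinations

# Invented similarities for the toy vocabulary; unlisted pairs default to 0.1.
TOY = {
    frozenset(("Jules", "Verne")): 0.9,
    frozenset(("Eiffel", "tower")): 0.85,
    frozenset(("French", "author")): 0.8,
    frozenset(("Verne", "author")): 0.6,
    frozenset(("Jules", "author")): 0.5,
    frozenset(("tower", "restaurant")): 0.5,
    frozenset(("Eiffel", "restaurant")): 0.4,
}


def toy_sim(x, y):
    """Word-to-word similarity (stands in for a real neighbor function)."""
    return TOY.get(frozenset((x, y)), 0.1)


def average_link(a, b, sim):
    """Mean pairwise similarity between the words of clusters a and b."""
    return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))


def agglomerate(words, sim):
    """Merge the two most similar clusters until one is left; return the merges."""
    clusters = [(w,) for w in words]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: average_link(pair[0], pair[1], sim))
        merges.append((a, b, average_link(a, b, sim)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


WORDS = ["Jules", "Verne", "French", "author", "Eiffel", "tower", "restaurant"]
for step, (a, b, score) in enumerate(agglomerate(WORDS, toy_sim), start=1):
    print(step, a, "+", b, "at %.3f" % score)
```

Single linkage (maximum pairwise similarity) or complete linkage (minimum) would slot in by replacing `average_link`; which one suits word neighborhoods best is an open choice here.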
Using your representation:
/-Jules
/\-Vernes
/ /
/\- \/-French
/ \-Author
/
/
\
\
\ /-Effeil
\/-/\-tower
\ \
\/-restaurant
\-
Original comment by alain.dh...@gmail.com
on 9 Mar 2011 at 10:32
Maybe we can use parameters to extract only a part of the tree:
>dendogram 3 jules Verne
Which means:
Get the part of the tree reached from the leaves "jules" and "Verne" by going up 3 levels of ancestor nodes
or
>dendogram jules Verne restaurant
Which means:
Get the part of the tree containing leaves "jules", "Verne" and "restaurant"
And then, if we click on a node, we can see all the relevant documents, etc.
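A possible reading of the two commands above, sketched on a tree of nested tuples: find the smallest subtree that contains all the requested leaves, then optionally climb a number of ancestor levels above it. The `extract` function and its `levels` parameter are hypothetical names, and leaf matching here is exact and case-sensitive, which is an assumption.

```python
def leaves(node):
    """Return the leaf words under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [word for child in node for word in leaves(child)]


def path_to_subtree(tree, wanted):
    """Nodes from the root down to the smallest subtree holding every wanted leaf."""
    path = [tree]
    node = tree
    while not isinstance(node, str):
        holders = [child for child in node if wanted <= set(leaves(child))]
        if not holders:
            break  # the wanted leaves are spread over several children
        node = holders[0]
        path.append(node)
    return path


def extract(tree, words, levels=0):
    """Smallest subtree containing `words`, widened by `levels` ancestors."""
    wanted = set(words)
    if not wanted <= set(leaves(tree)):
        raise ValueError("some requested words are not leaves of the tree")
    path = path_to_subtree(tree, wanted)
    return path[max(0, len(path) - 1 - levels)]


TREE = (
    (("Jules", "Verne"), ("French", "author")),
    (("Eiffel", "tower"), "restaurant"),
)

print(extract(TREE, ["Jules", "Verne"]))                # the (Jules, Verne) pair
print(extract(TREE, ["Jules", "Verne"], levels=1))      # one ancestor higher
print(extract(TREE, ["Jules", "Verne", "restaurant"]))  # the whole tree
```

Asking for more ancestor levels than exist simply returns the whole tree, so the second form of the command is the first one with enough levels to reach a common ancestor.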
Original comment by alain.dh...@gmail.com
on 9 Mar 2011 at 10:58
Oops, you are right, it is "Eiffel" and not "Effeil"! (sorry Gustave ...)
Original comment by alain.dh...@gmail.com
on 9 Mar 2011 at 1:37
Original issue reported on code.google.com by
alain.dh...@gmail.com
on 7 Mar 2011 at 1:43