I'll check this out tonight.
A few questions for the bug filer.
1. Are you using the Project Gutenberg version of that corpus?
2. how is the corpus divided into documents?
3. what is the similarity of the two words in the LSA and RI spaces? Are they
somewhat similar in RI or mostly unrelated?
4. Also, how did you try using SVDLIBC? I'm not sure if we've seen anyone
successfully use it on Windows, so it might be nice to improve our
documentation based on your experiences. I think SVDLIBJ is definitely the
best SVD on Windows at the moment (at least for our approach).
Original comment by David.Ju...@gmail.com
on 28 Jun 2010 at 10:42
Hello,
It is a French corpus: http://abu.cnam.fr/cgi-bin/donner_html?tdm80j2
I just joined the lines of each paragraph, so one line <=> one paragraph.
The similarity is 0.74 (LSA) and 0.29 (RI).
I just compiled SVDLIBC on Vista; it works on its own, but when run from RI or
LSA the process just sleeps...
Original comment by alain.dh...@gmail.com
on 29 Jun 2010 at 6:21
So I did some checking and I think the differences are fundamental to the
algorithms themselves. LSA is creating an associative similarity. Since
Phileas and Fogg frequently show up in the same documents, they appear similar
in their distributions. However, RI is creating attributional similarity:
words whose neighbors are frequently the same end up being similar. From what
I've seen in the data, the differences you are seeing are expected given the
distributions of Phileas and Fogg.
However, RI can be used to approximate the Vector Space Model (VSM), which is
the non-dimensionally-reduced form of LSA. If you were planning to use this
behavior, I believe we have a code patch coming to support this. With this
form of RI, you should see Phileas and Fogg be more similar, just as you do
with LSA. However, this is due to algorithmic differences, rather than bug
fixes. Were you originally planning to use this form of RI? (It was described
this way in the original Kanerva et al. paper.)
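To make the distinction concrete, here is a toy sketch of the two accumulation strategies (this is not S-Space code; the class name, tiny corpus, and parameters are all made up for illustration). The attributional variant sums the random index vectors of a word's neighbors, while the document-based, VSM-like variant sums the random index vectors of the documents the word appears in:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Toy illustration of the two RI variants discussed above; not S-Space code. */
public class RiFlavors {

    static final int DIM = 1000;              // length of each random index vector
    static final Random RAND = new Random(42);

    /** Sparse ternary index vector: a handful of +1/-1 entries, the rest zero. */
    static float[] randomIndexVector() {
        float[] v = new float[DIM];
        for (int i = 0; i < 10; i++)
            v[RAND.nextInt(DIM)] = RAND.nextBoolean() ? 1f : -1f;
        return v;
    }

    static void addTo(float[] target, float[] source) {
        for (int i = 0; i < DIM; i++)
            target[i] += source[i];
    }

    public static void main(String[] args) {
        String[][] documents = {
            {"phileas", "fogg", "travels", "around", "the", "world"},
            {"mr", "fogg", "and", "phileas", "wager", "a", "fortune"},
        };

        Map<String, float[]> wordIndex = new HashMap<>();      // per-word index vectors
        Map<String, float[]> attributional = new HashMap<>();  // word-context (attributional) RI
        Map<String, float[]> associative = new HashMap<>();    // document-based (associative) RI

        for (String[] doc : documents) {
            float[] docIndex = randomIndexVector();             // one index vector per document
            for (int i = 0; i < doc.length; i++) {
                String word = doc[i];
                // Associative, VSM-like variant: sum the index vectors of the
                // documents the word occurs in (co-occurring words become similar).
                addTo(associative.computeIfAbsent(word, w -> new float[DIM]), docIndex);
                // Attributional variant: sum the index vectors of the word's
                // neighbors in a +/-2 window (words sharing contexts become similar).
                float[] contextVec = attributional.computeIfAbsent(word, w -> new float[DIM]);
                for (int j = Math.max(0, i - 2); j <= Math.min(doc.length - 1, i + 2); j++) {
                    if (j != i)
                        addTo(contextVec, wordIndex.computeIfAbsent(doc[j], w -> randomIndexVector()));
                }
            }
        }
        // "phileas" and "fogg" share whole-document index vectors in the associative
        // space (high cosine, like LSA), but are only similar in the attributional
        // space to the extent that they appear next to the same neighboring words.
    }
}
```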
Original comment by David.Ju...@gmail.com
on 29 Jun 2010 at 7:39
Thank you for your explanations (I think I need to read the RI theory more
seriously...). I used RI just because I thought it would give much quicker
results on a large corpus, but in my case LSA seems to be sufficient.
Some questions:
1- Is it possible to filter using only stop words? (I get an error when I just
use --tokenFilter=stopwords.txt=exclude.)
2- For LSA, is it possible to adjust the matrix using tf-idf?
3- Is it possible to simply extract clusters of neighbors from the semantic space?
Original comment by alain.dh...@gmail.com
on 29 Jun 2010 at 10:00
1. Yes, I think you're using the old syntax for filtering. It should be
--tokenFilter exclude=stopwords.txt. (I checked and it looks like the LSA
wiki is out of date on that, which I'll fix right now.)
2. Yes, add -p edu.ucla.sspace.matrix.TfIdfTransform, which should change the
transformation from log-entropy to TF-IDF.
3. Yes, if you're using Java, you'll want to use
edu.ucla.sspace.common.WordComparator, which will find the nearest neighbors
for any word. You can run this over all the words in a SemanticSpace to
build the clusters. Does that meet your needs, or did you want a different
kind of clustering?
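For item 3, the gist of that nearest-neighbor lookup is just a cosine ranking over the word vectors. Here is a minimal standalone sketch of that idea (the vectors and class name are made up; this is not the WordComparator API itself, so check the javadocs for the real method signatures):

```java
import java.util.*;

/** Standalone sketch of a nearest-neighbor search over word vectors. */
public class NearestNeighbors {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Returns the k words whose vectors are most similar to the target's. */
    static List<String> mostSimilar(String target, Map<String, double[]> space, int k) {
        double[] targetVec = space.get(target);
        return space.keySet().stream()
            .filter(w -> !w.equals(target))
            .sorted(Comparator.comparingDouble(
                (String w) -> cosine(targetVec, space.get(w))).reversed())
            .limit(k)
            .collect(java.util.stream.Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical 3-dimensional "semantic space" just for the example.
        Map<String, double[]> space = new HashMap<>();
        space.put("phileas", new double[] {0.9, 0.1, 0.0});
        space.put("fogg",    new double[] {0.8, 0.2, 0.1});
        space.put("aouda",   new double[] {0.1, 0.9, 0.2});

        System.out.println(mostSimilar("phileas", space, 2));  // [fogg, aouda]
    }
}
```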
Original comment by David.Ju...@gmail.com
on 29 Jun 2010 at 4:58
Hi,
1- Ok.
2- Ok (I will try).
3- I think WordComparator is what you get when you use "get-neighbors ..." from
the semantic space interpreter? For clustering, I will try the Weka hierarchical
clustering package on the semantic space.
About RI on Windows, I had some problems: building the semantic space ends
normally, but when I try to load the semantic space it sometimes freezes. This
behaviour seems to depend on the options (dimension size, permutation option, ...);
it is not very clear.
For SVDLIBC, I used MinGW to compile it (you have to change some includes and
the Makefile; maybe with Cygwin it works as-is). I changed PATH to include svd.exe
and the library, but when it starts building the RI semantic space it always
freezes...
Original comment by alain.dh...@gmail.com
on 30 Jun 2010 at 9:23
Ok, for clustering, you really are looking for a tree-like clustering then? I
think we have something like that in our clustering package, but it's not
immediately transferable without a bit of work. I'll look into exposing the
required functionality.
Can you post the stack traces for the errors you see? Also, which
configurations freeze for you? RI has been very stable for us, so it's a bit
concerning that you see it freezing. If you feel motivated, you can attach to
the process with jConsole when it freezes to see what it's blocking on. That
would *really* help us track down what the issue might be.
Also, RI shouldn't need to use svd.exe, so I don't think that would be the cause.
It's good to know that MinGW can compile SVDLIBC. I'll add that to the docs
somewhere.
Original comment by David.Ju...@gmail.com
on 30 Jun 2010 at 7:48
1- Hierarchical clustering (http://en.wikipedia.org/wiki/Dendrogram).
I want to extract groups of neighbors, e.g. (Phileas, Fogg, Mr, ...), (Aouda, Mrs,
...), by cutting the hierarchical tree at the right place (roughly as in the
sketch after this list). I saw a Weka package, but if you have a package...
2- For the SpaceExplorer CPU loop using RI, it works with 2048 dimensions but not
with 4096; maybe it is the size of the .sspace file (300 MB). (See attached file.)
3- For SVDLIBC, the freeze was for the LSA method (sorry, not RI!). (See attached file.)
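To make the clustering idea concrete, here is a minimal standalone sketch (made-up vectors and names, not the S-Space clustering package and not Weka) of single-link agglomerative clustering where "cutting the tree" is expressed as a minimum-similarity threshold:

```java
import java.util.*;

/**
 * Toy single-link agglomerative clustering over word vectors, with the
 * dendrogram "cut" expressed as a minimum-similarity threshold: merging
 * stops once the two closest clusters are less similar than the threshold.
 */
public class ToyHierarchicalClustering {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Single-link similarity: best similarity between any pair of members. */
    static double clusterSim(List<String> c1, List<String> c2, Map<String, double[]> space) {
        double best = -1;
        for (String w1 : c1)
            for (String w2 : c2)
                best = Math.max(best, cosine(space.get(w1), space.get(w2)));
        return best;
    }

    static List<List<String>> cluster(Map<String, double[]> space, double cutThreshold) {
        // Start with every word in its own cluster.
        List<List<String>> clusters = new ArrayList<>();
        for (String w : space.keySet())
            clusters.add(new ArrayList<>(Collections.singletonList(w)));

        while (clusters.size() > 1) {
            int bestI = -1, bestJ = -1;
            double bestSim = -1;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double sim = clusterSim(clusters.get(i), clusters.get(j), space);
                    if (sim > bestSim) { bestSim = sim; bestI = i; bestJ = j; }
                }
            if (bestSim < cutThreshold)
                break;  // the "cut": no remaining pair is similar enough to merge
            clusters.get(bestI).addAll(clusters.remove(bestJ));
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<String, double[]> space = new HashMap<>();
        space.put("phileas", new double[] {0.9, 0.1});
        space.put("fogg",    new double[] {0.8, 0.2});
        space.put("aouda",   new double[] {0.1, 0.9});
        space.put("mrs",     new double[] {0.2, 0.8});

        // Expect two groups: [phileas, fogg] and [aouda, mrs].
        System.out.println(cluster(space, 0.9));
    }
}
```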
Original comment by alain.dh...@gmail.com
on 30 Jun 2010 at 10:30
2- OK, I think I understand the problem with the CPU loop when loading a semantic
space built with the RI algorithm:
With 4000 dimensions and --usePermutation, the output.sspace size is 10 MB, but
when I remove the --usePermutation option the output.sspace size increases to
300 MB and the semantic space interpreter cannot load it!
Without the --usePermutation option I have to decrease the vector size to 1000,
and then output.sspace decreases to 70 MB.
I am not sure about this --usePermutation option; I think it increases the
weight of the nearest words?
I also now understand the difference between LSA and RI using random word
vectors: RI perhaps gives you synonyms or words of the same family...
For performance, I think RI using random document vectors (described in the
Kanerva et al. paper) to approximate LSA should also be implemented.
Original comment by alain.dh...@gmail.com
on 1 Jul 2010 at 9:10
Closing this issue as it's been addressed.
Original comment by FozzietheBeat@gmail.com
on 24 Feb 2011 at 8:43
Original issue reported on code.google.com by
alain.dh...@gmail.com
on 28 Jun 2010 at 10:36