buliugu / airhead-research

Automatically exported from code.google.com/p/airhead-research

differences between LSA and RI #55

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. French corpus: "Le tour du monde en quatre vingt jours" (Around the World in Eighty Days) by Jules Verne
2. Observing the co-occurrence of two words: "Phileas" and "Fogg"
3. Using the LSA and RI algorithms to create a semantic space

What is the expected output? What do you see instead?
LSA + SVDLIBJ: "Phileas" and "Fogg" are neighbors.
RI: they are not. This result seems very curious.

What version of the product are you using? On what operating system?
Windows Vista

Please provide any additional information below.
For LSA, I use SVDLIBJ because my SVDLIBC version fails (nothing happens ...).

Original issue reported on code.google.com by alain.dh...@gmail.com on 28 Jun 2010 at 10:36

GoogleCodeExporter commented 9 years ago
I'll check this out tonight.  

A few questions for the bug filer.

1. Are you using the Project Gutenberg version of that corpus?

2. How is the corpus divided into documents?

3. What is the similarity of the two words in the LSA and RI spaces?  Are they somewhat similar in RI or mostly unrelated?

4. Also, how did you try using SVDLIBC?  I'm not sure we've seen anyone successfully use it on Windows, so it might be nice to improve our documentation based on your experiences.  I think SVDLIBJ is definitely the best SVD on Windows at the moment (at least for our approach).

Original comment by David.Ju...@gmail.com on 28 Jun 2010 at 10:42

GoogleCodeExporter commented 9 years ago
Hello,

It is a French corpus: http://abu.cnam.fr/cgi-bin/donner_html?tdm80j2

I just joined the lines of each paragraph, so one line <=> one paragraph.

The similarity is 0.74 (LSA) and 0.29 (RI).

I just compiled SVDLIBC on Vista; it works on its own, but when run with RI or LSA the process just sits there ...

Original comment by alain.dh...@gmail.com on 29 Jun 2010 at 6:21

GoogleCodeExporter commented 9 years ago
So I did some checking and I think the differences are fundamental to the 
algorithms themselves.  LSA is creating an associative similarity.  Since 
Phileas and Fogg frequently show up in the same documents, they appear similar 
in their distributions.  However, RI is creating attributional similarity: 
words whose neighbors are frequently the same end up being similar.  From what 
I've seen in the data, the differences you are seeing are expected given the 
distributions of Phileas and Fogg.
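
To make the distinction concrete, here is a toy sketch in plain Java (my own illustration, not S-Space code): the "associative" view only asks whether two words show up in the same documents, while the "attributional" view asks whether the words around them tend to be the same.

import java.util.*;

public class SimilarityKinds {

    // Toy corpus: one String[] per "document".
    static final List<String[]> DOCS = Arrays.asList(
            "phileas fogg left london".split(" "),
            "passepartout served phileas fogg".split(" "),
            "aouda travelled with fogg".split(" "));

    public static void main(String[] args) {
        // Associative (LSA-like): Phileas and Fogg co-occur in many documents.
        int sharedDocs = 0;
        for (String[] doc : DOCS) {
            List<String> d = Arrays.asList(doc);
            if (d.contains("phileas") && d.contains("fogg"))
                sharedDocs++;
        }
        System.out.println("documents containing both: " + sharedDocs);

        // Attributional (RI-like, +/- 1 word window): Phileas and Fogg share
        // hardly any context words, because "Phileas" is almost always
        // followed by "Fogg" while "Fogg" is followed by verbs.
        Set<String> common = neighbors("phileas");
        common.retainAll(neighbors("fogg"));
        System.out.println("shared context words: " + common);
    }

    // Collects the words appearing directly before or after the target.
    static Set<String> neighbors(String target) {
        Set<String> ctx = new HashSet<>();
        for (String[] doc : DOCS)
            for (int i = 0; i < doc.length; i++)
                if (doc[i].equals(target)) {
                    if (i > 0) ctx.add(doc[i - 1]);
                    if (i + 1 < doc.length) ctx.add(doc[i + 1]);
                }
        return ctx;
    }
}

On this toy corpus the program reports 2 shared documents but an empty set of shared context words, which is the same pattern as the 0.74 vs 0.29 scores above.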

However, RI can be used to approximate the Vector Space Model (VSM), which is 
the non-dimensionally-reduced form of LSA.  If you were planning to use this 
behavior, I believe we have a code patch coming to support this.  With this 
form of RI, you should see Phileas and Fogg be more similar, just as you do 
with LSA.  However, this is due to algorithmic differences, rather than bug 
fixes.  Were you originally planning to use this form of RI?  (It was described 
this way in the original Kanerva et al. paper.)
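
For reference, here is a toy sketch of that document-vector flavour of RI (my own reading of the Kanerva-style approach, not the S-Space patch mentioned above): each document gets a sparse random index vector, and a word's vector is the sum of the index vectors of the documents it occurs in, i.e. a random projection of the term-document matrix that the VSM and LSA start from.

import java.util.*;

public class DocVectorRandomIndexing {

    static final int DIM = 1000;     // reduced dimensionality
    static final int NON_ZERO = 6;   // number of +1/-1 entries per index vector
    static final Random RAND = new Random(42);

    // A sparse ternary index vector, assigned to each document.
    static int[] randomIndexVector() {
        int[] v = new int[DIM];
        for (int k = 0; k < NON_ZERO; k++)
            v[RAND.nextInt(DIM)] = RAND.nextBoolean() ? 1 : -1;
        return v;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                "phileas fogg left london".split(" "),
                "passepartout served phileas fogg".split(" "));

        // word vector = sum of the index vectors of the documents it occurs in
        Map<String, double[]> wordVectors = new HashMap<>();
        for (String[] doc : docs) {
            int[] docIndex = randomIndexVector();
            for (String word : doc) {
                double[] wv = wordVectors.computeIfAbsent(word, w -> new double[DIM]);
                for (int i = 0; i < DIM; i++)
                    wv[i] += docIndex[i];
            }
        }

        // Phileas and Fogg occur in exactly the same documents, so their
        // vectors coincide and the cosine comes out at 1.0, mirroring the
        // associative behaviour seen with LSA.
        double[] p = wordVectors.get("phileas"), f = wordVectors.get("fogg");
        double dot = 0, np = 0, nf = 0;
        for (int i = 0; i < DIM; i++) {
            dot += p[i] * f[i];
            np += p[i] * p[i];
            nf += f[i] * f[i];
        }
        System.out.println("cosine(phileas, fogg) = " + dot / Math.sqrt(np * nf));
    }
}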

Original comment by David.Ju...@gmail.com on 29 Jun 2010 at 7:39

GoogleCodeExporter commented 9 years ago
Thank you for your explanations (I think I have to read the RI theory more seriously ...). I used RI just because I thought it gives much quicker results on a large corpus, but in my case LSA seems to be sufficient.

Some questions:
1- Is it possible to filter out stop words? (I get an error when I just use --tokenFilter=stopwords.txt=exclude.)
2- For LSA, is it possible to weight the matrix using tf-idf?
3- Is it possible to simply extract clusters of neighbors from the semantic space?

Original comment by alain.dh...@gmail.com on 29 Jun 2010 at 10:00

GoogleCodeExporter commented 9 years ago
1.  Yes, I think you're using the old syntax for filtering.  It should be 
--tokenFilter exclude=stopwords.txt .  (I checked and it looks like the LSA 
wiki is out of date for that, which I'll fix right now.)

2.  Yes, add -p edu.ucla.sspace.matrix.TfIdfTransform, which should change the transformation from log-entropy to TF-IDF.
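
Putting 1 and 2 together, an LSA run would look roughly like the command below. The main class name, the --docFile flag and the trailing output argument are written from memory, so double-check them against the --help output of the jar you are using; the two options discussed above are the parts confirmed here.

java -cp sspace.jar edu.ucla.sspace.mains.LSAMain \
     --docFile corpus.txt \
     --tokenFilter exclude=stopwords.txt \
     -p edu.ucla.sspace.matrix.TfIdfTransform \
     lsa-output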

3.  Yes, if you're using Java, you'll want to use 
edu.ucla.sspace.common.WordComparator, which will find the nearest neighbors 
for any word.  You can run this over all the words in a SemanticSpace to 
build the clusters.  Does that meet your needs, or did you want a different 
kind of clustering?
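
If it helps, here is a bare-bones sketch of what that amounts to; WordComparator wraps essentially this loop more efficiently. I'm writing the SemanticSpaceIO / Similarity calls from memory, so double-check the exact signatures against the javadoc.

import java.util.*;

import edu.ucla.sspace.common.SemanticSpace;
import edu.ucla.sspace.common.SemanticSpaceIO;
import edu.ucla.sspace.common.Similarity;

public class Neighbors {

    public static void main(String[] args) throws Exception {
        // Load a previously built .sspace file and rank every other word by
        // cosine similarity to the query word.
        SemanticSpace sspace = SemanticSpaceIO.load(args[0]);   // e.g. lsa-output.sspace
        String query = args[1];                                 // e.g. "phileas"

        // Highest similarity first; ties overwrite each other in this quick sketch.
        SortedMap<Double, String> ranked = new TreeMap<>(Collections.reverseOrder());
        for (String word : sspace.getWords()) {
            if (word.equals(query))
                continue;
            double sim = Similarity.cosineSimilarity(
                    sspace.getVector(query), sspace.getVector(word));
            ranked.put(sim, word);
        }

        // Print the 10 nearest neighbors.
        int printed = 0;
        for (Map.Entry<Double, String> e : ranked.entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
            if (++printed == 10)
                break;
        }
    }
}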

Original comment by David.Ju...@gmail.com on 29 Jun 2010 at 4:58

GoogleCodeExporter commented 9 years ago
Hi,
1- Ok. 
2- Ok (I will try).
3- I think WordComparator is what you get when you use "get-neighbors .." from the Semantic Space interpreter? For clustering, I will try the Weka hierarchical clustering package on the semantic space (a rough sketch of that route is shown below).
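
A rough sketch of that Weka route, assuming the word vectors have already been pulled out of the .sspace file into plain double[] arrays (the HierarchicalClusterer option string is the part to double-check against the Weka docs):

import java.util.*;

import weka.clusterers.HierarchicalClusterer;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.Utils;

public class ClusterWords {

    // words -> their semantic-space vectors (extracted elsewhere)
    public static void cluster(Map<String, double[]> vectors, int numClusters) throws Exception {
        int dim = vectors.values().iterator().next().length;

        // Build a Weka dataset with one numeric attribute per dimension.
        ArrayList<Attribute> attrs = new ArrayList<>();
        for (int i = 0; i < dim; i++)
            attrs.add(new Attribute("d" + i));
        Instances data = new Instances("sspace", attrs, vectors.size());

        List<String> words = new ArrayList<>(vectors.keySet());
        for (String w : words)
            data.add(new DenseInstance(1.0, vectors.get(w)));

        // Agglomerative clustering, cut into numClusters groups.
        HierarchicalClusterer hc = new HierarchicalClusterer();
        hc.setOptions(Utils.splitOptions("-N " + numClusters + " -L AVERAGE"));
        hc.buildClusterer(data);

        // Group the words by their assigned cluster, e.g. (Phileas, Fogg, ...).
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (int i = 0; i < data.numInstances(); i++) {
            int c = hc.clusterInstance(data.instance(i));
            groups.computeIfAbsent(c, k -> new ArrayList<>()).add(words.get(i));
        }
        for (Map.Entry<Integer, List<String>> g : groups.entrySet())
            System.out.println("cluster " + g.getKey() + ": " + g.getValue());
    }
}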

About RI on Windows, I had some problems: building the semantic space ends normally, but when I try to load the semantic space it sometimes freezes. This behaviour seems to depend on the options (dimension size, permutation option ...); it's not very clear.

For SVDLIBC, I used MinGW to compile it (you have to change some includes and the Makefile; maybe with Cygwin it works as is). I changed PATH to include svd.exe and the library, but when starting to build the RI semantic space it always freezes ...

Original comment by alain.dh...@gmail.com on 30 Jun 2010 at 9:23

GoogleCodeExporter commented 9 years ago
Ok, for clustering, you really are looking for a tree-like clustering then?  I think we have something like that in our clustering package, but it's not immediately usable without a bit of work.  I'll look into exposing the required functionality.

Can you post the stack traces for the errors you see?  Also, which configurations freeze for you?  RI has been very stable for us, so it's a bit concerning that you see it freezing.  If you feel motivated, you can attach to the process with jConsole when it freezes to see what it's blocking on.  That would *really* help us track down what the issue might be.

Also, RI shouldn't need to use svd.exe, so I don't think that would be the cause.  It's good to know that MinGW can compile SVDLIBC.  I'll add that to the docs somewhere.

Original comment by David.Ju...@gmail.com on 30 Jun 2010 at 7:48

GoogleCodeExporter commented 9 years ago
1- Hierarchical clustering (http://en.wikipedia.org/wiki/Dendrogram). 
I want to extract groups of neighbors, e.g. (Phileas, Fogg, Mr, ...), (Aouda, Mrs, ...), by cutting the hierarchical tree at the right place. I saw a Weka package, but if you have a package ....

2- For the SpaceExplorer CPU loop using RI: it works with 2048 dimensions but not with 4096; maybe it is the size of the .sspace file (300M). (See attached file.)

3- For SVDLIBC, the freeze was with the LSA method (sorry, not RI!). (See attached file.)

Original comment by alain.dh...@gmail.com on 30 Jun 2010 at 10:30


GoogleCodeExporter commented 9 years ago
2- Ok, I think I understand the problem with the CPU loop when loading a semantic space built with the RI algorithm: with 4000 dimensions and --usePermutation, the output.sspace size is 10M, but when I drop the --usePermutation option, the output.sspace size increases to 300M and the semantic space interpreter cannot load it! Without the --usePermutation option I have to decrease the vector size to 1000, and then output.sspace decreases to 70M.

I am not sure about this --usePermutation option; I think it increases the weight of the nearest words?

Also, I now understand the difference between LSA and RI using random word vectors: RI gives you synonyms or words of the same family, maybe ...

For performance, I think RI using random document vectors (described in the Kanerva et al. paper) to approximate LSA should also be implemented.

Original comment by alain.dh...@gmail.com on 1 Jul 2010 at 9:10

GoogleCodeExporter commented 9 years ago
Closing this issue as it's been addressed.

Original comment by FozzietheBeat@gmail.com on 24 Feb 2011 at 8:43