fozziethebeat / S-Space

The S-Space repository, from the AIrhead-Research group
GNU General Public License v2.0

Distributional semantics using contexts rather than documents #50

Open jiangfeng1124 opened 10 years ago

jiangfeng1124 commented 10 years ago

Dear developers,

I found that VsmMain computes the word-document matrix, i.e., the co-occurrences of words with documents. Could I instead generate distributional representations from the contexts within a fixed-size window (say, 10), and use PMI rather than tf-idf as the elements of the word-context matrix?
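For reference, the model this describes, co-occurrence counts within a symmetric window re-weighted with PMI, can be sketched as follows. This is an illustrative outline in Python, not S-Space code:

```python
# Sketch (not S-Space code): build word-context co-occurrence counts with a
# symmetric window, then re-weight the raw counts with PMI.
from collections import Counter
import math

def cooccurrence_counts(tokens, window=10):
    """Count (word, context) pairs within `window` tokens on either side."""
    counts = Counter()
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

def pmi_weight(counts):
    """Replace raw counts with PMI(w, c) = log( P(w,c) / (P(w) * P(c)) )."""
    total = sum(counts.values())
    w_marg = Counter()
    c_marg = Counter()
    for (w, c), n in counts.items():
        w_marg[w] += n
        c_marg[c] += n
    return {
        (w, c): math.log((n / total) / ((w_marg[w] / total) * (c_marg[c] / total)))
        for (w, c), n in counts.items()
    }
```

A real pipeline would also filter stop words and clip negative PMI values (positive PMI is the common choice), but the core computation is the above.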

Thanks, Jiang

davidjurgens commented 10 years ago

Hi Jiang,

You'll want to use the GwsMain class, which uses the GenericWordSpace class, instead of the VsmMain class. I'm not sure we have out-of-the-box support for PMI, though.

Thanks, David


jiangfeng1124 commented 10 years ago

GwsMain looks good, thank you. However, I found a problem and I am not sure whether it is a bug: This is what I get when running GwsMain:

Command:

java edu.ucla.sspace.mains.GwsMain -d data/wiki.sample data/output-sample/ -t 6 -o sparse_text -F "include=data/wiki_vocab_sample.lst;exclude=data/english-stop-words-large.txt"

What I get:

...
|0,994.0,1,2457.0,2,796.0,3,19110.0,4,1510.0,5,1990.0,6,1256.0,7,18830.0,...
...

It seems that a representation for an empty word is generated. Could you help check this?
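For anyone inspecting such output, a small checker for these rows can be sketched as below. The format is assumed from the snippet above: each sparse_text line appears to be `word|index,value,index,value,...`, so an empty-word row is one whose text before the `|` is empty.

```python
# Sketch for sanity-checking sparse_text output lines. The line format
# ("word|index,value,index,value,...") is inferred from the snippet above,
# not taken from S-Space documentation.
def parse_sparse_line(line):
    """Split a line into (word, {index: value}) pairs."""
    word, _, rest = line.partition("|")
    fields = rest.split(",")
    vector = {int(fields[i]): float(fields[i + 1])
              for i in range(0, len(fields) - 1, 2)}
    return word, vector

def empty_word_rows(lines):
    """Return the vectors of all rows whose word is the empty string."""
    return [vec for word, vec in map(parse_sparse_line, lines) if word == ""]
```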

Thanks, Jiang

davidjurgens commented 10 years ago

Hi Jiang,

Yes, this looks like a bug. The boolean logic that filters this case was missing parentheses, so an internal filtering token escaped into the output and produced the empty-word vector you found. I've fixed the issue in the latest commit and pushed it to trunk. Thanks for reporting it!
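As a schematic illustration of this class of bug (not the actual S-Space code): `and`/`&&` binds tighter than `or`/`||` in most languages, so dropping parentheses silently changes which tokens pass a filter:

```python
# Schematic illustration only -- the predicate names and the exact condition
# are invented; the point is the operator-precedence pitfall.

def buggy_accept(token, in_include, in_exclude):
    # Parses as (token != "" and in_include) or (not in_exclude),
    # so an empty token is accepted whenever it is not excluded.
    return token != "" and in_include or not in_exclude

def fixed_accept(token, in_include, in_exclude):
    # Intended grouping: a non-empty token that is included or at
    # least not excluded.
    return token != "" and (in_include or not in_exclude)
```

With an empty token that appears in neither list, the buggy version returns `True` and lets it into the output, while the parenthesized version rejects it.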

Thanks, David


jiangfeng1124 commented 10 years ago

Hi David,

I would like to ask a little more. I realized that the GwsMain class outputs raw counts in the context vectors. Could those results be further processed by LSA or another dimensionality-reduction algorithm, so that I can get a low-dimensional representation?

Thanks, Jiang

davidjurgens commented 10 years ago

Hi Jiang,

So you want to take the output of GwsMain and then use that as input to LSA? It might be easier to just run GwsMain and then LSAMain on the same dataset, though note that Gws is a term-by-term algorithm, so it would interpret the context differently than LSA does.

If all you want to do is run SVD on the GWS data, that's currently not supported, but I could probably put it in fairly quickly too. :)
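As a rough sketch of that idea (not an existing S-Space feature): given a word-by-context count matrix, a truncated SVD yields k-dimensional representations, as in LSA. The matrix below is random stand-in data, not real GwsMain output:

```python
# Sketch: reduce a word-by-context count matrix to k dimensions with a
# truncated SVD, as LSA does. The input here is random stand-in data.
import numpy as np

def truncated_svd_embed(counts, k):
    """counts: (n_words, n_contexts) array; returns (n_words, k) embeddings."""
    U, S, Vt = np.linalg.svd(counts, full_matrices=False)
    # Scale the top-k left singular vectors by their singular values.
    return U[:, :k] * S[:k]

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(50, 200)).astype(float)  # stand-in for raw counts
emb = truncated_svd_embed(X, k=10)
```

In practice the counts matrix from a large corpus would be sparse, so a sparse truncated SVD (e.g. a Lanczos-style solver) would be the better fit than the dense `numpy` routine shown here.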

Thanks, David
