Open jiangfeng1124 opened 10 years ago
Hi Jiang,
You'll want to use the GwsMain class, which uses the GenericWordSpace class, instead of the VsmMain class. I'm not sure whether we have out-of-the-box support for PMI, though.
Thanks, David
On Wed, Apr 2, 2014 at 5:42 AM, jiangfeng notifications@github.com wrote:
Dear developers,
I found that the VsmMain class computes the word-document matrix, which captures the co-occurrences of words and documents. Could I instead generate distributional representations using contexts within a window of a certain size (say, 10), and use PMI rather than tf-idf as the entries of the word-context matrix?
Thanks, Jiang
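As David notes, PMI may not be supported out of the box, but the raw word-by-context co-occurrence counts can be re-weighted after the fact. Below is a minimal, generic sketch in plain Java (illustrative names, not an S-Space API) that converts a dense count matrix into positive PMI weights; for a real vocabulary a sparse representation would be preferable, but the arithmetic is the same.

    // Generic sketch: convert raw word-by-context co-occurrence counts into
    // PMI weights.  Illustrative code only, not part of S-Space; the counts
    // would come from whatever output GwsMain produced.
    public class PmiSketch {
        public static double[][] pmi(double[][] counts) {
            int rows = counts.length, cols = counts[0].length;
            double total = 0.0;
            double[] rowSums = new double[rows];
            double[] colSums = new double[cols];
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols; j++) {
                    rowSums[i] += counts[i][j];
                    colSums[j] += counts[i][j];
                    total += counts[i][j];
                }
            }
            double[][] weights = new double[rows][cols];
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols; j++) {
                    if (counts[i][j] > 0) {
                        // pmi(w, c) = log( p(w, c) / (p(w) * p(c)) )
                        double val = Math.log((counts[i][j] * total)
                                              / (rowSums[i] * colSums[j]));
                        // Clamping negatives to zero gives positive PMI (PPMI),
                        // a common choice for word-context matrices.
                        weights[i][j] = Math.max(0.0, val);
                    }
                }
            }
            return weights;
        }
    }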
GwsMain looks good, thank you.
However, I found a problem, and I am not sure whether it is a bug. Here is what happens when I run GwsMain:
Command:
java edu.ucla.sspace.mains.GwsMain -d data/wiki.sample data/output-sample/ -t 6 -o sparse_text -F include=data/wiki_vocab_sample.lst;exclude=data/english-stop-words-large.txt
What I get:
... |0,994.0,1,2457.0,2,796.0,3,19110.0,4,1510.0,5,1990.0,6,1256.0,7,18830.0,... ...
It seems that a representation for an empty word is being generated. Could you help check this?
Thanks, Jiang
Hi Jiang,
Yes, this looks like a bug. The boolean logic that filters this case was missing parentheses in the code, so the vector you found appeared because a token that should have been filtered internally escaped into the output. I've fixed the issue in the latest commit and pushed it to trunk. Thanks for reporting it!
Thanks, David
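For context on the fix David describes: in Java, && binds more tightly than ||, so a filter condition written without parentheses can accept a token it was meant to drop. The snippet below is a hypothetical illustration with made-up predicate names, not the actual S-Space source, showing how an empty token can slip through the ungrouped condition.

    public class FilterPrecedenceExample {
        // Trivial stand-ins so the example runs; the real checks would consult
        // the include/exclude files passed with -F.
        static boolean inIncludeList(String token) { return true; }
        static boolean inExcludeList(String token) { return false; }

        // Buggy: parsed as  inIncludeList(t) || (!inExcludeList(t) && !t.isEmpty()),
        // so an empty token can still be accepted.
        static boolean acceptBuggy(String t) {
            return inIncludeList(t) || !inExcludeList(t) && !t.isEmpty();
        }

        // Fixed: the emptiness check now applies to both branches.
        static boolean acceptFixed(String t) {
            return (inIncludeList(t) || !inExcludeList(t)) && !t.isEmpty();
        }

        public static void main(String[] args) {
            System.out.println(acceptBuggy(""));  // true  -- the empty token slips through
            System.out.println(acceptFixed(""));  // false
        }
    }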
Hi David,
I would like to ask a little more. I realized that the GwsMain class outputs the raw counts in the context vectors. I am wondering whether the results could be further processed by LSA or another dimensionality reduction algorithm, so that I can get low-dimensional representations.
Thanks, Jiang
Hi Jiang,
So you want to take the output of GwsMain and then use that as input to LSA? It might be easier to just run GwsMain and then LSAMain on the same dataset, though I think GWS is a term-by-term algorithm, so it would interpret the contexts differently than LSA does.
If all you want to do is run SVD on the GWS data, that's currently not supported, but I could probably put it in fairly quickly too. :)
Thanks, David
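Until SVD over the GWS output is supported directly, the reduction can be done outside S-Space. Here is a rough sketch using Apache Commons Math's SingularValueDecomposition (an assumed dependency, not something GwsMain provides); the word-by-context count or PMI matrix would first have to be loaded from the sparse_text output.

    import org.apache.commons.math3.linear.Array2DRowRealMatrix;
    import org.apache.commons.math3.linear.RealMatrix;
    import org.apache.commons.math3.linear.SingularValueDecomposition;

    public class GwsSvdSketch {
        // Reduce a word-by-context count (or PMI) matrix to k dimensions.
        // Rows of the returned matrix are the k-dimensional word vectors.
        public static RealMatrix reduce(double[][] matrix, int k) {
            SingularValueDecomposition svd =
                new SingularValueDecomposition(new Array2DRowRealMatrix(matrix));
            // Keep the first k left singular vectors, scaled by the singular
            // values, which mirrors how LSA forms its reduced word space.
            RealMatrix u = svd.getU().getSubMatrix(0, matrix.length - 1, 0, k - 1);
            RealMatrix s = svd.getS().getSubMatrix(0, k - 1, 0, k - 1);
            return u.multiply(s);
        }
    }

A dense SVD like this will not scale to a large vocabulary, so for real data a sparse, truncated SVD implementation would be preferable.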