askerlee / topicvec

Short text #7

Closed bwang482 closed 7 years ago

bwang482 commented 7 years ago

Hi askerlee, thanks for your great work!

Would this work on short text like tweets? If so, what parameters should I change?

Thanks.

askerlee commented 7 years ago

Yeah it works on short text. You could start with csv2topic.py and tailor it for your needs. Csv2topic.py could read a lot of tweets stored in a csv file, each row containing a tweet.

bwang482 commented 7 years ago

Thanks for your speedy reply!!

I am not sure I fully understand your parameters. How do I create the unigram probs file?

Is the existing embedding file a subset of, for example, glove embedding file that contains only the words in the corpus? What is the existing residual file of core words?

What exactly is magnitude of topic embeddings?

Sorry for asking many questions. :)

askerlee commented 7 years ago

Thanks for asking these questions. These points are indeed not fully explained in the papers or on GitHub, so I'm happy to have the chance to explain them here.

How do I create the unigram probs file?

The unigram prob file is created using psdvec/gramcount.pl. It's written in Perl with inline C++ for speed-up. Sorry for mixing Perl with Python. It's just for historical reasons. The inline C++ should compile easily under Linux or using Strawberry Perl under Windows, but it is not compatible with the popular Windows Perl distribution ActivePerl.

Is the existing embedding file a subset of, for example, glove embedding file that contains only the words in the corpus

For the existing embedding file, do you refer to 25000-180000-*? It's generated using PSDVec (my own MF-based embedding method). See https://github.com/askerlee/topicvec/tree/master/psdvec for more details.

What is the existing residual file of core words?

The residual file is not saved or loaded. Although the residuals appear in equations, they are actually constants. So whatever their values are, the optimization algorithm and the resulting topic vectors are not impacted.

What exactly is magnitude of topic embeddings?

It is the Frobenius norm, which for a vector is the same as the Euclidean (L2) norm. Geometrically, it measures how long the vector is, hence the "magnitude".
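To make the definition concrete, here is a minimal sketch of computing a vector's magnitude. The function name `magnitude` is made up for illustration and is not taken from the topicvec code.

```python
import math

def magnitude(vec):
    """Euclidean (L2) norm of a vector.

    For a 1-D vector this is the same number as the Frobenius norm.
    Hypothetical helper, not part of topicvec.
    """
    return math.sqrt(sum(x * x for x in vec))

topic = [3.0, 4.0]
print(magnitude(topic))  # 5.0
```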

bwang482 commented 7 years ago

Thanks very much for your reply!

The unigram prob file is created using psdvec/gramcount.pl. It's written in Perl with inline C++ for speed-up. Sorry for mixing Perl with Python. It's just for historical reasons. The inline C++ should compile easily under Linux or using Strawberry Perl under Windows, but not compatible with the popular Windows Perl distribution ActivePerl.

May I ask what should be the correct input for psdvec/gramcount.pl? For example one tweet per line in .txt format?

For the existing embedding file, do you refer to 25000-180000-*? It's generated using PSDVec (my own MF-based embedding method). See https://github.com/askerlee/topicvec/tree/master/psdvec for more details.

I am referring to -v vec_file; is it recommended to use PSDVec generated embedding file? Can I use Glove Twitter word embeddings? If so, what's the requirement here, do I have to make sure all words in my corpus can be found in the embedding file? Also what's the format required for the embedding file (e.g. glove format or gensim format)?

Thanks again for your guidance! :+1:

askerlee commented 7 years ago

what should be the correct input for psdvec/gramcount.pl? For example one tweet per line in .txt format?

It should be a cleaned plain-text file containing the training corpus. The corpus usually contains many documents; a commonly used one is the English Wikipedia dump. Sentence and document boundaries are unimportant, so you can remove the "\r\n" at the end of each line and concatenate all documents into one txt file. For English, non-ASCII characters and punctuation marks should be removed. psdvec/cleancorpus.py does this preprocessing. Tweets, even millions of them, are too small a corpus for training word embeddings.
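The preprocessing described above could be sketched roughly as follows. This is not the code of psdvec/cleancorpus.py; the function names and the exact rules (for instance, replacing every non-alphanumeric character with a space) are assumptions made for illustration.

```python
import re

def clean_line(line):
    """Strip non-ASCII characters, punctuation, and line endings.

    Hypothetical helper; the real cleancorpus.py may apply different rules.
    """
    # Drop any character that cannot be encoded as ASCII.
    ascii_only = line.encode("ascii", errors="ignore").decode("ascii")
    # Replace everything except letters, digits, and whitespace with a space.
    no_punct = re.sub(r"[^A-Za-z0-9\s]", " ", ascii_only)
    # Collapse whitespace runs (including "\r\n") into single spaces.
    return " ".join(no_punct.split())

def build_corpus(documents):
    """Concatenate cleaned documents into one long string of tokens."""
    cleaned = (clean_line(doc) for doc in documents)
    return " ".join(text for text in cleaned if text)

docs = ["Hello, world!\r\n", "Caf\u00e9 opens at 9am.\r\n"]
print(build_corpus(docs))  # Hello world Caf opens at 9am
```

Note that this simple rule also splits contractions such as "don't" into two tokens, which may or may not be what you want for tweets.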

I am referring to -v vec_file; is it recommended to use PSDVec generated embedding file? Can I use Glove Twitter word embeddings? If so, what's the requirement here, do I have to make sure all words in my corpus can be found in the embedding file? Also what's the format required for the embedding file (e.g. glove format or gensim format)?

Yeah, PSDVec embeddings are recommended. I haven't tried GloVe embeddings, but I did run experiments with word2vec embeddings. Their document classification performance was a few percent worse than that of the topic embeddings generated from PSDVec embeddings. The format is the same as that of word2vec embedding files.
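For reference, a minimal sketch of reading an embedding file, assuming it uses the word2vec text (non-binary) layout: a header line with the vocabulary size and dimensionality, then one word per line followed by its vector components. The function names are hypothetical, and how topicvec itself treats corpus words that are missing from the embedding file is not stated in this thread; the second helper only reports them.

```python
import io

def load_word2vec_text(handle):
    """Read word vectors from a word2vec text-format stream.

    Assumed layout: first line "<vocab_size> <dim>", then
    "<word> <v1> ... <vdim>" on each following line.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim

def missing_words(corpus_words, vectors):
    """Return corpus words that have no embedding, in first-seen order."""
    seen = set()
    missing = []
    for word in corpus_words:
        if word not in vectors and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing

sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 1 2 3\n")
vectors, dim = load_word2vec_text(sample)
print(dim, sorted(vectors))
print(missing_words(["hello", "lol", "world", "lol"], vectors))
```

With a real file, pass `open(path, encoding="utf-8")` instead of the `io.StringIO` stand-in.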

bwang482 commented 7 years ago

It should be a cleansed txt file containing the training corpus. The corpus usually contains many documents, and a commonly used corpus is English Wikipedia dump. The sentence or document boundaries are unimportant. So you could remove "\r\n" at the end of each line and concatenate all documents into one txt file. For the English language, non-ascii characters and punctuation marks should be removed. psdvec/cleancorpus.py does such preprocessing. Tweets, even millions of them, are too small as the corpus for training word embeddings.

Yeah but they should be enough for creating the unigram prob file right?

Also, do you think I can use 25000-180000-500-BLK-8.0.vec aka the PSDVec generated embedding file for my tweet corpus?

Thanks!

askerlee commented 7 years ago

Yeah but they should be enough for creating the unigram prob file right?

It's better to use the same corpus to generate the unigram probs and embeddings. I'm not sure what adverse effects it will bring if you use a different file. Using the same file is safe. The unigram prob file extracted from Wikipedia is available online (in the same folder as the 25000-180000* files).
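As a rough illustration of what a unigram probability is, the sketch below counts tokens and divides by the total. It is written in Python for readability and is not a port of psdvec/gramcount.pl; the on-disk format that script writes is not reproduced here, and `unigram_probs` is a hypothetical name.

```python
from collections import Counter

def unigram_probs(tokens):
    """Map each word to its relative frequency in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}

tokens = "the cat sat on the mat".split()
probs = unigram_probs(tokens)
print(probs["the"])  # 2 of the 6 tokens are "the"
```

Because these frequencies depend on the corpus, probabilities counted on tweets will differ from the ones the Wikipedia-trained embeddings were fitted with, which is the mismatch the advice above avoids.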

Also, do you think I can use 25000-180000-500-BLK-8.0.vec aka the PSDVec generated embedding file for my tweet corpus?

Yes sure. I used it to extract topics from many tweets. The results looked good.

bwang482 commented 7 years ago

Hi again askerlee,

May I ask is docs_Em the "posterior document-topic distributions of the test sets were derived by performing one E-step" that you mentioned in your paper for your document classification experiment? If so you used argmax of each row of docs_Em to infer the class of this document?

Also, how do I go about generating document embeddings that I can use for downstream tasks? What do you mean by jointly representing a document by topic proportions and topic embeddings?

Thank you very much!

askerlee commented 7 years ago

is docs_Em the "posterior document-topic distributions of the test sets were derived by performing one E-step"

Yeah it is.

you used argmax of each row of docs_Em to infer the class of this document?

No. We need to train a document classifier using docs_Em as the features of the documents. There usually is no one-to-one correspondence between topics and document classes. For example we have 10 classes but we have 100 topics. We don't even know which subset of topics correspond to a document class. The same happens to LDA topics. Even if a topic is about sports, a document in "tech" class may also contain a small fraction of this topic. A classifier such as SVM could infer the complex decision boundaries automatically.
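A toy sketch of the idea: each row of docs_Em is treated as a feature vector, and a classifier is fitted on labelled training rows. To stay dependency-free, it uses a nearest-centroid classifier as a stand-in for the SVM mentioned above; the data, labels, and function names are invented for illustration.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_centroids(features, labels):
    """Compute one mean feature vector per class label."""
    by_class = {}
    for row, label in zip(features, labels):
        by_class.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in by_class.items()}

def predict(centroids, row):
    """Return the label whose centroid is closest to the row."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], row))

# Toy docs_Em: 4 training documents x 3 topics, 2 classes (made-up numbers).
train_Em = [[5.0, 1.0, 0.0], [4.0, 2.0, 0.0], [0.0, 1.0, 5.0], [1.0, 0.0, 4.0]]
train_labels = ["sports", "sports", "tech", "tech"]

centroids = fit_centroids(train_Em, train_labels)
print(predict(centroids, [3.0, 3.0, 1.0]))  # sports
```

The classifier uses the whole topic vector, so a class can be recognised from a mixture of topics rather than from a single dominant one.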

how do I come about to generate document embeddings that I can use for downstream tasks

Currently I only tried using topic proportions as the document embeddings. Jointly using topic proportions and topic embeddings is conceptually possible, for example we could keep the top 3 topics (with highest proportions), say their embeddings are t1,t2,t3 and proportions are a1,a2,a3. Then we could represent the document as [a1*t1, a2*t2, a3*t3], i.e. concatenating the three embeddings into one longer embedding. If the original topic embedding is 500-dimensional, then the document embedding would be 1500-dimensional. But I haven't tried the performance of this representation. One caveat is you have to consider the shuffling of topic orders. For example [t1,t2,t3] should be equivalent to [t3,t2,t1], since topics are unordered in a document.
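The concatenation idea could look like the sketch below. As said above, this representation is untested; `document_embedding` and the toy numbers are hypothetical, and 2-dimensional topic embeddings are used instead of 500-dimensional ones to keep the output readable.

```python
def document_embedding(proportions, topic_vecs, top_k=3):
    """Concatenate the top_k topic embeddings, each scaled by its proportion.

    proportions: one topic proportion per topic for a single document.
    topic_vecs:  one embedding (list of floats) per topic, in the same order.
    """
    # Topic indices sorted by proportion, highest first.
    ranked = sorted(range(len(proportions)),
                    key=lambda k: proportions[k], reverse=True)
    embedding = []
    for k in ranked[:top_k]:
        embedding.extend(proportions[k] * x for x in topic_vecs[k])
    return embedding

proportions = [0.1, 0.6, 0.05, 0.25]
topic_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
print(document_embedding(proportions, topic_vecs))
# [0.0, 0.6, 0.5, 0.5, 0.1, 0.0]  (topics 1, 3, 0 in that order)
```

Here the slots are filled in order of decreasing proportion, so the same topic can land in different slots for different documents. That is exactly the ordering caveat mentioned above.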

bwang482 commented 7 years ago

Currently I only tried using topic proportions as the document embeddings. Jointly using topic proportions and topic embeddings is conceptually possible, for example we could keep the top 3 topics (with highest proportions), say their embeddings are t1,t2,t3 and proportions are a1,a2,a3. Then we could represent the document as [a1*t1, a2*t2, a3*t3], i.e. concatenating the three embeddings into one longer embedding. If the original topic embedding is 500-dimensional, then the document embedding would be 1500-dimensional. But I haven't tried the performance of this representation. One caveat is you have to consider the shuffling of topic orders. For example [t1,t2,t3] should be equivalent to [t3,t2,t1], since topics are unordered in a document.

Ah cool, so that's why the Word Mover's Distance paper is cited in the conclusions of your paper. By topic proportions, I am guessing you meant docs_Em? Or perhaps normalised docs_Em so it is a prob distribution sums to 1?

Hmm, not sure what you meant by the shuffling of topic orders? So the topic order in best.topic.vec and last.topic.vec is different to the topic order in docs_Em? If so how do I make sure they are the same?

Thank you very much!

askerlee commented 7 years ago

By topic proportions, I am guessing you meant docs_Em? Or perhaps normalised docs_Em so it is a prob distribution sums to 1?

Yeah sorry I was a bit sloppy here. docs_Em should be normalized to get the topic proportions of each document. But in practice, using docs_Em as features, or using docs_Em/len(doc) as features leads to very similar classification performance. So I just used docs_Em to represent the documents.
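A minimal sketch of the normalization, assuming each row of docs_Em holds non-negative per-topic weights for one document. It divides by the row sum so that the result sums to 1; whether the row sum equals len(doc) exactly in topicvec is an assumption not verified here, and `topic_proportions` is a hypothetical name.

```python
def topic_proportions(doc_em_row):
    """Normalize one docs_Em row into proportions that sum to 1."""
    total = sum(doc_em_row)
    if total == 0:
        # Empty document: no topic mass to distribute.
        return [0.0] * len(doc_em_row)
    return [value / total for value in doc_em_row]

row = [6.0, 3.0, 1.0]
print(topic_proportions(row))
```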

not sure what you meant by the shuffling of topic orders?

It depends on how you represent the document using multiple embeddings. In the example use I gave above, suppose we represent a document as a long vector concatenated from multiple (weighted) topic embeddings, let's refer to it as a document embedding. Then when you compute the similarity of two document embeddings using cosine similarity, you have to make sure each dimension means the same thing, otherwise the cosine is meaningless. That's why I said you have to take care of the order of topic embeddings in this representation.

But of course you could also regard multiple topic embeddings in a document as a set, in which they are unordered. Then you would use more sophisticated algorithms such as Word Mover's distance to get the similarity of two such sets.
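The ordering issue can be seen in a tiny example. The sketch below builds two concatenated embeddings for the same document, differing only in the order of the two topic slots, and compares them with cosine similarity. All values and names are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

t1, t2 = [1.0, 0.0], [0.0, 1.0]   # two toy topic embeddings
a1, a2 = 0.7, 0.3                 # their proportions in one document

# Same document, same topics, but the slots are filled in a different order.
doc_a = [a1 * x for x in t1] + [a2 * x for x in t2]
doc_b = [a2 * x for x in t2] + [a1 * x for x in t1]

print(cosine(doc_a, doc_a))  # identical layout: similarity of about 1
print(cosine(doc_a, doc_b))  # swapped slots: 0.0 for these toy vectors
```

Both vectors describe the same document, yet the cosine drops to zero once the slots are swapped, which is why the slot order has to be fixed (or a set-based distance used instead).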

So the topic order in best.topic.vec and last.topic.vec is different to the topic order in docs_Em? If so how do I make sure they are the same?

They are two different sets of topic embeddings. (They are usually very similar but this is not guaranteed) So it's pointless to discuss whether they have the same order of topics. You could use either set, as long as you keep it consistent in that you use the same set of embeddings for training the document classifier (or other tasks) and for testing.

bwang482 commented 7 years ago

You're quite right Lee! Sorry about the confusion, I thought by "shuffling of topic orders" you meant the topic ordering in topic vectors e.g. best.topic.vec is different to the topic order in docs_Em. Thanks for clearing that up!