Thank you for your interest :)
As you said, there are two main pickle files in the code.
wordvec.pk contains the subset of words from the GloVe word2vec file that actually appear in your corpus (i.e. the tweets). You may want to do some standard preprocessing first so that you have a good-quality corpus, and then build wordvec.pk only for the words in the cleaned dataset.
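Something along these lines should work, though I haven't run this exact snippet; the file paths, the gensim loader, and the NLTK tokenizer are just placeholders for whatever you already use:

```python
import pickle

import numpy as np
from gensim.models import KeyedVectors
from nltk.tokenize import TweetTokenizer

# Pre-trained GloVe vectors, converted to word2vec text format beforehand
# (e.g. with gensim's glove2word2vec script); the path is illustrative.
glove = KeyedVectors.load_word2vec_format("glove.twitter.27B.50d.w2v.txt")

# Vocabulary of the (already cleaned) tweet corpus, one tweet per line.
tokenizer = TweetTokenizer(preserve_case=False)
vocab = set()
with open("tweets.txt", encoding="utf-8") as f:
    for line in f:
        vocab.update(tokenizer.tokenize(line.strip()))

# Keep only the words that both occur in the corpus and have a GloVe vector.
vectors_dict = {w: np.asarray(glove[w], dtype=np.float64)
                for w in vocab if w in glove}

with open("wordvec.pk", "wb") as f:
    pickle.dump(vectors_dict, f)
```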
texts.pk stores each document as a "bag of words" list: for each document you have a list of (word, frequency) pairs, where the frequency is that word's count within the document. You can convert a document to this format using the NLTK package in Python.
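Continuing the sketch above (again untested, and it assumes every token kept in texts.pk should also have a vector in wordvec.pk):

```python
import pickle
from collections import Counter

from nltk.tokenize import TweetTokenizer

# Reuse the embedding dict built above so that every token in texts.pk has a
# vector; drop this filter if that assumption doesn't hold for your setup.
with open("wordvec.pk", "rb") as f:
    vectors_dict = pickle.load(f)

tokenizer = TweetTokenizer(preserve_case=False)

texts = []
with open("tweets.txt", encoding="utf-8") as f:  # one tweet per document
    for line in f:
        tokens = [t for t in tokenizer.tokenize(line.strip()) if t in vectors_dict]
        if tokens:
            # per-document bag of words, e.g. [('time', 2), ('good', 1)]
            texts.append(list(Counter(tokens).items()))

with open("texts.pk", "wb") as f:
    pickle.dump(texts, f)
```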
Let me know if this doesn't answer your question or you have any other questions.
Best, Ardavan
Thanks for your reply, Ardavans! Just want to re-confirm the two points.
Hmm, I originally thought each document in texts.pk contains pairs of words and their frequency in the whole corpus, as is usually the case, right? I have a collection of tweets, and each tweet is treated as a document. But I think in my case each word should be paired with its frequency in the whole collection; do you agree?
For wordvec.pk, did you mean that if I want to train my own word2vec model, I should preprocess my training corpus (the tweets in this case) beforehand?
Best, Bo
No problem!
Regarding texts.pk: I think it should be the frequency of the words within each document. See, for instance, the NIPS dataset available in the repo: there are 1566 documents, and in the first document the word 'weights' occurs 7 times.
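You can check the format quickly like this (the path is just whatever your pickle is called):

```python
import pickle

# Quick sanity check of the per-document format.
with open("texts.pk", "rb") as f:
    texts = pickle.load(f)

print(len(texts))      # number of documents, e.g. 1566 for the NIPS data
print(texts[0][:5])    # first few (word, count) pairs of the first document
```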
Regarding wordvec.pk: I meant that even after training there might be some words that are less important for you; you can remove those with some preprocessing. However, if you've already taken care of those words during training, then you won't need to do that again.
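For example, one simple preprocessing pass could prune the vocabulary before you build wordvec.pk (just a sketch; the stopword list and the minimum count are arbitrary choices):

```python
from collections import Counter

from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

def filter_vocabulary(tokenized_tweets, vectors_dict, min_count=3):
    """Drop English stopwords and very rare tokens from the embedding dict.

    `tokenized_tweets` is a list of token lists, one per tweet.
    """
    stop = set(stopwords.words("english"))
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    keep = {tok for tok, c in counts.items() if c >= min_count and tok not in stop}
    return {w: v for w, v in vectors_dict.items() if w in keep}
```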
Let me know if you have any questions.
Best, Ardavan
Thanks very much Ardavans!
Indeed, you're right. But is the frequency of a word used for normalising its word vector here, or is it used for something else? I have quickly checked your code and I can't see where this frequency is used.
Also, do I need to normalise my GloVe vectors to unit L2 norm beforehand, or is this normalisation already integrated into your code? Thanks :)
As far as I remember, the frequency is used for the variational inference updates and not for normalizing the word vectors.
Also, I think we did normalize the GloVe vectors in the code, but there is no harm in doing it again if you are not doing it already :)
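If you want to be safe, normalizing the embedding dict yourself is only a few lines (a minimal sketch, assuming the dict maps words to numpy-compatible vectors):

```python
import numpy as np

def l2_normalize(vectors_dict):
    """Return a copy of the embedding dict with every vector scaled to unit L2 norm."""
    normalized = {}
    for word, vec in vectors_dict.items():
        vec = np.asarray(vec, dtype=np.float64)
        norm = np.linalg.norm(vec)
        if norm > 0.0:  # skip any all-zero vectors
            normalized[word] = vec / norm
    return normalized
```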
Thanks !! :+1:
Final question :P. Is there any necessity or caveat for parameter tuning on tweets, apart from Nmax and mbsize, which I need to adjust for my use? I would prefer to use the default settings if they were found empirically to be the most effective and no significant change is expected on different data. This is the command I'm running:
./runner.py -is 1 -alpha 1 -gamma 2 -Nmax 40 -kappa_sgd 0.6 -tau 0.8 -mbsize 10 -dataset twitter
No problem :)
Unfortunately, that's a really hard question to answer as I haven't tried the code on very short documents like tweets. I guess alpha and gamma parameters may also need to be changed since you don't expect to see so many topics in a single tweet. However, I might be wrong and it may also work with the default setting.
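If you do end up experimenting, a small sweep over alpha and gamma is easy to script. This is just an illustration: every other flag is copied from your command above, and the grid of values is arbitrary:

```python
import subprocess

# Illustrative sweep over the two concentration parameters; all other flags
# are taken verbatim from the command posted earlier in this thread.
for alpha in ("0.5", "1", "2"):
    for gamma in ("1", "2", "4"):
        subprocess.run(
            ["./runner.py", "-is", "1", "-alpha", alpha, "-gamma", gamma,
             "-Nmax", "40", "-kappa_sgd", "0.6", "-tau", "0.8",
             "-mbsize", "10", "-dataset", "twitter"],
            check=True,
        )
```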
Thanks for all the help Ardavans!! :)
I shall close this issue now.
No problem at all! Good luck with your project!
Thanks very much for your work Ardavan!
I have noticed sHDP requires a pickled data file and a pickled word embedding in dict format. I have a bunch of tweets that I want to test on your model (one tweet per line). May I ask how I would go about transforming them into the required format, please? In texts, for example, does ('time', 2) mean the token 'time' occurred twice in the whole corpus? Is each list of tuples a tokenised document, and is it unordered? Is vectors_dict merely a subsample of the pre-trained word2vec embeddings, since it only has 4768 items? Thanks again!