can't reproduce the preprocessed data #10

quynhneo commented 3 years ago

Hi there, I ran https://github.com/adjidieng/DETM/blob/master/scripts/data_undebates.py on the Kaggle data for the UN debates (as linked in your paper: https://www.kaggle.com/unitednations/un-general-debates), but I am unable to reproduce the preprocessed data you linked here: https://bitbucket.org/franrruiz/data_undebates_largev/src/master/ (the variables in my .mat files are different from yours). Any idea? There are not many settings besides min_df and max_df. I used the defaults; perhaps you used something else?
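For anyone following along, min_df and max_df are document-frequency cutoffs on the vocabulary, so different values give a different vocabulary and therefore different .mat contents. Here is a minimal pure-Python sketch of that kind of filtering, assuming min_df is an absolute document count and max_df is a fraction of documents. The function name `filter_vocab` and the toy documents are illustrative only and are not taken from data_undebates.py.

```python
from collections import Counter

def filter_vocab(docs, min_df=2, max_df=0.7):
    """Keep words whose document frequency lies in [min_df, max_df * n_docs].

    Assumption: min_df is an absolute count, max_df a fraction of documents.
    """
    n_docs = len(docs)
    # Count each word at most once per document (document frequency).
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return sorted(
        word for word, count in doc_freq.items()
        if min_df <= count <= max_df * n_docs
    )

# Toy tokenized documents (hypothetical data).
docs = [
    ["peace", "security", "council"],
    ["peace", "development", "council"],
    ["peace", "climate"],
    ["peace", "security"],
]
vocab = filter_vocab(docs, min_df=2, max_df=0.7)
print(vocab)
```

In this toy case "peace" appears in every document and is dropped by max_df, while "development" and "climate" appear only once and are dropped by min_df.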

mona-timmermann commented 3 years ago

Might be too obvious, but could it just be because of the random permutation with no seed? Apart from that, I've observed a lot of things I had to change in the code to get it to run and to implement the model as described in the paper. I was never able to reproduce the results using the original code.
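To illustrate the seed point: an unseeded permutation gives a different train/test assignment on every run, so the saved files cannot match exactly. Below is a minimal sketch using only the standard library. The helper `split_indices`, the 80% training fraction, and the seed value are my own assumptions for illustration, not what the original script does.

```python
import random

def split_indices(n_docs, train_frac=0.8, seed=None):
    """Shuffle document indices and split them into train and test parts.

    With seed=None the split changes between runs; a fixed seed makes it repeatable.
    """
    indices = list(range(n_docs))
    # A dedicated Random instance avoids touching the global random state.
    random.Random(seed).shuffle(indices)
    n_train = int(train_frac * n_docs)
    return indices[:n_train], indices[n_train:]

# Same seed, same split (the seed value is arbitrary).
train_a, test_a = split_indices(100, seed=2020)
train_b, test_b = split_indices(100, seed=2020)
print(train_a == train_b, test_a == test_b)
```

Even with a fixed seed, matching the published files would also require the same random number generator and library version as the authors used, which is not known here.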

quynhneo commented 3 years ago

hm...possibly. Same here on having to change a lot. Perhaps we should submit some PRs.

Emekaborisama commented 3 years ago

Let's work on converting it to a Python library @quynhneo @mona-timmermann

What do you think?

Although I have noticed a new error that occurs on a large dataset.

quynhneo commented 3 years ago

Not a bad idea... Ideally @adjidieng would support the idea.

Emekaborisama commented 3 years ago

I can talk to @adjidieng tomorrow and I will keep you posted on her response.

wyt? @mona-timmermann

Emekaborisama commented 3 years ago

Adji said we can proceed, but we will upload the package as a branch on this repo. @quynhneo @mona-timmermann let's get this done.

yangyijane commented 3 years ago

@Emekaborisama Hi, any updates on the Python script to reproduce this study? Thank you very much.

yangyijane commented 3 years ago

that's cool. thx.

On Wed, Feb 3, 2021 at 4:47 PM Quynh M. Nguyen notifications@github.com wrote:

I have made it work; see my fork https://github.com/quynhneo/DETM

yangyijane commented 3 years ago

Hi Mr Nguyen,

I have a follow-up question regarding the script that runs DETM after preprocessing all the data. I checked your script and saw that you split the data into training and test sets.

Why did you do that? I thought this was supposed to be unsupervised learning. Thank you very much.

quynhneo commented 3 years ago

According to the paper, they calculate perplexity using test documents.
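As a rough illustration of why held-out documents are needed even without labels: perplexity is the exponential of the negative average log-likelihood per token on documents the model did not train on. The sketch below is simplified and uses a fixed unigram distribution as a stand-in for the model's predictive distribution. The function name and toy data are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math

def perplexity(test_docs, word_probs):
    """exp(-(sum of log p(w)) / (number of tokens)) over held-out documents.

    word_probs is a stand-in for a trained model's predictive word distribution.
    """
    total_log_lik = 0.0
    n_tokens = 0
    for doc in test_docs:
        for word in doc:
            total_log_lik += math.log(word_probs[word])
            n_tokens += 1
    return math.exp(-total_log_lik / n_tokens)

# Toy check: a uniform distribution over 4 words has perplexity 4.
uniform = {w: 0.25 for w in ["a", "b", "c", "d"]}
ppl = perplexity([["a", "b"], ["c", "d", "a"]], uniform)
print(ppl)
```

Lower perplexity means the model assigns higher probability to unseen text, so the test split is only for evaluation and involves no labels.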
