Flat topic distributions in author-topic model

dongwookim-ml / python-topic-model

Implementation of various topic models

Apache License 2.0

Flat topic distributions in author-topic model #8

olavurmortensen closed this issue 7 years ago

olavurmortensen commented 7 years ago

I tried running the author-topic model notebook. I noticed that the topic distributions of many authors were flat, meaning that all topics were equally likely. See example below.

I did not change the notebook in any way, so I suspect there is some error in the algorithm/code, although I have no inkling of what it might be. Just thought I'd share.

Is someone else able to reproduce this, or is it just me? Or did I misunderstand, and this is actually expected to happen?

dongwookim-ml commented 7 years ago

Hi, I cannot reproduce the same result so far. Did you find that all the other authors also exhibit the uniform distribution? If that's not the case, the sampler has probably assigned zero tokens to that author.
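To illustrate why zero assigned tokens would look flat, here is a minimal sketch. It assumes the per-author topic distribution is estimated from an author-by-topic count matrix smoothed with a symmetric Dirichlet prior, which is a common estimator but not necessarily the exact code in this repository. The function name, the `alpha` value, and the toy counts are all hypothetical.

```python
import numpy as np

def author_topic_distribution(counts, alpha=0.1):
    """Normalise smoothed author-by-topic counts into distributions.

    counts: array of shape (n_authors, n_topics) with token counts.
    alpha:  symmetric Dirichlet smoothing (hypothetical value).
    """
    counts = np.asarray(counts, dtype=float)
    smoothed = counts + alpha
    # Each row is divided by its own total so it sums to 1.
    return smoothed / smoothed.sum(axis=1, keepdims=True)

# Toy example: author 0 has tokens, author 1 has none.
counts = np.array([[12, 0, 3],
                   [0, 0, 0]])
theta = author_topic_distribution(counts)
```

With no counts, only the prior remains, so the row for author 1 comes out uniform over the topics while the row for author 0 does not.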

olavurmortensen commented 7 years ago

@arongdari No, not all the authors exhibit the uniform distribution, but it seems most of them do.

Just tried again on a different machine, starting from a clean virtual environment, same result.

Did you try plotting some other distributions? The ones you plot in the notebook, 7 and 32, are fine; it was only when I tried plotting others that I discovered something was wrong.

dongwookim-ml commented 7 years ago

Sorry for the late reply. There are two possible reasons for this.

1. There is no document written by that author (due to some data cleaning steps)
2. No tokens are assigned to that author (with very low probability)

I just checked the author with author_id = 1, who also has a flat distribution, and it turns out that there is no document written by this author. So it corresponds to case 1 here. I suspect that case 2 will not occur very frequently unless an author wrote only a single document that was itself written by many authors.
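A quick way to check for case 1 on your own data is to count the documents per author. This is only a sketch: it assumes the corpus is available as a list holding the author ids of each document, and `doc_authors`, `n_authors`, and `docs_per_author` are hypothetical names rather than anything from the notebook.

```python
import numpy as np

def docs_per_author(doc_authors, n_authors):
    """Count how many documents each author id appears in."""
    counts = np.zeros(n_authors, dtype=int)
    for authors in doc_authors:
        # set() guards against an author listed twice on one document.
        for a in set(authors):
            counts[a] += 1
    return counts

# Toy corpus: three documents, five author ids (0..4).
doc_authors = [[0, 2], [2], [0, 2, 3]]
n_docs = docs_per_author(doc_authors, n_authors=5)

# Authors with no remaining documents (case 1).
authors_without_docs = np.flatnonzero(n_docs == 0)
```

Any author id in `authors_without_docs` has nothing to be sampled from, so a flat distribution for that id is expected rather than a bug.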

olavurmortensen commented 7 years ago

You're probably right. I tried it myself on a different dataset (NIPS) and am not experiencing this problem.