DARIAH-DE / TopicsExplorer

Explore your own text collection with a topic model – without prior knowledge.
https://dariah-de.github.io/TopicsExplorer
Apache License 2.0

Changing Results and Dominance Score #126

Open Glorifier85 opened 3 years ago

Glorifier85 commented 3 years ago

Hi community,

whenever I run the same corpus with exactly the same parameters, I get different results, e.g. a different ranking of topics. I assume that one could minimize that through thorough data cleansing? Besides that, are there additional methods to increase the reliability of the results?

My second question revolves around the dominance score, which is used to rank the topics. How exactly is each score calculated? I'm asking because I have run corpora in the past, exported the results, and added up the numbers in the topic distribution spreadsheet for each topic (I assume these are the dominance scores). My understanding is that these sums should match the visual ranking of the topics, but sometimes they did not. Shouldn't whichever topic ranks first in the graphic also have the highest combined dominance score across all documents, or is there more to it than just adding them up? Most of the time it matches, but sometimes it just doesn't.

Thanks!

severinsimmler commented 3 years ago

Topic modeling is probabilistic. Two probability distributions are iteratively estimated:

  1. How likely is a word for a topic? The ten or so most likely words are then usually interpreted as a "topic".
  2. How likely is a topic for a document? This probability is the dominance score. I can recommend this paper by David Blei – it deals with the mathematical details in a rather comprehensible way.
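
To make the second distribution concrete, here is a minimal sketch. The matrix values and the function name `rank_topics_by_total` are made up for illustration, and summing per topic is only the procedure described in the question; it is not confirmed to be how the Topics Explorer ranks topics in its graphic.

```python
import numpy as np

# Hypothetical document-topic matrix (values invented for illustration).
# Rows are documents, columns are topics; each entry is the probability
# of a topic for a document, i.e. a dominance score.
doc_topic = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.30, 0.10, 0.60],
])

# Each row is a probability distribution over topics, so it sums to 1.
assert np.allclose(doc_topic.sum(axis=1), 1.0)


def rank_topics_by_total(matrix):
    """Return topic indices ordered by their summed score over all documents."""
    totals = matrix.sum(axis=0)   # one total per topic (column sums)
    return np.argsort(-totals)    # highest total first


ranking = rank_topics_by_total(doc_topic)
print("Topic indices, highest summed score first:", ranking)
```

Running the same function on an exported topic distribution table gives a ranking that can be compared by hand with the order shown in the application.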

Both distributions are initialized randomly. Because we do not set a random seed in the application, you will never get exactly the same results with the same texts and the same parameters, although the results remain comparable. You can more or less call this a bug.

Historically, the Topics Explorer was developed for didactic purposes, to introduce users to the method as quickly and straightforwardly as possible. But since you obviously have an advanced and complex use case, you could switch to MALLET, a command-line tool that is in my opinion also quite easy to use (but has no graphical interface). With MALLET you can explicitly set a random seed and get deterministic results. The output is similar to the Topics Explorer text files with the topics and distributions.
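
The effect of a seed on random initialization can be sketched with NumPy alone. This is not the Topics Explorer's code: the helper `random_init` and the Dirichlet starting point are assumptions chosen only to show that a fixed seed reproduces the same starting values, while an unseeded generator does not.

```python
import numpy as np


def random_init(n_docs, n_topics, seed=None):
    """Draw a random document-topic starting point (hypothetical helper).

    One Dirichlet sample per document, so every row sums to 1.
    With seed=None the generator is seeded from OS entropy.
    """
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.ones(n_topics), size=n_docs)


# Without a seed, two runs start from different values.
unseeded_a = random_init(4, 3)
unseeded_b = random_init(4, 3)

# With the same seed, two runs start from identical values.
seeded_a = random_init(4, 3, seed=23)
seeded_b = random_init(4, 3, seed=23)

print("unseeded runs identical:", np.array_equal(unseeded_a, unseeded_b))
print("seeded runs identical:  ", np.array_equal(seeded_a, seeded_b))
```

Since the iterative estimation starts from these values, a different starting point can lead to slightly different final distributions, and with them a different ordering of topics.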

Glorifier85 commented 3 years ago

I could switch applications (and probably will) but I need one with a GUI. I am considering the Stanford Topic Modeling Toolbox, which I have heard positive things about.

Based on your explanations, the numbers I find in the topic distribution output are the dominance scores. But then why does the Topics Explorer not rank the topic with the highest sum of dominance scores across all documents as the leading/most prominent topic? With my dataset, the topic with the highest summed dominance score is placed 4th by the app. That still doesn't make sense to me...

Thanks!