MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

There is a mismatch between `output["topic-word-matrix"]` and `dataset.get_vocabulary()` in terms of words? #86

Open Zay-Ben opened 1 year ago

Zay-Ben commented 1 year ago

There is a mismatch between output["topic-word-matrix"] and dataset.get_vocabulary() in terms of words?

I created a DataFrame as follows:

df = pd.DataFrame(data = output["topic-word-matrix"], columns = dataset.get_vocabulary()).T

When I sort the DataFrame by a topic's column to get the top words for that topic, why do the results differ from output["topics"][i]?
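The comparison described above can be sketched in plain Python with made-up numbers. This is only an illustration: `top_words`, the toy `matrix`, and the toy `vocab` are hypothetical names and data, standing in for output["topic-word-matrix"] and dataset.get_vocabulary().

```python
def top_words(topic_word_matrix, vocabulary, k=5):
    """Return the k highest-weighted words for each topic (one row per topic)."""
    tops = []
    for row in topic_word_matrix:
        if len(row) != len(vocabulary):
            raise ValueError("row length and vocabulary size differ")
        # Sort column indices by descending weight, then map indices to words.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        tops.append([vocabulary[j] for j in order[:k]])
    return tops


# Toy data (made up): 2 topics over a 4-word vocabulary.
matrix = [
    [0.10, 0.60, 0.05, 0.25],
    [0.40, 0.05, 0.50, 0.05],
]
vocab = ["bill", "network", "refund", "signal"]

from_matrix = top_words(matrix, vocab, k=2)
print(from_matrix)
```

If the lists produced this way differ from output["topics"], the labelling only makes sense when column j of the matrix really corresponds to word j of the vocabulary list, so the two orderings are the first thing to check.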

Thank you!

silviatti commented 1 year ago

There should be a one-to-one correspondence between the two. It's difficult to say what is wrong. Can you share more details about the problem?

Zay-Ben commented 1 year ago

Good day Dr. Silvia, nice to see you again, and thank you for your reply. Here are the details of the issue. :)

First, I created a dataset folder containing two files, namely corpus.txt and vocabulary.tsv, as OCTIS requires.

The corpus file:

image

The vocabulary file (sorted alphabetically):

image

Second, I loaded the dataset and trained LDA models with the dataset.

image

image

image

Third, after training, I loaded one of the LDA models and built a data frame with the model’s topic-word-matrix as the data and the dataset’s vocabulary as the columns. The resulting data frame is shown in the figure below:

image

Last, the top 5 words of the data frame’s first topic are different from the top 5 words of the model’s first topic.

image

I can't determine why there are discrepancies in the top words of the topics.

With appreciation,

Benz

silviatti commented 1 year ago

Hi Benz, sorry for the late reply. I haven't had time to work on OCTIS in recent months. There's something weird, I agree. I would suggest two experiments in case you're still interested in this issue:

Thanks for your patience.

Silvia

Zay-Ben commented 1 year ago

Dear Dr. Silvia,

Thank you for taking the time to address my questions.

Regarding the first question, the results show that the order of the vocabulary differs before and after loading it with OCTIS. The vocabulary was sorted alphabetically before loading and appears to be randomly shuffled afterwards, as shown in the images of the first five terms of each vocabulary. image image
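A check like this one can be sketched in a few lines of plain Python. The function name `compare_vocab_order` and the toy word lists are hypothetical; in practice `before` would be the words read from the vocabulary file and `after` the list returned by dataset.get_vocabulary().

```python
def compare_vocab_order(before, after):
    """Summarise how two vocabulary lists differ in content and in order."""
    before, after = list(before), list(after)
    # Position of the first index where the two lists disagree, if any.
    first_diff = next(
        (i for i, (a, b) in enumerate(zip(before, after)) if a != b), None
    )
    return {
        "same_words": sorted(before) == sorted(after),
        "same_order": before == after,
        "after_is_sorted": after == sorted(after),
        "first_difference_at": first_diff,
    }


# Toy data (made up): same words, different order.
before = ["bill", "network", "refund", "signal"]
after = ["network", "bill", "signal", "refund"]

report = compare_vocab_order(before, after)
print(report)
```

A result with the same words but a different order would suggest that the vocabulary was reordered rather than changed while loading.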

Regarding the second question, I trained two models (ETM and NMF) using the same dataset and found that the problem persists for NMF, but not for ETM, as shown in the figure below. I noticed that OCTIS's LDA and NMF are both from Gensim. Could this be the source of the error?
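One way such a mismatch could arise, sketched here purely as an assumption: if a model keeps its own mapping from column index to word, and that mapping is ordered differently from the vocabulary list, then labelling the matrix columns with the vocabulary list gives the wrong words. Whether this is what happens in the LDA and NMF models is not confirmed in this thread. The `id2word` dictionary and all data below are hypothetical.

```python
def top_k(row, labels, k):
    """Return the k labels with the largest weights in row."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    return [labels[j] for j in order[:k]]


# Toy data (made up): one topic over four words.
row = [0.10, 0.60, 0.05, 0.25]

# Labels taken from an alphabetically sorted vocabulary list.
alphabetical = ["bill", "network", "refund", "signal"]

# Hypothetical model-internal mapping: column index -> word.
id2word = {0: "network", 1: "signal", 2: "bill", 3: "refund"}
model_order = [id2word[j] for j in range(len(row))]

with_vocab_labels = top_k(row, alphabetical, k=2)
with_model_labels = top_k(row, model_order, k=2)
print(with_vocab_labels, with_model_labels)
```

Under this assumption the fix would be to label the columns with the model's own word order instead of the dataset vocabulary.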

ETM: image image

NMF: image image

Just to give context, the dataset consists of tweets that contain customer complaints about telecommunication companies.

Thank you again for your help! Topic modeling has never been easy without OCTIS. 😭

silviatti commented 1 year ago

Hi, just to double-check: when you load the custom dataset, do you have a file in the dataset folder called vocabulary.txt? That should be the vocabulary file where words are sorted alphabetically. I ask because I noticed that your file is called "words.txt", so it is possible that OCTIS doesn't load it.
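A quick way to check this, as a rough sketch: the helper names below are hypothetical, the temporary folder only stands in for the real dataset folder, and the code assumes one word per line in the vocabulary file.

```python
import tempfile
from pathlib import Path


def missing_dataset_files(folder, expected=("vocabulary.txt",)):
    """Return the expected file names that are not present in folder."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]


def is_sorted_vocabulary(path):
    """Check that the words in the file are in sorted order (one word per line assumed)."""
    words = Path(path).read_text(encoding="utf-8").split()
    return words == sorted(words)


# Demo in a throwaway folder: the vocabulary is first saved under another name.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "words.txt").write_text("bill\nnetwork\nrefund\n", encoding="utf-8")
    missing_before = missing_dataset_files(folder)

    # Rename the file to the name mentioned above and check again.
    (folder / "words.txt").rename(folder / "vocabulary.txt")
    missing_after = missing_dataset_files(folder)
    sorted_ok = is_sorted_vocabulary(folder / "vocabulary.txt")

print(missing_before, missing_after, sorted_ok)
```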

Let me know :)