There is a mismatch between `output["topic-word-matrix"]` and `dataset.get_vocabulary()` in terms of words?

MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)

MIT License

718 stars, 102 forks

There is a mismatch between `output["topic-word-matrix"]` and `dataset.get_vocabulary()` in terms of words? #86

Open Zay-Ben opened 1 year ago

Zay-Ben commented 1 year ago

There is a mismatch between `output["topic-word-matrix"]` and `dataset.get_vocabulary()` in terms of words?

I created a DataFrame as follows:

df = pd.DataFrame(data = output["topic-word-matrix"], columns = dataset.get_vocabulary()).T

When I sort the DataFrame by a topic column to get the top words for that topic, why do the results differ from `output["topics"][i]`?

Thank you!
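For reference, here is a minimal sketch of the comparison described above, using made-up data rather than real OCTIS output. It assumes the matrix has one row per topic and one column per vocabulary word, with columns in the same order as the vocabulary list. `top_words` is a hypothetical helper, not part of OCTIS.

```python
def top_words(topic_word_matrix, vocabulary, topic, k=5):
    """Return the k highest-weighted words for one topic.

    Assumes column j of the matrix corresponds to vocabulary[j].
    """
    row = topic_word_matrix[topic]
    # Column indices sorted by descending weight.
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    return [vocabulary[j] for j in order[:k]]


# Toy data (not real model output): 2 topics over a 4-word vocabulary.
vocabulary = ["bill", "network", "signal", "slow"]
matrix = [
    [0.10, 0.20, 0.60, 0.10],
    [0.50, 0.05, 0.05, 0.40],
]

first_topic_top2 = top_words(matrix, vocabulary, topic=0, k=2)
second_topic_top2 = top_words(matrix, vocabulary, topic=1, k=2)
print(first_topic_top2)
print(second_topic_top2)
```

If the words obtained this way differ from `output["topics"][i]`, the column-order assumption is the first thing to check.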

silviatti commented 1 year ago

There should be a one-to-one correspondence between the two. It's difficult to say what is wrong. Can you share more details about the problem?

Zay-Ben commented 1 year ago

Good day Dr. Silvia, nice to see you again, and thank you for your reply. Here are the details of the issue. :)

First, I created a dataset folder containing two files, namely corpus.txt and vocabulary.tsv, as OCTIS requires.

The corpus file:

The vocabulary file (sorted alphabetically):

Second, I loaded the dataset and trained LDA models with the dataset.

Third, after training, I loaded one of the LDA models and built a data frame with the model’s topic-word-matrix as the data and the dataset’s vocabulary as the columns. The resulting data frame is shown in the figure below:

Last, the top 5 words of the data frame’s first topic are different from the top 5 words of the model’s first topic.

I can't determine why there are discrepancies in the top words of the topics.

With appreciation,

Benz

silviatti commented 1 year ago

Hi Benz, sorry for the late reply. I haven't had time to work on OCTIS these past months. There's something weird, I agree. I would suggest two experiments in case you're still interested in this issue:

1. Can you also print out `dataset.get_vocabulary()`? Just to see if the vocabulary matches your file.
2. Could you try to repeat the experiment with another model and see if you have the same problem? I'd like to see if the problem is specific to LDA or general.
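The first check could look like the sketch below. It compares the word order in a vocabulary file against a list such as the one returned by `dataset.get_vocabulary()`. `first_order_mismatch` and `read_vocabulary` are hypothetical helpers, and the one-word-per-line file layout is an assumption.

```python
def first_order_mismatch(file_words, loaded_words):
    """Return the first index where the two word lists differ, or None."""
    for i, (a, b) in enumerate(zip(file_words, loaded_words)):
        if a != b:
            return i
    # No differing pair found; a length difference also counts as a mismatch.
    if len(file_words) != len(loaded_words):
        return min(len(file_words), len(loaded_words))
    return None


def read_vocabulary(path):
    """Read one word per line, skipping blank lines (assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Toy lists standing in for the file contents and the loaded vocabulary.
file_words = ["bill", "network", "signal", "slow"]
loaded_words = ["signal", "bill", "slow", "network"]

same_words = set(file_words) == set(loaded_words)
mismatch_at = first_order_mismatch(file_words, loaded_words)
print(same_words, mismatch_at)
```

Checking the word sets separately from the order helps tell apart "the file was not loaded" from "the file was loaded but reordered".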

Thanks for your patience.

Silvia

Zay-Ben commented 1 year ago

Dear Dr. Silvia,

Thank you for taking the time to address my questions.

Regarding the first question, the results show that the order of the vocabulary differs before and after loading it with OCTIS. The vocabulary was sorted alphabetically before loading and appears to be shuffled randomly afterwards, as shown in the image with the first five terms of each vocabulary.

Regarding the second question, I trained two models (ETM and NMF) using the same dataset and found that the problem persists for NMF, but not for ETM, as shown in the figure below. I noticed that OCTIS's LDA and NMF are both from Gensim. Could this be the source of the error?

ETM:

NMF:
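One way to test the Gensim hypothesis is to realign the matrix columns using the model's own word-to-index mapping instead of the dataset vocabulary. The sketch below uses a plain dict as a stand-in for such a mapping (Gensim's `Dictionary` exposes one as `token2id`). Whether the LDA and NMF wrappers in OCTIS order their columns this way is an assumption, not something confirmed in this thread, and `realign_columns` is a hypothetical helper.

```python
def realign_columns(topic_word_matrix, token2id, vocabulary):
    """Reorder columns so that column j corresponds to vocabulary[j].

    token2id maps each word to its column index in the input matrix.
    """
    columns = [token2id[word] for word in vocabulary]
    return [[row[c] for c in columns] for row in topic_word_matrix]


# Toy data: the model stored its columns as signal, bill, slow, network.
token2id = {"signal": 0, "bill": 1, "slow": 2, "network": 3}
vocabulary = ["bill", "network", "signal", "slow"]  # alphabetical order
matrix = [[0.60, 0.10, 0.10, 0.20]]

aligned = realign_columns(matrix, token2id, vocabulary)
print(aligned)
```

If the top words computed from the realigned matrix agree with `output["topics"]`, the discrepancy is a column-ordering issue rather than a modeling one.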

Just to give context, the dataset consists of tweets that contain customer complaints about telecommunication companies.

Thank you again for your help! Topic modeling has never been easy without OCTIS. 😭

silviatti commented 1 year ago

Hi, just to double-check: when you load the custom dataset, do you have a file called vocabulary.txt in the dataset folder? That should be the vocabulary file where the words are sorted alphabetically. I ask because I noticed that your file is called "words.txt", so it is possible that OCTIS doesn't load it.

Let me know :)
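A quick way to run this check is sketched below. It only tests for the file name mentioned above, `vocabulary.txt`. `missing_files` is a hypothetical helper, and the temporary folder stands in for a real dataset folder.

```python
from pathlib import Path
import tempfile


def missing_files(folder, expected=("vocabulary.txt",)):
    """Return the expected file names that are absent from the folder."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]


# Demonstration on a temporary folder that initially contains words.txt only.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "words.txt").write_text("bill\nnetwork\n", encoding="utf-8")
    missing_before = missing_files(tmp)
    (Path(tmp) / "vocabulary.txt").write_text("bill\nnetwork\n", encoding="utf-8")
    missing_after = missing_files(tmp)

print(missing_before, missing_after)
```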