MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

Mapping between vocabulary and columns in topic-word-matrix #73

Closed rjsu26 closed 1 year ago

rjsu26 commented 1 year ago

Description

I want to take a search query from a user and, based on that query, return the top 5 topics (out of the 50 generated by running the LDA model) that best match it.

What I Did

For this task, I created an all-zero list with one entry per word in vocabulary.txt and set the entries at the indices of the query words to 1, i.e.

search_vec = [0] * len(vocabulary)
for word in query:
    if word in vocabulary:
        idx = vocabulary.index(word)
        search_vec[idx] = 1
# N-hot encoding complete

I then ran some nearest-neighbor functions, using the topic-word-matrix as the data and search_vec as the query vector. The problem, as I found out, is that the order of words in my vocabulary list is not the same as the order of the columns in the topic-word-matrix.

How do I get that ordering? Is there a method that returns, for a given word, the index of its column in the topic-word-matrix?

silviatti commented 1 year ago

Hello, when you train a topic model, you initialize the dataset first. This dataset has a vocabulary, and its indices correspond to the columns of the topic-word-matrix. You can get it in the following way:

from octis.dataset.dataset import Dataset

dataset = Dataset()
dataset.load_custom_dataset_from_folder("dataset_folder")  # or your preferred way to initialize the dataset
vocabulary = dataset.get_vocabulary()  # word order matches the topic-word-matrix columns

Hope this helped. Thanks for your patience,

Silvia
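
For readers with the same use case, here is a minimal sketch of ranking topics against a query once the vocabulary is aligned. It assumes, as stated in the reply above, that column j of the topic-word-matrix corresponds to vocabulary[j]. The function name rank_topics, the toy vocabulary, and the toy matrix are hypothetical and are not part of OCTIS; in practice the vocabulary would come from dataset.get_vocabulary() and the matrix from the trained model. Cosine similarity is used here as one possible choice of similarity measure, not necessarily the one used in the original question.

```python
import math


def rank_topics(query_tokens, vocabulary, topic_word_matrix, top_k=5):
    """Return up to top_k (topic_index, cosine_similarity) pairs for a query.

    Assumption: column j of topic_word_matrix corresponds to vocabulary[j].
    """
    # Build the word -> column lookup once instead of calling list.index per word.
    word_to_idx = {word: i for i, word in enumerate(vocabulary)}

    # Column indices of the query words; a set gives n-hot (0/1) semantics.
    query_idx = {word_to_idx[w] for w in query_tokens if w in word_to_idx}
    if not query_idx:
        return []  # no query word is in the vocabulary

    query_norm = math.sqrt(len(query_idx))
    scores = []
    for topic_id, row in enumerate(topic_word_matrix):
        row_norm = math.sqrt(sum(float(x) * float(x) for x in row))
        if row_norm == 0.0:
            continue  # skip empty topics to avoid division by zero
        # Dot product with an n-hot vector is the sum of the selected columns.
        dot = sum(float(row[j]) for j in query_idx)
        scores.append((topic_id, dot / (row_norm * query_norm)))

    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy data standing in for the real vocabulary and topic-word-matrix.
vocabulary = ["cat", "dog", "stock", "market"]
topic_word_matrix = [
    [0.6, 0.4, 0.0, 0.0],      # topic 0: mostly animal words
    [0.0, 0.0, 0.5, 0.5],      # topic 1: mostly finance words
    [0.25, 0.25, 0.25, 0.25],  # topic 2: evenly mixed
]

ranked = rank_topics(["stock", "market", "unknown"], vocabulary, topic_word_matrix, top_k=2)
print(ranked)
```

Words missing from the vocabulary (such as "unknown" above) are ignored, which mirrors the `if word in vocabulary` check in the original snippet.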