bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License
557 stars 62 forks

Removing a topic from a HDPModel #152

Closed: bertomartin closed this issue 1 year ago

bertomartin commented 2 years ago

Hi, I have an HDP model and I was wondering whether there is an easy way to remove a topic from it. For instance, it is easy to check whether a topic is "live" or "dead", but can you update the model so that it no longer includes the dead topics and then re-save the model artifact? I guess this would also involve removing the tomotopy documents associated with the dead topics.

bab2min commented 2 years ago

Hi @bertomartin. As you know, tomotopy currently has no feature for removing dead topics from HDP models. This is because a topic can switch between dead and live during training, so removing dead topics in the middle of training would cause frequent reallocations and slow down the whole training procedure. But removing dead topics after training has finished seems like a pretty reasonable request. I'll try to implement it in the next update.

bab2min commented 2 years ago

Blueprint of the purge_dead_topics method of tomotopy.HDPModel:

import tomotopy as tp

model = tp.HDPModel(...)
...
model.train(...)

# model may have a lot of dead topics at this point, e.g.
#  0: live topic
#  1: live topic
#  2: dead topic
#  3: live topic
#  4: dead topic
#  5: dead topic

# purge all dead topics and relocate the live topics to contiguous ids.
relocate_result = model.purge_dead_topics()

# `relocate_result` is an array where `relocate_result[i]` holds the new topic id of old topic `i`, or -1 if old topic `i` was purged.
# e.g. [0, 1, -1, 2, -1, -1]

assert model.k == model.live_k
# at this point, `model.k` should be equal to `model.live_k`, e.g. model.k == 3, model.live_k == 3

bertomartin commented 2 years ago

@bab2min thanks for the response. Yes, I meant purging them after the model has been built (training already completed). Your blueprint makes sense to me. What I'm really after is a contiguous set of clean topics, so that I can compute topic similarity without querying a 'dead' topic. The same goes for output in pyLDAvis: I don't want to see the dead topics, as they don't add anything...
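For anything that was keyed by the old topic ids before purging (labels, notes, cached similarities), the relocation array described in the blueprint could be applied roughly as in the sketch below. It assumes only what the blueprint states: position `i` holds the new id of old topic `i`, or -1 if that topic was purged. The helper names `remap_topic_id` and `surviving_old_ids` and the example labels are hypothetical and are not part of tomotopy.

```python
from typing import List, Optional, Sequence


def remap_topic_id(old_id: int, relocate_result: Sequence[int]) -> Optional[int]:
    """Return the new id for `old_id`, or None if that topic was purged."""
    new_id = relocate_result[old_id]
    return None if new_id < 0 else new_id


def surviving_old_ids(relocate_result: Sequence[int]) -> List[int]:
    """List the old topic ids that survived, ordered by their new ids."""
    pairs = [(new, old) for old, new in enumerate(relocate_result) if new >= 0]
    return [old for _, old in sorted(pairs)]


# Example array taken from the blueprint: old topics 2, 4 and 5 were dead.
relocate_result = [0, 1, -1, 2, -1, -1]

# Hypothetical labels attached to the old topic ids before purging.
old_labels = {0: "sports", 1: "politics", 3: "science"}

# Re-key the labels by the new, contiguous topic ids.
new_labels = {}
for old_id, label in old_labels.items():
    new_id = remap_topic_id(old_id, relocate_result)
    if new_id is not None:
        new_labels[new_id] = label

print(new_labels)
print(surviving_old_ids(relocate_result))
```

Returning `None` for purged topics, rather than passing -1 through, keeps a purged id from being used by accident as a negative list index.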

bertomartin commented 2 years ago

Thank you! In the meantime, I was wondering whether I could somehow filter out these topics when building the pyLDAvis display. The plan is to construct the display inputs as below:

topic_term_dists = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
topic_term_dists = topic_term_dists / topic_term_dists.sum(axis=1)[:, None]
doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)
doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = mdl.used_vocab_freq

The problem is that the docs are not related to K, or at least I don't see how to relate them. Ideally, I would keep only the docs that occur in live topics to get this to work.

bab2min commented 2 years ago

@bertomartin You can filter out dead topics using numpy indexing like:

import numpy as np

live_topics = [k for k in range(mdl.k) if mdl.is_live_topic(k)] # topics you want to visualize

topic_term_dists = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
topic_term_dists = topic_term_dists[live_topics] # select only `live_topics`
topic_term_dists /= topic_term_dists.sum(axis=1, keepdims=True)

doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
doc_topic_dists = doc_topic_dists[:, live_topics] # select only `live_topics`
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)

doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = mdl.used_vocab_freq
...

I uploaded a new example that combines pyLDAvis with HDPModel: https://github.com/bab2min/tomotopy/blob/main/examples/hdp_visualization.py
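The select-then-renormalize step used for both matrices above can be illustrated on a toy matrix. In this minimal sketch, plain Python lists stand in for the numpy arrays so that it runs without tomotopy or numpy installed. The function name `select_and_renormalize` and the toy numbers are hypothetical. The guard for rows with no mass on the kept topics is an addition that is not in the snippet above, where such a row would turn into NaN after the division.

```python
from typing import List, Sequence


def select_and_renormalize(
    rows: Sequence[Sequence[float]], keep: Sequence[int]
) -> List[List[float]]:
    """Keep only the columns in `keep`, then rescale each row to sum to 1."""
    result = []
    for row in rows:
        kept = [row[k] for k in keep]
        total = sum(kept)
        # Rows with no mass on the kept topics are left as zeros
        # instead of being divided by zero.
        if total > 0:
            kept = [value / total for value in kept]
        result.append(kept)
    return result


# Toy document-topic matrix: 2 documents, 4 topics.
doc_topic = [
    [0.5, 0.25, 0.25, 0.0],
    [0.25, 0.5, 0.0, 0.25],
]
live_topics = [0, 2]  # stand-in for the is_live_topic(k) filter

filtered = select_and_renormalize(doc_topic, live_topics)
for row in filtered:
    print(row, sum(row))
```

After filtering, each row again sums to 1 over the live topics only, which is the form the visualization inputs are given in the snippet above.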

bertomartin commented 2 years ago

Sweet! I figured out a hacky way but this looks better. Thank you!