georgetown-cset / unicorn-topics


Topic validation and model choice #9

Closed Rahkovsky closed 3 years ago

Rahkovsky commented 4 years ago

The biggest issue is that we don't know what constitutes a good vs. bad topic model, so I am not sure how to choose between LDA and NMF, and I am not sure how to optimize the hyperparameters. I suggest using science_map.dc5_cluster_assignment_stable to validate clusters. You can use the following metrics:

Metric 1: Extract all paper-pairs that share a topic in the model you are testing. For each paper-pair, test whether they also share the same cluster. Report the share of paper-pairs from the same topic that have the same cluster.

Metric 2: Extract all paper-pairs that share the same cluster in science_map. For each paper-pair, test whether they also share the same topic in the model you are testing. Report the share of paper-pairs from the same cluster that have the same topic.

Average the results (see the sketch below): Metric = (Metric1 + Metric2) / 2
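A minimal sketch of how these pair-based shares could be computed, assuming each paper has a single assigned topic and a single science_map cluster; the `pair_agreement` helper and the `topic_of` / `cluster_of` dicts are hypothetical, not part of this repo:

```python
from collections import defaultdict
from itertools import combinations

def pair_agreement(assign_a, assign_b):
    """Share of paper-pairs grouped together by assignment A that are
    also grouped together by assignment B.

    assign_a, assign_b: dicts mapping paper_id -> label.
    """
    groups = defaultdict(list)
    for paper, label in assign_a.items():
        groups[label].append(paper)

    total = agree = 0
    for papers in groups.values():
        # Every unordered pair within a group shares a label under A.
        for p, q in combinations(papers, 2):
            total += 1
            if p in assign_b and q in assign_b and assign_b[p] == assign_b[q]:
                agree += 1
    return agree / total if total else 0.0

# Hypothetical inputs: topic_of and cluster_of map paper_id -> assignment.
# metric1 = pair_agreement(topic_of, cluster_of)  # same topic -> same cluster?
# metric2 = pair_agreement(cluster_of, topic_of)  # same cluster -> same topic?
# metric = (metric1 + metric2) / 2
```

Note that enumerating pairs is quadratic in group size, so for a large corpus the same shares are better computed from co-occurrence counts of (topic, cluster) labels.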

Next we need to do a hyperparameter search over LDA and NMF, maximizing this Metric. Rank your topic solutions according to this metric and pick the one with the highest score. It's OK if the overlap is low; we don't need an exact replication of the clustering map, we just need a way to rank and optimize topic estimation.
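One way this search could be wired up, sketched with scikit-learn's LatentDirichletAllocation and NMF and the hypothetical `pair_agreement` helper above; the grid of topic counts is illustrative:

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation

def fit_and_assign(model, doc_term_matrix, paper_ids):
    """Fit a topic model and assign each paper its highest-weight topic."""
    doc_topics = model.fit_transform(doc_term_matrix)
    return {pid: int(row.argmax()) for pid, row in zip(paper_ids, doc_topics)}

# Illustrative grid; doc_term_matrix, paper_ids, and cluster_of are assumed
# to exist (cluster_of built from science_map.dc5_cluster_assignment_stable).
# best = None
# for n_topics in (25, 50, 100, 200):
#     for model in (LatentDirichletAllocation(n_components=n_topics, random_state=0),
#                   NMF(n_components=n_topics, init="nndsvd", random_state=0)):
#         topic_of = fit_and_assign(model, doc_term_matrix, paper_ids)
#         score = (pair_agreement(topic_of, cluster_of)
#                  + pair_agreement(cluster_of, topic_of)) / 2
#         if best is None or score > best[0]:
#             best = (score, model)
```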

rggelles commented 3 years ago

We are now using topic coherence and perplexity to validate/optimize.
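For reference, a minimal sketch of that approach using gensim; the `texts` input and the choice of c_v coherence are assumptions, not necessarily what this repo uses:

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

def evaluate_lda(texts, num_topics):
    """Fit an LDA model and report c_v coherence (higher is better)
    and perplexity (lower is better)."""
    dictionary = Dictionary(texts)  # texts: list of tokenized documents
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, random_state=0)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    # log_perplexity returns a per-word likelihood bound; convert to perplexity.
    perplexity = 2 ** (-lda.log_perplexity(corpus))
    return coherence, perplexity
```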