drob-xx / TopicModelTuning

Companion code and data for a Medium article - https://towardsdatascience.com/use-metrics-to-determine-lda-topic-model-size-1a1feaa1ff3c
GNU General Public License v3.0

TopicModelTuning

This repository has code that parallels the article Using Metrics to Determine The Right LDA Topic Model Size. Users can run the notebook and re-create, step by step, the procedures described in the article.

To run the code presented here, follow this outline (details in the cells below):

  1. Download two csv files from the GitHub repository into a directory accessible to the notebook.
  2. Download the text DB csv file from Kaggle.
  3. Assign the global directory value (DATA_DIR) to the location of the above files.
  4. Install the required packages.
  5. Execute the imports.
  6. Run the cells containing Python function definitions used in the notebook.
  7. Generate the six models used in the evaluation. This should take about 15 minutes on a standard Google Colab account. You can save the models for later use if desired.
  8. Run the evaluation code.

Download CSV Files

There are three csv files that are needed to run this notebook:

In the GitHub repository:

On Kaggle:

ModelRunMetrics contains the metrics from 90 LDA runs and can be used to re-create and explore the data from the article.

NewsDF is a copy of the 30,000-article DB that has both the original text and pre-processed versions of the articles. You will need it to run your own models and to explore the text that the models are built on.

It is recommended that you place all of these files in a single location accessible to the Colab notebook and set the DATA_DIR variable to that location.
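Before running the rest of the notebook, it can help to confirm that the files are actually where DATA_DIR points. The following is a minimal sketch, not code from the notebook: the `missing_files` helper, the `data` path, and the file names in `EXPECTED_FILES` are all hypothetical placeholders, so substitute the names of the files you actually downloaded.

```python
from pathlib import Path

# Assumption: point this at the directory where you stored the downloaded files
# (for example, a folder on a mounted drive in Colab).
DATA_DIR = Path("data")

# Hypothetical file names; replace them with the actual names of your CSV files.
EXPECTED_FILES = ["ModelRunMetrics.csv", "NewsDF.csv"]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(DATA_DIR)
    if missing:
        print(f"Missing from {DATA_DIR}: {', '.join(missing)}")
    else:
        print(f"All expected files found in {DATA_DIR}")
```

An empty list from `missing_files(DATA_DIR)` means every listed file was found, so the cells that load the data should not fail on a bad path.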