derekgreene / topic-model-tutorial

Tutorial on topic models in Python with scikit-learn

topic-model-tutorial

This repository contains notebooks, slides, and data for the short tutorial "Topic modelling with Scikit-learn", presented at PyData Dublin in September 2017.

Contents

The tutorial is summarised in these slides. There are three associated IPython notebooks:

  1. Text Preprocessing: Provides a basic introduction to preprocessing documents with scikit-learn.
  2. NMF Topic Models: Covers the application and interpretation of topic models via the NMF implementation provided by scikit-learn.
  3. Parameter Selection for NMF: More advanced material on selecting the number of topics for NMF, using topic coherence.
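
As a rough illustration of the first two notebooks, the sketch below builds a TF-IDF document-term matrix and fits an NMF model with scikit-learn. It is not taken from the notebooks: the toy corpus, the choice of `k = 2`, the `top_terms` helper, and all parameter values are assumptions made for this example.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only, not the tutorial's sample dataset).
documents = [
    "The striker scored a late goal to win the football match",
    "The football club signed a new striker before the match",
    "The goalkeeper saved a penalty in the cup match",
    "The central bank raised interest rates to curb inflation",
    "Rising inflation pushed the bank to change interest rates",
    "Investors expect the bank to cut rates as inflation falls",
]

# Preprocessing: lowercase, tokenize, drop English stop words, apply TF-IDF weights.
vectorizer = TfidfVectorizer(stop_words="english", min_df=1)
A = vectorizer.fit_transform(documents)  # shape: (n_documents, n_terms)

# Recover the term for each column index from the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Factorize A ~ W @ H with k topics (k = 2 is an assumption for this toy corpus).
k = 2
model = NMF(n_components=k, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(A)  # document-topic weights, shape (n_documents, k)
H = model.components_       # topic-term weights, shape (k, n_terms)


def top_terms(H, terms, topic_index, top_n=5):
    """Return the top_n highest-weighted terms for one topic."""
    ranked = H[topic_index].argsort()[::-1][:top_n]
    return [terms[i] for i in ranked]


for topic_index in range(k):
    print(f"Topic {topic_index + 1}: {', '.join(top_terms(H, terms, topic_index))}")
```

Each row of `H` is read as a topic descriptor (its highest-weighted terms), and each row of `W` shows how strongly a document is associated with each topic.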

To demonstrate the topic modelling techniques, a sample dataset is provided here. This consists of 4,551 news articles collected from the Guardian News API in 2016, stored in a single text file (25MB), with one article per line.
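
Since the dataset stores one article per line, it can be read with a few lines of standard-library Python. The following is a minimal sketch: the `load_documents` helper is hypothetical, and UTF-8 encoding is an assumption rather than something stated in this README.

```python
from pathlib import Path


def load_documents(path, encoding="utf-8"):
    """Read a corpus file with one document per line, skipping blank lines."""
    documents = []
    with Path(path).open("r", encoding=encoding) as handle:
        for line in handle:
            text = line.strip()
            if text:
                documents.append(text)
    return documents
```

Calling `load_documents` with the path to the downloaded file should return a list of article strings, which can then be passed to a scikit-learn vectorizer.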

Dependencies

This code has been tested with Python 3.6-3.8. The core package requirements are:

The model selection code also relies on the gensim package to build a Word2Vec model (tested with v4.1.2). A pre-built Word2Vec model for the sample dataset is provided here for download (71MB).
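
To give a sense of how a Word2Vec model can feed into topic coherence, the sketch below scores a topic by the mean pairwise similarity of its top terms, which is one common formulation and may differ in detail from the notebook. The helper names (`train_word2vec`, `topic_coherence`, `mean_coherence`) and the parameter values are assumptions for illustration; the gensim calls follow the 4.x API.

```python
from itertools import combinations


def train_word2vec(tokenized_docs, vector_size=100, min_count=1, seed=0):
    """Train a skip-gram Word2Vec model on tokenized documents (gensim 4.x API)."""
    # Imported lazily so the coherence helpers below work without gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec(sentences=tokenized_docs, vector_size=vector_size,
                    min_count=min_count, sg=1, seed=seed, workers=1)


def topic_coherence(top_terms, similarity):
    """Mean similarity over all pairs of a topic's top terms."""
    pairs = list(combinations(top_terms, 2))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def mean_coherence(topics, similarity):
    """Average coherence across all topics of one model."""
    return sum(topic_coherence(t, similarity) for t in topics) / len(topics)
```

With a trained model `w2v`, passing `w2v.wv.similarity` as the `similarity` argument scores each topic's top terms. Computing `mean_coherence` for NMF models fitted with different numbers of topics then gives one way to compare candidate values of k. Terms missing from the Word2Vec vocabulary raise a `KeyError`, so the vocabulary settings should match those used for the topic model.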

Links and References