Supervised Latent Dirichlet Allocation for Classification

This is a C++ implementation of supervised latent Dirichlet allocation (sLDA) for classification.

Note that this code depends on the GNU Scientific Library.

Compiling

git clone https://github.com/chbrown/slda
cd slda
make

You may need to install the GNU Scientific Library (GSL) first. For example, on macOS with Homebrew:

brew install gsl

Estimation

Estimate the model by executing:

slda est <data path> <label path> <settings path> <alpha> <k> <initialization> <output directory>

<data path> should point to a single file containing your training data.
- Each line of this file should have the form:
```
<M> <term_1>:<count> <term_2>:<count> ... <term_N>:<count>
```
- Here <M> is the number of unique terms in the document, and the <count> associated with each term is the number of times that term appears in the document. (For an example, see test/images/train-data.dat.)
<label path> should point to a file of class labels, one per document.
- Each line should contain a single integer label in the range 0 to C-1, where C is the number of classes.
- This file should have the same number of lines as the file specified by <data path>.
<settings path> should point to a file with various settings, e.g., settings.txt
<alpha> is a floating-point hyperparameter: the parameter of the Dirichlet prior on per-document topic proportions.
<k> is the number of topics
<initialization> specifies the initialization method. There are three options:
- "seeded"
- "random"
- <model path> (a path to some pre-existing model)
<output directory> should point to a directory where the estimator's output will be stored. This directory will be created if it does not already exist.
- The estimator outputs models in two types of files:
  - <iteration>.model is the model saved in binary format, which is fast and easy to load for inference.
  - <iteration>.model.text is the model saved in text format, which is convenient for printing topics or for further analysis with a scripting language.
- It also produces variational posterior Dirichlets in a file called:
  - <iteration>.gamma
- Running the estimator on the 8-class image dataset produces the following files in the output directory:
```
010.gamma
010.model
010.model.text
020.gamma
020.model
020.model.text
final.gamma
final.model
final.model.text
likelihood.dat
word-assignments.dat
```
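
To make the <data path> line format concrete, here is a minimal C++ sketch of a parser for one such line. It is not part of slda: the function name parse_document_line is hypothetical, and the sketch assumes that terms are non-negative integer ids and that the number of term:count pairs on a line equals <M>.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of slda). Parses one data line of the form
//   <M> <term_1>:<count> ... <term_M>:<count>
// into (term id, count) pairs. Assumes terms are non-negative integer ids.
// Returns false if the line is malformed or the pair count differs from <M>.
bool parse_document_line(const std::string& line,
                         std::vector<std::pair<int, int>>& terms) {
    terms.clear();
    std::istringstream in(line);

    int declared = 0;  // the leading <M> field
    if (!(in >> declared) || declared < 0) return false;

    std::string token;
    while (in >> token) {
        const std::size_t colon = token.find(':');
        if (colon == std::string::npos) return false;
        assert(colon < token.size());  // find() returned a valid index
        const std::string term_text = token.substr(0, colon);
        const std::string count_text = token.substr(colon + 1);
        try {
            std::size_t used = 0;
            const int term = std::stoi(term_text, &used);
            if (used != term_text.size()) return false;  // trailing junk
            const int count = std::stoi(count_text, &used);
            if (used != count_text.size()) return false;  // trailing junk
            if (term < 0 || count <= 0) return false;
            terms.emplace_back(term, count);
        } catch (const std::exception&) {
            return false;  // empty, non-numeric, or out-of-range field
        }
    }
    return terms.size() == static_cast<std::size_t>(declared);
}
```

Calling this on every line of a training file before estimation is one way to catch formatting problems early.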

Example usage:

./slda est test/images/train-data.dat test/images/train-label.dat \
    settings.txt 1.0 10 random tmp/
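
Before running estimation, it can help to check that the label file is consistent with the data file, as the <label path> description requires. The following C++ sketch is one way to do that. It is not part of slda: the function name labels_match_data is hypothetical, and the number of classes is assumed to be known to the caller.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical pre-flight check (not part of slda). Returns true when
// `labels` holds exactly one integer in [0, num_classes) per line and has
// the same number of lines as `data`.
bool labels_match_data(std::istream& data, std::istream& labels,
                       int num_classes) {
    assert(num_classes > 0);
    std::string data_line;
    std::string label_line;
    while (true) {
        const bool has_data = static_cast<bool>(std::getline(data, data_line));
        const bool has_label = static_cast<bool>(std::getline(labels, label_line));
        if (!has_data && !has_label) return true;   // both ended together
        if (has_data != has_label) return false;    // line counts differ

        std::istringstream fields(label_line);
        int label = 0;
        std::string extra;
        if (!(fields >> label)) return false;       // not an integer
        if (fields >> extra) return false;          // more than one field
        if (label < 0 || label >= num_classes) return false;
    }
}
```

For the sample image dataset, the two streams could be std::ifstream objects opened on test/images/train-data.dat and test/images/train-label.dat, with num_classes set to 8.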

Inference

To perform inference on a different set of data (in the same format as for estimation), execute:

slda inf <data path> <label path> <settings path> <model path> <output directory>

<data path>, <label path>, and <settings path> are all the same as in the estimation step.
<model path> is the binary final.model file from the estimation step.
<output directory> is the directory where the inference output, including the predicted labels, will be stored.
- Each output file has one line per input document.
  - inf-gamma.dat contains the variational posterior Dirichlets
  - inf-labels.dat contains the predicted labels
  - inf-likelihood.dat contains each document's likelihood

Example usage:

./slda inf test/images/test-data.dat test/images/test-label.dat \
    settings.txt tmp/final.model tmp/

This also prints a final line reporting the accuracy of the predictions against the labels specified in the <label path> argument:

average accuracy: 0.679
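
The accuracy figure can be recomputed from the output files. The C++ sketch below is not part of slda: the names read_labels and accuracy are hypothetical, and it assumes that inf-labels.dat holds one integer label per line and that the reported figure is the fraction of documents whose predicted label equals the true label.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <vector>

// Hypothetical helper (not part of slda). Reads whitespace-separated
// integer labels from a stream until the first read failure.
std::vector<int> read_labels(std::istream& in) {
    std::vector<int> labels;
    int label = 0;
    while (in >> label) labels.push_back(label);
    return labels;
}

// Fraction of positions where the predicted label equals the true label.
// Assumption: this matches what slda reports as "average accuracy".
double accuracy(const std::vector<int>& predicted,
                const std::vector<int>& actual) {
    assert(predicted.size() == actual.size() && !actual.empty());
    std::size_t correct = 0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        if (predicted[i] == actual[i]) ++correct;
    }
    return static_cast<double>(correct) / static_cast<double>(actual.size());
}
```

With the example commands above, the predicted labels would come from tmp/inf-labels.dat and the true labels from test/images/test-label.dat, each read through a std::ifstream.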

Sample data

The sample data in test/images was downloaded from http://www.cs.cmu.edu/~chongw/data/images.tgz on July 12, 2013.

Description of data from original site:

A preprocessed 8-class image dataset from Labelme.

UIUC Sports annotation files: annotations and meta information. The source image files can be found here. (Note: there might be some discrepancies and I don't seem to know why...)

License

Licensed under both the GPL v2 and GPL v3, as well as any future version of the GNU General Public License.
