This is a C++ implementation of supervised latent Dirichlet allocation (sLDA) for classification.
Note that this code depends on the GNU Scientific Library.
git clone https://github.com/chbrown/slda
cd slda
make
You may need to install GSL first, e.g., on a Mac:
brew install gsl
Estimate the model by executing:
slda est <data path> <label path> <settings path> <alpha> <k> <initialization> <output directory>
<data path> should point to a single file containing your training data. Each line of this file should have the form:
<M> <term_1>:<count> <term_2>:<count> ... <term_N>:<count>
where <M> is the number of unique terms in the document, and the <count> associated with each term is how many times that term appeared in the document. (For an example, see test/images/train-data.dat.)
<label path> points to a file of labels for the documents in <data path>.
<settings path> should point to a file with various settings, e.g., settings.txt.
<alpha> is a floating-point hyperparameter (a prior).
<k> is the number of topics.
<initialization> specifies the initialization method. There are three options, including:
  random (as in the example usage below)
  <model path> (a path to some pre-existing model)
<output directory> should point to a directory where the estimator's output will be stored. This directory will be created if it does not already exist.
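The data format described above can be read and written with a few lines of C++. This is a minimal sketch, not code from this repository: `parse_document_line` and `format_document_line` are hypothetical helper names, and the sketch assumes that terms are integer ids and that <M> equals the number of term:count pairs on the line.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of slda): parse one line of the form
//   <M> <term_1>:<count> <term_2>:<count> ...
// into a map from term id to count. Assumes integer term ids.
std::map<int, int> parse_document_line(const std::string& line) {
    std::istringstream in(line);
    std::size_t declared = 0;
    if (!(in >> declared)) {
        throw std::runtime_error("missing unique-term count <M>");
    }
    std::map<int, int> counts;
    std::string token;
    while (in >> token) {
        const std::size_t colon = token.find(':');
        if (colon == std::string::npos) {
            throw std::runtime_error("expected <term>:<count>, got " + token);
        }
        const int term = std::stoi(token.substr(0, colon));
        const int count = std::stoi(token.substr(colon + 1));
        counts[term] += count;
    }
    // Assumption: <M> must match the number of distinct terms on the line.
    if (counts.size() != declared) {
        throw std::runtime_error("<M> does not match the number of unique terms");
    }
    return counts;
}

// Hypothetical helper: write a term-count map back out in the same format.
std::string format_document_line(const std::map<int, int>& counts) {
    std::ostringstream out;
    out << counts.size();
    for (const auto& entry : counts) {
        out << ' ' << entry.first << ':' << entry.second;
    }
    return out.str();
}
```

For example, parsing the line "3 0:2 5:1 9:4" yields three terms, and formatting the resulting map reproduces the same line because `std::map` keeps the term ids sorted.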
The output directory will contain the following files:
<iteration>.model is the model saved in binary format, which is easy and fast to use for inference.
<iteration>.model.text is the model saved in text format, which is convenient for printing topics or for further analysis using a scripting language.
<iteration>.gamma holds the variational posterior Dirichlets.
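The gamma files hold variational Dirichlet parameters, and a common way to interpret them is to normalize each document's parameters into expected topic proportions, since the mean of Dirichlet(gamma) is gamma_i divided by the sum of all gamma_j. The sketch below illustrates that step. The file layout is an assumption, not something this README specifies: it supposes one document per line with <k> whitespace-separated values. Both function names are hypothetical.

```cpp
#include <numeric>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: read one line of whitespace-separated gamma values.
// Assumption: each line of a .gamma file holds the k values for one document.
std::vector<double> parse_gamma_row(const std::string& line) {
    std::istringstream in(line);
    std::vector<double> gamma;
    double value = 0.0;
    while (in >> value) {
        gamma.push_back(value);
    }
    return gamma;
}

// Mean of a Dirichlet(gamma) distribution: E[theta_i] = gamma_i / sum_j gamma_j.
std::vector<double> expected_topic_proportions(const std::vector<double>& gamma) {
    const double total = std::accumulate(gamma.begin(), gamma.end(), 0.0);
    if (gamma.empty() || total <= 0.0) {
        throw std::invalid_argument("gamma must be non-empty with a positive sum");
    }
    std::vector<double> theta;
    theta.reserve(gamma.size());
    for (const double g : gamma) {
        theta.push_back(g / total);
    }
    return theta;
}
```

With a row such as "1.0 3.0 4.0", the proportions come out as 0.125, 0.375, and 0.5, which sum to one.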
Running the estimator on the 8-class image dataset produces the output:
010.gamma
010.model
010.model.text
020.gamma
020.model
020.model.text
final.gamma
final.model
final.model.text
likelihood.dat
word-assignments.dat
Example usage:
./slda est test/images/train-data.dat test/images/train-label.dat \
settings.txt 1.0 10 random tmp/
To perform inference on a different set of data (in the same format as for estimation), execute:
slda inf <data path> <label path> <settings path> <model path> <output directory>
<data path>, <label path>, and <settings path> are all the same as in the estimation step.
<model path> is the binary final.model file from the estimation step.
<output directory> is the directory where the predicted labels will be stored.
The output directory will contain:
inf-gamma.dat describes the variational posterior Dirichlets.
inf-labels.dat contains the predicted labels.
inf-likelihood.dat contains each document's likelihood.
Example usage:
./slda inf test/images/test-data.dat test/images/test-label.dat \
settings.txt tmp/final.model tmp/
This will also produce a final line of output, evaluating the predictions against the labels specified in the <label path> argument:
average accuracy: 0.679
The sample data in test/images was downloaded from http://www.cs.cmu.edu/~chongw/data/images.tgz on July 12, 2013.
A preprocessed 8-class image dataset from Labelme.
UIUC Sports annotation files: annotations and meta information. The source image files can be found here. (Note: there might be some discrepancies and I don't seem to know why...)
Copyright © 2009, Chong Wang, David Blei and Li Fei-Fei
Licensed under both the GPL v2 and GPL v3, as well as any future version of the GNU General Public License.