easselin / sofia-ml

Automatically exported from code.google.com/p/sofia-ml
0 stars 0 forks source link

sofia-ml

Project homepage: http://code.google.com/p/sofia-ml/

==Introduction==

The suite of fast incremental algorithms for machine learning (sofia-ml) can be used for training models for classification or ranking, using several different techniques. This release is intended to aid researchers and practitioners who require fast methods for classification and ranking on large, sparse data sets.

Supported learners include:

* Pegasos SVM
* Stochastic Gradient Descent (SGD) SVM
* Passive-Aggressive Perceptron
* Perceptron with Margins
* ROMMA 

These learners can be configured for classification and ranking, with several sampling methods available.

This implementation gives very fast training times. For example, 100,000 Pegasos SVM training iterations can be performed on data from the CCAT task from the RCV1 benchmark data set (with roughly 780,000 examples) in 0.1 CPU seconds on an ordinary 2.4GHz laptop, with no loss in classification performance compared with other SVM methods. On LETOR learning to rank benchmark tasks, training time with 100,000 Pegasos SVM rank steps complete 0.2 CPU seconds on an ordinary laptop.

The primary computational bottleneck is actually reading the data off of disk; sofia-ml reads and parses data from disk substantially faster than other SVM packages we tested. For example, sofia-ml can read and parse data nearly 10 times faster than the reference Pegasos implementation by Shalev-Shwartz, and nearly 3 times faster than svm_perf by Joachims.

This package provides a commandline utility for training models and using them to predict on new data, and also exposes an API for model training and prediction. The underlying libraries for data sets, weight vectors, and example vectors are also provided for researchers wishing to use these classes to implement other algorithms.

==Quick Start==

These quick-start instructions assume the use of the unix/linux commandline, with g++ installed. There are no external code dependencies.

Step 1 Check out the code:

svn checkout http://sofia-ml.googlecode.com/svn/trunk/sofia-ml sofia-ml-read-only

Step 2 Compile the code:

cd sofia-ml-read-only/src/ make ls ../sofia-ml

Executable should be in main sofia-ml-read-only directory.

If the above did not succeed, run the unit tests to help locate the problem:

make clean make all_test

Step 3 Test the code:

cd .. ./sofia-ml

This should display the set of commandline flags and descriptions.

Train a model on the demo training data.

./sofia-ml --learner_type pegasos --loop_type stochastic --lambda 0.1 --iterations 100000 --dimensionality 150000 --training_file demo/demo.train --model_out demo/model

This should display something like the following:

Reading training data from: demo/demo.train Time to read training data: 0.056134 Time to complete training: 0.075364 Writing model to: demo/model Done.

Test the model on the demo data.

./sofia-ml --model_in demo/model --test_file demo/demo.train --results_file demo/results.txt

Should display the following:

Reading model from: demo/model Done. Reading test data from: demo/demo.train Time to read test data: 0.046729 Time to make test prediction results: 0.000844 Writing test results to: demo/results.txt Done.

Examine a few results in the results file:

head -5 demo/results.txt

Format is: \t

file corresponds to the same line (in order) in the test file.

1.02114 1 1.18046 1 -1.24609 -1 -1.12822 -1 -1.41046 -1

Note that exact results may vary slightly because these algorithms train

by randomly sampling one example at a time.

Evaluate the results:

perl eval.pl demo/results.txt

Should display something like:

Results for demo/results.txt:

Accuracy 0.9880 (using threshold 0.00) (988/1000) Precision 0.9719 (using threshold 0.00) (311/320) Recall 0.9904 (using threshold 0.00) (311/314) ROC area: 0.999406

Total of 1000 trials.

Note that this evaluation script has limited functionality. For more

options, we recommend using the perf software by Rich Caruana (developed fo

the KDD Cup 2004), available at: http://kodiak.cs.cornell.edu/kddcup/software.html

==Data Format==

This package uses the popular SVM-light sparse data format.

: ... :\n qid: : ... :\n : ... :# Optional comment or extra data, following the optional "#" symbol.\n The feature id's are expected to be in ascending numerical order. The lowest allowable feature-id is 1 (0 is reserved for the bias term internally.) Any feature not specified is assumed to have value 0 to allow for sparse representation. The class label for test data is required but not used; it's okay to put in a dummy placeholder value such as 0 for test data. For binary-class classification problems, the training labels should be 1 or -1. For ranking problems, the labels may be any numeric value, with higher values being judged as "more preferred". Currently, the comment string is not used. However, it is available for use in other algorithms, and can also be useful to aid in bookkeeping of data files. Examples: # Class label is 1, feature 1 has value 1.2, feature 2 (not listed) has value 0, # and feature 3 has value -0.5. 1 1:1.2 3:-0.5 # Class label is -1, belongs to qid 3, and all feature values are zero except # for feature 5011 with value 1.2. -1 qid:3 5011:1.2 # Class label is -1, feature 1 has value 7, comment string is # "This example is especially interesting." -1 1:7 3:-0.5#This example is especially interesting. ==Commandline Details== File Input and Output --model_in * Read in a model from this file before doing any training or testing. --model_out * Write the model to this file when done with everything. --training_file * File to be used for training. When set, causes model training to occur. --test_file * File to be used for testing. When set, causes current model (either loaded from --model_in or trained from --training_file to be tested on test data. --results_file * File to which to write predictions, when --test_file is used. Results for each line are in the format \t