MichaelAquilina / Reddit-Recommender-Bot

Identifying Interesting Documents for Reddit using Recommender Techniques

One class Learning (OneClassSVM / SVDD) #41

Closed MichaelAquilina closed 10 years ago

MichaelAquilina commented 10 years ago

Investigate the use of one-class learning techniques to solve the data issue of not being able to properly represent "negative" examples.

MichaelAquilina commented 10 years ago

Scikit-learn already supports one-class classification with Support Vector Machines via the OneClassSVM class. This will be very useful for the task at hand and should be evaluated with some simple techniques.
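A minimal sketch of what that evaluation could look like. The data here is a random stand-in for document vectors (the real input would come from the corpus pipeline), and the `nu` value is illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy stand-in for document vectors (e.g. tf-idf rows).
rng = np.random.RandomState(0)
positive_train = rng.normal(size=(200, 10))

# nu upper-bounds the fraction of training errors and
# lower-bounds the fraction of support vectors.
clf = OneClassSVM(nu=0.1, kernel="linear")
clf.fit(positive_train)

# predict() returns +1 for inliers, -1 for outliers.
inlier_rate = np.mean(clf.predict(positive_train) == 1)
print("fraction of training points classified as inliers:", inlier_rate)
```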

MichaelAquilina commented 10 years ago

So far one-class learning is proving to be very inaccurate, but this could be because the parameters being chosen are not ideal. It is also possible that a larger amount of data is required to achieve better accuracy.

MichaelAquilina commented 10 years ago

Resources:

MichaelAquilina commented 10 years ago

OneClassSVM is based on Schölkopf's hyperplane solution. This is probably not ideal and likely why you are not getting very good results. Tax and Duin's extension improves on this by trying to enclose all training data within a hypersphere. This technique is called "Support Vector Data Description" (SVDD). Unfortunately, it is not implemented in the sklearn package.

MichaelAquilina commented 10 years ago

Todo list for implementation of SVDD in Python:

This is if you decide to implement SVDD as an alternative to OneClassSVM
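If you do go down that route, a rough starting point is to optimise Tax and Duin's dual directly with a general-purpose solver. This is only a sketch under a linear kernel with scipy's SLSQP (a proper QP solver would be the better tool); the `C` value and tolerances are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def fit_svdd(X, C=0.5):
    """Minimal linear-kernel SVDD sketch (Tax & Duin dual):
    max  sum_i a_i K_ii - sum_ij a_i a_j K_ij
    s.t. sum_i a_i = 1,  0 <= a_i <= C
    then recover the sphere centre and radius."""
    K = X @ X.T
    diag = np.diag(K)

    def neg_dual(a):
        return -(a @ diag - a @ K @ a)

    n = len(X)
    res = minimize(
        neg_dual,
        np.full(n, 1.0 / n),  # feasible starting point
        bounds=[(0.0, C)] * n,
        constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0},
        method="SLSQP",
    )
    alpha = res.x
    center = alpha @ X
    dists = np.linalg.norm(X - center, axis=1)
    # Points with 0 < a_i < C lie on the sphere boundary.
    boundary = (alpha > 1e-6) & (alpha < C - 1e-6)
    radius = dists[boundary].mean() if boundary.any() else dists.max()
    return center, radius

rng = np.random.RandomState(0)
X = rng.normal(size=(60, 5))
center, radius = fit_svdd(X)
print("radius:", round(float(radius), 3))
```

New points would then be classified as inliers when their distance to `center` is at most `radius`, mirroring the hypersphere description above.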

MichaelAquilina commented 10 years ago

The accuracy problem with OneClassSVM (and potentially SVDD) is probably due to the fact that we are working with extremely high-dimensional data. See this Stack Overflow post for a discussion of why.

MichaelAquilina commented 10 years ago

subreddit=python

OneClassSVM(nu=0.001, kernel='linear') ~88% accuracy
OneClassSVM(nu=0.01, kernel='linear') ~86% accuracy
OneClassSVM(nu=0.05, kernel='linear') ~83% accuracy
OneClassSVM(nu=0.1, kernel='linear') ~80% accuracy
OneClassSVM(nu=0.4, kernel='linear') ~60% accuracy
OneClassSVM(nu=0.9, kernel='linear') ~0.9% accuracy

Please note that the results above are evaluated on positive datasets only. Later tests have shown that setting nu to such a low value results in very poor performance when classifying negative data.

In general, a smaller nu leads to greater accuracy on positive examples, but this is an extremely biased evaluation because no negative data is being considered.
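The sweep itself is straightforward to reproduce. The random data below stands in for real tf-idf document vectors, and the reported "accuracy" is deliberately the same biased metric discussed above, i.e. the inlier rate on held-out positives only:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy positive-only vectors standing in for tf-idf document features.
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 20))
train, held_out = X[:200], X[200:]

# "Accuracy" = inlier rate on held-out *positive* examples only.
results = {}
for nu in (0.001, 0.01, 0.05, 0.1, 0.4, 0.9):
    clf = OneClassSVM(nu=nu, kernel="linear").fit(train)
    results[nu] = np.mean(clf.predict(held_out) == 1)
    print(f"nu={nu}: positive inlier rate {results[nu]:.1%}")
```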

MichaelAquilina commented 10 years ago

Interesting to note that aggressive pruning seems to improve ML performance for one-class classification.

This suggests that the dimensionality of the document representation should be kept to a minimum. Additional techniques such as ignoring numbers and conflating domains could also help performance.

Once again, this observation is slightly biased because no negative data is being considered for the evaluation of accuracy at this stage.
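Pruning of that kind maps directly onto vectorizer options. A sketch using scikit-learn's TfidfVectorizer, where all the thresholds (and the tiny corpus) are purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "python generators make lazy iteration easy in python 3",
    "asyncio brings cooperative multitasking to python",
    "decorators wrap functions without changing their interface",
]

# Aggressive pruning: drop English stop words and purely numeric tokens,
# and cap the vocabulary size. Thresholds here are illustrative only.
vectorizer = TfidfVectorizer(
    stop_words="english",
    token_pattern=r"(?u)\b[a-zA-Z]{2,}\b",  # ignore numbers entirely
    max_features=5000,                       # hard cap on dimensionality
    min_df=1,                                # raise this on a real corpus
)
X = vectorizer.fit_transform(docs)
print(X.shape, sorted(vectorizer.vocabulary_))
```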

MichaelAquilina commented 10 years ago

Initial tests show that a low nu gives very poor performance (<1%) on negative examples during evaluation.

MichaelAquilina commented 10 years ago

Another alternative to consider is PU (Positive-Unlabeled) Learning, which allows supervised learning techniques to be adapted to work with "one class" data sets.
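One common PU recipe is Elkan and Noto's scaling trick: train an ordinary classifier to separate labeled positives from the unlabeled pool, then rescale its probabilities. A toy sketch on synthetic data (the data and all thresholds are illustrative, not the project's real features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: positives around +1, negatives around -1. The negatives
# only ever appear inside the unlabeled pool, never with labels.
rng = np.random.RandomState(0)
pos = rng.normal(loc=+1.0, size=(200, 5))
neg = rng.normal(loc=-1.0, size=(200, 5))
labeled_pos = pos[:100]
unlabeled = np.vstack([pos[100:], neg])  # hidden mix of both classes

# Step 1: classifier for P(labeled | x), labeled positives vs unlabeled.
X = np.vstack([labeled_pos, unlabeled])
s = np.r_[np.ones(len(labeled_pos)), np.zeros(len(unlabeled))]
clf = LogisticRegression(max_iter=1000).fit(X, s)

# Step 2 (Elkan & Noto): c = P(labeled | positive), estimated as the mean
# score on known positives; dividing by c recovers P(positive | x).
c = clf.predict_proba(labeled_pos)[:, 1].mean()
p_positive = np.clip(clf.predict_proba(unlabeled)[:, 1] / c, 0.0, 1.0)
print("estimated c:", round(float(c), 3))
```

On this synthetic set, thresholding `p_positive` at 0.5 recovers most of the hidden positives in the unlabeled pool while rejecting the negatives.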

MichaelAquilina commented 10 years ago

One Class Clustering Idea

Is it possible to solve this problem through some form of clustering? The notion of distance can be used to determine how "alike" two document vectors are. The training data can be used to generate an exemplar instance as a combination of all the data points. The maximum (or mean) distance from each training instance to the exemplar can then be used to describe a hypersphere around the cluster. New instances are classified as inliers or outliers based on whether their distance to the exemplar is smaller or larger than the radius computed during training.
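The idea above fits in a few lines. A sketch where the function names are illustrative and a quantile parameter generalises the maximum-vs-mean choice for the radius:

```python
import numpy as np

def fit_exemplar(train, quantile=1.0):
    """Build an exemplar (centroid) and a hypersphere radius
    from positive-only training vectors."""
    exemplar = train.mean(axis=0)
    dists = np.linalg.norm(train - exemplar, axis=1)
    # quantile=1.0 reproduces the "maximum distance" rule above;
    # a lower quantile is a more outlier-robust alternative.
    radius = np.quantile(dists, quantile)
    return exemplar, radius

def predict_inlier(X, exemplar, radius):
    """True for rows of X that fall inside the hypersphere."""
    return np.linalg.norm(X - exemplar, axis=1) <= radius

rng = np.random.RandomState(0)
train = rng.normal(size=(200, 10))
exemplar, radius = fit_exemplar(train)
# With quantile=1.0 every training point is inside by construction.
print(predict_inlier(train, exemplar, radius).all())
```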

MichaelAquilina commented 10 years ago

You should also look into PEBL (Positive Example Based Learning), which tackles a problem very similar to the one you are trying to solve and looks very promising.

MichaelAquilina commented 10 years ago

From what I can tell so far, generating negative data from the alternative subreddits seems to be performing reasonably well. Further tests are needed to validate this, but it is a promising approach. Hopefully the use of NLP techniques (Wikipedia / WordNet) will improve ML performance.