Assess the feasibility of trying all the sci-kit learn classifiers

BenKaehler / short-read-tax-assignment

A repository for storing code and data related to a systematic comparison of short read taxonomy assignment tools

BSD 3-Clause "New" or "Revised" License

0 stars 1 forks source link

Assess the feasibility of trying all the sci-kit learn classifiers #5

Open BenKaehler opened 8 years ago

BenKaehler commented 8 years ago

Nick suggests a tiered approach. Ben will do a survey and see how many classifiers there are and whether it could be done.

BenKaehler commented 8 years ago

The problem with this approach is not so much that there are a dozen classifiers, but that each classifier can be deployed in many different ways. For instance, there are over ten ways of preprocessing features, three ways of defining the problem (multiclass, multiclass multioutput, multiclass hierarchical), two ways of calculating confidence, any number of k-mer length choices, boosting and bagging. Some of the methods can be combined.

Firstly, I will list what I think are the relevant classifiers in scikit-learn. Then I will list the modifications that could be applied to most or all of the scikit-learn classifiers. I have called them "multipliers" because they add dimensions to our search space. The following makes sense in the context of the scikit-learn supervised learning documentation.

Classifiers:

Generalised linear models
- Logistic regression
Linear and quadratic discriminant analysis
Support vector machines
- SVC
- Nu-SVC
- Linear SVC
Stochastic gradient descent
Nearest neighbours
- K-Neighbours
- Radius Neighbours
Gaussian process
Naive Bayes
- Multinomial
- Bernoulli
Decision trees
Ensemble methods
- Random forests
- Extremely randomised trees
- AdaBoost
- Gradient tree boosting
Neural network models

Multipliers:

Multioutput (aka multilabel)
Hierarchical
Bagging
Voting classifier
Bootstrap confidence
Calibrated probability confidence
Dimension reduction
- Latent semantic analysis
- Non-negative matrix factorisation
- Linear discriminant analysis
Feature selection
- Variance threshold
- Percentile
- K best
- FPR
- FDR
- FWE
- Tree-based
Word length
Feature extraction
- Dict vectorisation
- Feature hashing
- Count vectorisation
- Tf–idf term weighting

nbokulich commented 8 years ago

I agree, for the first sweep we need to trim this down to just the multipliers that we think will affect performance, and just the classifiers that we think may perform differently. We should remember that, following the two-tiered approach that we discussed, we want to do a very broad sweep in the first tier, comparing as many methods as possible but possibly fewer parameters/multipliers. The latter can be expanded in the second tier, as we focus on the "best" methods. One way to reframe this problem (unless if I am missing the point here) is that tier 1 is classifier selection and tier 2 is multiplier (or parameter) selection (with the secondary goal of selecting the best of the best).

Classifiers: I am not familiar with most of these classifiers and I realize that we probably can't make many assumptions about performance. However, if we can answer any of the following, we can probably cross some classifiers off our list:

How has classifier x performed for other (ideally similar) problems?
Do memory requirements or relative runtimes differ between these classifiers? I suspect some benchmarking data may exist somewhere — and speed is an important factor for our problem.

Multipliers: The key is not only what will affect performance, but what we predict may affect performance unevenly for different classifiers (this is also relevant if we think multiplier x will affect performance dramatically, but is only applicable to method x, not method y). In other words, if multiplier x may be expected to influence performance evenly across all methods, we can set it as a constant; only test the multipliers that we think will not behave the same for different classifiers in tier 1 phase. You would know much better than I would, but my estimate is that most of these multipliers would be most appropriate to leave for tier 2 (or even a third tier, where we optimize the best of the best classifier), unless if any are critical for the performance of one classifier but not another.

BenKaehler commented 8 years ago

Thanks Nick. Good points all. I have had a very interesting meeting with Stephen Gould. I will summarise my understanding of what he told me, in no particular order:

Choice of classifier is not critical. Most have similar performance given optimised meta-parameters.
It often is not important to optimise every meta-parameter for a given classifier. It is often good enough to choose reasonable values for most then optimise the most important meta-parameter.
Dimension reduction and feature selection can be safely disregarded.
Hierarchical classification is not ideal but might be worth trying. A better approach might be structured prediction.
Getting feature extraction right is important.
Word length optimisation should be revisited.
Using probability calibration for confidence is probably ok.
There isn't a practical, exhaustive solution to our current problem with all our combinations of classifiers and multipliers.
It could be that RDP Classifier is already close to optimal and that predicting species level taxonomies from amplicons is not feasible.
A good approach would be to pick a few classifiers, get feature extraction right, then perform a broad sweep over the available classifiers.

This final point is close to your ideas about tiered testing. I have not yet had time to digest what Steve told me about structured prediction. I think it's a way that we can utilise taxonomy tree information. It isn't supported in scikit-learn, but there is another package that may be appropriate.

So I think a sensible tiered approach might be:

Test a number of sensible multiplier combinations on our three favourite classifiers.
Pick the top few multiplier methods and sweep all of the available classifiers.
Fine tune the best multiplier and classifier combination.

BenKaehler commented 8 years ago

For Tier 1, I suggest eight scikit-learn classifiers and two PyStruct classifiers. I suggest that in all cases we preprocess the data using feature_extraction.text.HashingVectorizer followed by feature_extraction.text.TfidfTransformer. The eight scikit-learn classifiers should be

linear_model.LogisticRegression
naive_bayes.MultinomialNB
svm.SVC
ensemble.RandomForestClassifier

once with multioutput labels, once without for each. The two PyStruct classifiers should be

models.MultiClassClf
models.MultiLabelClf

I have dropped bagging and voting classifiers. I only included them initially because I saw them in the scikit-learn docs.

I have dropped feature selection and dimension reduction, as per Steve's recommendations.

I will implement calibrated probability confidence estimates because I doubt that the bootstrap approach of RDP will generalise successfully to the classifiers we intend to test.

Finally, I have dropped hierarchical classification for the moment. We have enough work to do, I'm not currently sufficiently confident or desperate to invent new classification methods, and PyStruct seems like a better way to do it anyway.

A wide meta-parameter search of the above methods (including the preprocessors) should cover the remaining multipliers, that is all the feature extraction methods and the multioutput option.

So, are ten classifiers too many for the first tier?

nbokulich commented 8 years ago

This sounds like a great approach to me, and all very good rationale. Ten sounds like a good number for tier1.

BenKaehler commented 7 years ago

Bad news: pystruct is BSD-licensed, but it won't run without cvxopt or ad3, both of which are GPL-licensed.

@gavinhuttley suggests keeping the structured learning classifiers in the tests, for the science of it (thanks Gavin). He also made the point that if we see dramatic performance gains with the structured learning classifiers we can then allocate or attempt to obtain more resources to overcome the problems that come with the GPL licenses.

Fair enough, I'll keep pystruct in for the moment, but pystruct should know that it's skating on thin ice and if it doesn't behave itself it will be relegated and hierarchical classification will be reinstated.