Open alvations opened 10 years ago
I was once stuck on the same problem, and found the following points useful:

1) Regular classification rate (classification accuracy) isn't a good metric: if you correctly classify only the instances of the majority class (the class with many samples), it still gives you a high rate. The Area Under the ROC Curve (AUC) is a good metric for evaluating classifiers on such datasets.

2) You can increase the number of minority class samples by:
   i) Resampling: bootstrapping samples of the minority class.
   ii) Oversampling: generating new samples of the minority class. For this I'd recommend SMOTE, SPIDER or any variant. You can also use Prototype Generation (PG) to generate new samples of the minority class; there are specific PG techniques for imbalanced datasets, such as ASGP and EASGP.

3) You can reduce the number of majority class samples by:
   i) Random Undersampling.
   ii) Prototype Selection (PS) to reduce the imbalance level, such as One-Sided Selection (OSS). Alternatively, you can use Tomek Links, Edited Nearest Neighbors (ENN) and similar methods, but remove only the majority class outliers.

4) In your K-Fold validation, try to keep the same proportion between the classes in every fold. If the number of instances of the minority class is too low, you can reduce K until there are enough.

5) Use Multiple Classifier Systems (MCS):
   i) Ensemble Learning has been proposed as an interesting solution for learning from imbalanced data.

Also, be careful with the techniques/algorithms you use. In prototype generation, for example, there are techniques that perform well on regular datasets but, when used on unbalanced datasets, misclassify most (or all) instances of the minority class.
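Point 4 can be sketched in a few lines of plain Python. This is only a minimal illustration, not what any particular toolkit does: the function name `stratified_folds` and the language labels are made up for the example, and the rule for shrinking K (cap it at the size of the rarest class, but never go below 2) is an assumption.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    counts = Counter(labels)
    # Shrink k to the size of the rarest class so that class can appear in
    # every fold. k never drops below 2, so a class with a single sample
    # will still be missing from one fold.
    k = max(2, min(k, min(counts.values())))

    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical labels: two better-covered languages and one rare one.
labels = ["eng"] * 20 + ["deu"] * 8 + ["xyz"] * 3
folds = stratified_folds(labels, k=5)
for number, fold in enumerate(folds):
    print(number, Counter(labels[i] for i in fold))
```

Here the requested 5 folds are reduced to 3, because the rarest class has only 3 samples. In practice you would probably shuffle the indices within each class first.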
Source:Here
I'm unlikely to have time to commit to this in the next couple of months.
Given how sparse the data is for the rarer languages (which is where the problem gets interesting), I think the choice of algorithm is the most important consideration here. Use any toolkit you want, but it might well be that "standard" algorithms are a poor fit in this case.
Hi @rishikksh20, do you have experience working with low-resource languages? If you have, I would certainly be interested in discussing further.
1) ROC curves are defined for binary classification tasks, so don't seem appropriate here, since we have thousands of classes. My first thought is that macro-averaged precision/recall/F1 would be the way to go.
2-3) Oversampling, Prototype Selection, etc. generally make the assumption that we can linearly interpolate between data points, which may not be a good fit, depending on the features. Furthermore, for particularly rare classes, the observed samples may not be representative of the full distribution, and relying on oversampling would lead to a skewed classifier. At the end of the day, these methods are useful when trying to get a relatively simple classifier which works on balanced datasets to work on an imbalanced one - but I think a better approach in this case would be to use a model that can deal with imbalanced data directly.
4) Yes, that's good practice.
5) Ensemble learning doesn't help with imbalanced data if all of the individual classifiers fail in similar ways, so this doesn't solve the problem.
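To make the macro-averaging suggestion in point 1 concrete, here is a minimal pure-Python sketch. The function name `macro_prf` and the toy labels are hypothetical, and scoring an undefined precision or recall as 0 is one common convention rather than the only choice.

```python
from collections import Counter


def macro_prf(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all classes seen."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in classes:
        # Assumption: an undefined ratio (zero denominator) is scored as 0.
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(classes)
    # Every class counts equally, however few samples it has.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# A classifier that always predicts the majority class.
y_true = ["eng"] * 95 + ["xyz"] * 5
y_pred = ["eng"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = macro_prf(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

In this toy case accuracy is 0.95 while macro-averaged recall is only 0.5, because the rare class is never predicted and counts as much as the common one.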
For the rare classes, our dataset reminds me much more of the literature on one-shot learning - and some recent work (e.g. Lake et al. (2015), Salakhutdinov et al. (2013)) suggests that deep generative models can be very effective. The basic idea is that you can use what you learn about the well represented classes to inform decisions about the rare classes. Essentially, you can share statistical strength across classes.
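As a rough illustration of sharing statistical strength, the sketch below shrinks each class's feature distribution towards a distribution pooled over all classes. This is a deliberately simple stand-in (a smoothed count estimator), far simpler than the deep generative models cited above; the function names, the `strength` parameter and the toy counts are all hypothetical.

```python
from collections import Counter


def pooled_distribution(counts_by_class):
    """Feature distribution estimated from all classes together."""
    total = Counter()
    for counts in counts_by_class.values():
        total.update(counts)
    n = sum(total.values())
    return {feature: count / n for feature, count in total.items()}


def shrunk_prob(feature, cls, counts_by_class, pooled, strength=10.0):
    """P(feature | cls), pulled towards the pooled estimate.

    `strength` acts like a number of pseudo-observations drawn from the
    pooled distribution: classes with little data lean on it heavily,
    classes with lots of data barely move.
    """
    counts = counts_by_class[cls]
    n = sum(counts.values())
    return (counts[feature] + strength * pooled.get(feature, 0.0)) / (n + strength)


# Toy feature counts: one well-represented class, one with two observations.
counts_by_class = {
    "common": Counter({"a": 500, "b": 300, "c": 200}),
    "rare": Counter({"a": 2}),
}
pooled = pooled_distribution(counts_by_class)

rare_b = shrunk_prob("b", "rare", counts_by_class, pooled)
common_b = shrunk_prob("b", "common", counts_by_class, pooled)
print(rare_b, common_b)
```

The rare class never saw feature "b", yet it gets a non-zero probability borrowed from the pooled data, while the estimate for the common class stays close to its own relative frequency of 0.3.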
So now that we have data, data interfaces, a feature extractor and a naive Bayes demo, we need to figure out how to deal with the unbalanced data. Any clues on how to proceed with this?
I've tried the Shogun ML suite. It has a very interesting and ambitious aim, but I think its user-friendliness is still far from what I'd expect. sklearn or Weka is possibly easier. Either way, we have to pick one of the suites so that we can spend more time figuring out the "science" of rebalancing features rather than implementing the algorithm =)