Sales-Choice-Volunteering-Project / EmotionAnalyzerWeka

The program for obtaining emotion data

Research on filtering unbalanced data #35

Closed sherlockliang888 closed 3 years ago

sherlockliang888 commented 3 years ago
  1. Research and implement ROC and AUC on the testing results
  2. Research how to filter unbalanced data
sherlockliang888 commented 3 years ago

With under-sampling, we randomly select a subset of samples from the class with more instances so that every class ends up with the same number of samples. The main disadvantage of under-sampling is that we lose potentially relevant information from the left-out samples. Weka filter: weka.filters.supervised.instance.SpreadSubsample
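To make the under-sampling idea concrete, here is a minimal plain-Java sketch that does not use Weka. The class name `UnderSampleSketch` and the helper `underSample` are hypothetical names made up for illustration; this is not how SpreadSubsample is implemented, only the basic idea of shrinking every class to the size of the smallest one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration class, not part of Weka or this repository.
class UnderSampleSketch {

    /** Returns the (sorted) row indices of a class-balanced random subset. */
    public static List<Integer> underSample(List<String> labels, long seed) {
        // Group row indices by class label.
        Map<String, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            byClass.computeIfAbsent(labels.get(i), k -> new ArrayList<>()).add(i);
        }
        // The smallest class decides how many rows every class keeps.
        int minSize = Integer.MAX_VALUE;
        for (List<Integer> rows : byClass.values()) {
            minSize = Math.min(minSize, rows.size());
        }
        // Randomly keep minSize rows from each class; the rest are dropped.
        Random rng = new Random(seed);
        List<Integer> kept = new ArrayList<>();
        for (List<Integer> rows : byClass.values()) {
            List<Integer> shuffled = new ArrayList<>(rows);
            Collections.shuffle(shuffled, rng);
            kept.addAll(shuffled.subList(0, minSize));
        }
        Collections.sort(kept);
        return kept;
    }

    public static void main(String[] args) {
        // 6 "neg" rows and 2 "pos" rows: 4 of the "neg" rows are discarded.
        List<String> labels = List.of("neg", "neg", "neg", "neg", "neg", "neg", "pos", "pos");
        System.out.println("kept rows: " + underSample(labels, 42L));
    }
}
```

The four discarded "neg" rows are exactly the lost information mentioned above.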

With over-sampling, we either randomly duplicate samples from the class with fewer instances or generate additional synthetic instances from the data that we have, so that every class ends up with the same number of samples. This avoids losing information, but it increases the risk of overfitting: if the data is over-sampled before it is split, copies or near-copies of the same sample can end up in both the training and the test data. Weka filter: weka.filters.supervised.instance.SMOTE (generates synthetic minority-class instances)
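As a rough illustration of generating additional instances, the sketch below interpolates between a minority-class row and one of its nearest minority-class neighbours, which is the core idea behind SMOTE. It is a simplified plain-Java sketch, not Weka's implementation; the class `SmoteSketch`, the method `synthesize`, and its parameters are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class, not part of Weka or this repository.
class SmoteSketch {

    // Squared Euclidean distance between two rows of equal length.
    static double dist2(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return sum;
    }

    /** Creates `count` synthetic rows from the minority-class rows only. */
    public static List<double[]> synthesize(double[][] minority, int count, int k, long seed) {
        if (minority.length < 2 || k < 1) {
            throw new IllegalArgumentException("need at least two minority rows and k >= 1");
        }
        Random rng = new Random(seed);
        int neighbours = Math.min(k, minority.length - 1);
        List<double[]> synthetic = new ArrayList<>();
        for (int n = 0; n < count; n++) {
            // Pick a random minority row as the starting point.
            int base = rng.nextInt(minority.length);
            double[] x = minority[base];
            // Order all other minority rows by distance to the starting point.
            List<Integer> others = new ArrayList<>();
            for (int j = 0; j < minority.length; j++) {
                if (j != base) others.add(j);
            }
            others.sort(Comparator.comparingDouble(j -> dist2(x, minority[j])));
            // Choose one of the nearest neighbours and interpolate towards it.
            double[] nb = minority[others.get(rng.nextInt(neighbours))];
            double u = rng.nextDouble();
            double[] s = new double[x.length];
            for (int d = 0; d < x.length; d++) {
                s[d] = x[d] + u * (nb[d] - x[d]);
            }
            synthetic.add(s);
        }
        return synthetic;
    }

    public static void main(String[] args) {
        double[][] minority = {{0.0, 0.0}, {1.0, 1.0}, {2.0, 0.0}};
        for (double[] row : synthesize(minority, 3, 2, 7L)) {
            System.out.println(row[0] + ", " + row[1]);
        }
    }
}
```

To limit the overfitting risk, a sketch like this would be applied to the training split only, after the test data has been set aside.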

sherlockliang888 commented 3 years ago

When AUC = 0.5, the classifier is not able to distinguish between positive and negative class points. This means the classifier is either predicting a random class or predicting the same constant class for all the data points.
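A small plain-Java sketch of why a constant classifier lands on 0.5: AUC can be computed as the fraction of (positive, negative) pairs in which the positive point gets the higher score, with ties counted as half. The class `AucSketch` and method `auc` are hypothetical names for this example, and the pairwise loop is chosen for clarity, not speed.

```java
// Hypothetical illustration class, not part of Weka or this repository.
class AucSketch {

    /** AUC as the share of positive/negative pairs ranked correctly (ties count 0.5). */
    public static double auc(double[] scores, boolean[] positive) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!positive[i]) continue;
            for (int j = 0; j < scores.length; j++) {
                if (positive[j]) continue;
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;   // positive ranked above negative
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;   // tie: no better than a coin flip
                }
            }
        }
        if (pairs == 0) {
            throw new IllegalArgumentException("need at least one positive and one negative");
        }
        return wins / pairs;
    }

    public static void main(String[] args) {
        boolean[] positive = {true, false, true, false, false};
        // Same score for every point: every pair is a tie.
        System.out.println(auc(new double[] {0.7, 0.7, 0.7, 0.7, 0.7}, positive)); // prints 0.5
        // Every positive scored above every negative.
        System.out.println(auc(new double[] {0.9, 0.2, 0.8, 0.3, 0.1}, positive)); // prints 1.0
    }
}
```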