investigate discrepancy between training and validation set errors

jsherrah commented 10 years ago

On grid search the discrepancy is 0.3, which is quite large. Fundamentally, the problem is that the feature distributions of the two data sets are too dissimilar. This can be caused by the data sets themselves being too different, or by the features being calculated differently for the two sets.
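
As a rough complement to eyeballing the images, a dissimilarity between the two feature distributions could be quantified per feature. This is only a sketch under assumptions: the feature matrices below are synthetic stand-ins, `feature_shift` is a hypothetical helper (not part of the repo), and the standardised mean difference is just one simple choice of measure.

```python
import numpy as np

def feature_shift(train_features, val_features):
    """Per-feature standardised mean difference between two feature matrices.

    Rows are samples (e.g. pixels), columns are features.
    """
    train = np.asarray(train_features, dtype=float)
    val = np.asarray(val_features, dtype=float)
    # Pooled standard deviation per feature; the epsilon avoids division by zero.
    pooled_std = np.sqrt(0.5 * (train.var(axis=0) + val.var(axis=0))) + 1e-12
    return np.abs(train.mean(axis=0) - val.mean(axis=0)) / pooled_std

# Synthetic stand-in data: three features, same distribution in both sets...
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(5000, 3))
val = rng.normal(0.0, 1.0, size=(5000, 3))
# ...except one feature that is "computed differently" for validation.
val[:, 2] += 2.0

shift = feature_shift(train, val)
print(shift)  # the third entry should stand out from the other two
```

A feature with a large value here would be a candidate for closer inspection; small values everywhere would suggest the two sets are at least similar in their first two moments.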

Anthony, could you please investigate this? Perhaps start by listing and eyeballing the images in the training and validation sets. (We did separate them by image, didn't we?)

jsherrah commented 10 years ago

There are other possibilities:

- The classifier is overfitting during training. This is unlikely, since the grid search minimises the validation set error. Of course, the grid search could be broken...
- The classifier is not applied correctly to the validation set data, for example if normalisation is skipped or done incorrectly for that data set.

jsherrah commented 10 years ago

I have investigated, and my conclusion is that these features are not good enough. Here are the details. The results below use HSV colour and textons.

- Using showFeatures.py, I looked at the feature distributions for training and validation, and they look about the same. No degeneracies, anyway.
- Reversing the roles of the data sets (i.e. train on validation, test on training) gives similar accuracies on each set, and a similar discrepancy in generalisation. So the problem is not, for example, that validation features are being computed incorrectly.
- After some googling, it turns out to be a no-no to report the training set error for a random forest. On the training data the forest looks like it has overfit, so the accuracy is over-estimated. Instead, the classifier produces an "out-of-bag" estimate, and this is what should be reported. This accuracy is much closer to the validation accuracy:
  - training accuracy: 94.93%
  - training oob accuracy: 76.56%
  - validation accuracy: 64.92%
- The class average accuracy turns out to be really terrible: on the validation set the per-pixel accuracy is 64.92%, but the class average is 43.00%. In the literature this gap is more like 10 percentage points, not 20. The accuracy per class on the validation set is:

   - average accuracy per class =  0.430049881425
      building: 0.666282
      grass: 0.926347
      tree: 0.710614
      cow: 0.635071
      sheep: 0.297214
      sky: 0.944312
      aeroplane: 0.070000
      water: 0.515385
      face: 0.624454
      car: 0.397980
      bicycle: 0.515000
      flower: 0.586275
      sign: 0.287709
      bird: 0.000000
      book: 0.489540
      chair: 0.000000
      road: 0.683119
      cat: 0.212598
      dog: 0.226244
      body: 0.163539
      boat: 0.079365

Note that bird and chair are 0! Is the data dodgy for these examples? Are there too few examples for these classes? Or are they just hard?
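
For reference, the four kinds of figure quoted above (training, out-of-bag, validation and class-average accuracy) can be obtained roughly as follows. This is a sketch on synthetic, imbalanced data, not the project's pipeline: the data set, split and forest settings are all assumptions, and "class average" is taken here to mean the mean of the per-class recalls.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the per-pixel feature vectors.
X, y = make_classification(
    n_samples=3000, n_features=20, n_informative=8, n_classes=4,
    weights=[0.6, 0.25, 0.1, 0.05], flip_y=0.05, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# oob_score=True asks the forest to score each training sample using only
# the trees that did not see that sample in their bootstrap draw.
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))  # optimistic
oob_acc = clf.oob_score_                                   # out-of-bag estimate
val_pred = clf.predict(X_val)
val_acc = accuracy_score(y_val, val_pred)                  # overall accuracy
class_avg = recall_score(y_val, val_pred, average="macro") # mean per-class recall

print(f"training accuracy:     {train_acc:.4f}")
print(f"training oob accuracy: {oob_acc:.4f}")
print(f"validation accuracy:   {val_acc:.4f}")
print(f"class average:         {class_avg:.4f}")
```

With fully grown trees the training accuracy is typically close to 1, which is why the out-of-bag figure is the more honest one to compare against validation.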

To answer the second question, the class proportions vary wildly. In the training set:

   - class proportions in Training set:
             building: 0.113521 (  7596 examples)
                grass: 0.189500 ( 12680 examples)
                 tree: 0.075202 (  5032 examples)
                  cow: 0.032654 (  2185 examples)
                sheep: 0.022880 (  1531 examples)
                  sky: 0.099562 (  6662 examples)
            aeroplane: 0.017276 (  1156 examples)
                water: 0.086172 (  5766 examples)
                 face: 0.019368 (  1296 examples)
                  car: 0.035853 (  2399 examples)
              bicycle: 0.026916 (  1801 examples)
               flower: 0.024704 (  1653 examples)
                 sign: 0.020982 (  1404 examples)
                 bird: 0.013674 (   915 examples)
                 book: 0.052411 (  3507 examples)
                chair: 0.018023 (  1206 examples)
                 road: 0.092882 (  6215 examples)
                  cat: 0.016335 (  1093 examples)
                  dog: 0.014287 (   956 examples)
                 body: 0.020579 (  1377 examples)
                 boat: 0.007218 (   483 examples)

Bird and chair are among the least represented classes. Still, with 915 and 1206 examples respectively, there are plenty of examples for both.
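
The proportions in the listing follow directly from the example counts. The short sketch below recomputes them from the counts quoted above and also reports the ratio between the largest and smallest class; the variable names are illustrative only.

```python
# Example counts per class, copied from the training-set listing above.
train_counts = {
    "building": 7596, "grass": 12680, "tree": 5032, "cow": 2185,
    "sheep": 1531, "sky": 6662, "aeroplane": 1156, "water": 5766,
    "face": 1296, "car": 2399, "bicycle": 1801, "flower": 1653,
    "sign": 1404, "bird": 915, "book": 3507, "chair": 1206,
    "road": 6215, "cat": 1093, "dog": 956, "body": 1377, "boat": 483,
}

total = sum(train_counts.values())
proportions = {name: n / total for name, n in train_counts.items()}

most = max(train_counts, key=train_counts.get)
least = min(train_counts, key=train_counts.get)
imbalance_ratio = train_counts[most] / train_counts[least]

# Print classes from rarest to most common.
for name, n in sorted(train_counts.items(), key=lambda kv: kv[1]):
    print(f"{name:>10}: {proportions[name]:.6f} ({n:6d} examples)")
print(f"largest/smallest class: {most}/{least}, ratio {imbalance_ratio:.1f}")
```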

The scikit-learn documentation also says that decision trees do not handle unbalanced class distributions well. I tried dividing the estimated class probabilities by the prior probabilities shown above, which helped somewhat (48% class average instead of 43%), although the pixel accuracy dropped to 59.64%. However, it is not a proper treatment: the class distributions should be even during training, or a cost matrix should be introduced. scikit-learn doesn't have obvious controls for this for random forests (for example, maintaining an even class distribution when selecting a random sample of data).
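
The prior-division trick described above can be sketched as below. The probabilities and priors are made-up numbers for illustration, and `divide_by_priors` is a hypothetical helper name, not something from the repo or from scikit-learn.

```python
import numpy as np

def divide_by_priors(probs, priors):
    """Divide each row of class probabilities by the class priors, then renormalise."""
    probs = np.asarray(probs, dtype=float)
    priors = np.asarray(priors, dtype=float)
    scaled = probs / priors  # priors broadcast across the rows
    return scaled / scaled.sum(axis=1, keepdims=True)

# Made-up priors for three classes (one dominant, one rare).
priors = np.array([0.80, 0.15, 0.05])
# Made-up classifier outputs for two samples.
probs = np.array([[0.50, 0.20, 0.30],
                  [0.70, 0.25, 0.05]])

adjusted = divide_by_priors(probs, priors)
print(probs.argmax(axis=1))     # both samples go to the dominant class
print(adjusted.argmax(axis=1))  # after the correction the rarer classes win
```

This trades pixel accuracy for class-average accuracy, which matches the direction of the numbers reported above. Newer scikit-learn releases also accept a `class_weight` argument on `RandomForestClassifier` (for example `"balanced"` or `"balanced_subsample"`), which may be a more direct control than a post-hoc correction.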

jsherrah commented 10 years ago

I don't have a concrete answer, but here is my hunch based on the above:

- Spatial distributions of clustered features should be used, as in TextonBoost, to get the accuracy up to scratch.
- A classifier should be used that can handle uneven class distributions, as in TextonBoost (sample weights can be applied).

So I think we have pushed our simple implementation as far as it will go; time to man up and go for gold with more complex features and inference.
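
As an illustration of the sample-weight idea only (this is a scikit-learn random forest on toy data, not TextonBoost), per-sample weights can be chosen so that every class contributes the same total weight during training. The data below are synthetic and the class sizes are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
# Hypothetical imbalanced data: 900 samples of class 0, 100 of class 1.
X = np.vstack([rng.normal(0.0, 1.0, size=(900, 5)),
               rng.normal(1.0, 1.0, size=(100, 5))])
y = np.array([0] * 900 + [1] * 100)

# "balanced" weights are inversely proportional to class frequency,
# so each class ends up with the same total weight.
weights = compute_sample_weight("balanced", y)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=weights)

print(weights[y == 0].sum(), weights[y == 1].sum())
```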

RockStarCoders / alienMarkovNetworks

investigate discrepancy between training and validation set errors #31