MetOffice / XBTs_classification

Project for the classification of eXpendable Bathy Thermographs
BSD 3-Clause "New" or "Revised" License

Investigate why by year splitting reduces performance (possible bug?) #69

Closed stevehadd closed 3 years ago

stevehadd commented 3 years ago

When using the sample_feature_values function to select cruises for final validation, we thought it a good idea to sample a fraction from each year, to ensure good representation of each year in the train and test sets. This seems to cause a dramatic reduction in performance. I think this points to a bug: there is a lot of custom code beyond standard pandas and scikit-learn, so it seems likely that this code causes the split to happen incorrectly, which leads to the poor results.

The line is in experiment.py:

ensemble_unseen_cruise_numbers = self.xbt_labelled.sample_feature_values(self.unseen_feature, fraction=self.ens_unseen_fraction, split_feature='year')

Currently we have removed the split_feature argument until we can fix the bug.
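For comparison while debugging, here is a minimal sketch of what per-year sampling could look like in plain pandas. It is not the implementation of sample_feature_values; the helper name `sample_values_by_group`, the column names, and the toy data are all hypothetical. The idea is to sample from the unique (year, cruise) pairs rather than from profiles, then check that no cruise ends up on both sides of the split and that the realized fraction per year is roughly what was requested.

```python
import pandas as pd


def sample_values_by_group(df, feature, fraction, split_feature, seed=0):
    """Pick a fraction of the unique `feature` values within each `split_feature` group.

    Hypothetical reference helper, not the project's sample_feature_values.
    """
    # One row per (group, value) pair, so each cruise counts once per year
    # no matter how many profiles it contains.
    pairs = df[[split_feature, feature]].drop_duplicates()
    sampled = pairs.groupby(split_feature)[feature].sample(
        frac=fraction, random_state=seed
    )
    return set(sampled)


# Toy data: four distinct cruises in each of two years (assumed column names).
df = pd.DataFrame({
    "year": [1990] * 6 + [1991] * 6,
    "cruise_number": ["A", "A", "B", "B", "C", "D",
                      "E", "E", "F", "G", "H", "H"],
})

unseen = sample_values_by_group(df, "cruise_number", 0.5, "year")

# Split on cruise membership, so every profile of a sampled cruise is held out.
is_unseen = df["cruise_number"].isin(unseen)
unseen_df, seen_df = df[is_unseen], df[~is_unseen]

# Sanity checks: no cruise in both sets, and the share of held-out profiles per year.
overlap = set(unseen_df["cruise_number"]) & set(seen_df["cruise_number"])
print("cruises in both sets:", overlap)
print(is_unseen.groupby(df["year"]).mean())
```

One thing worth checking with this kind of split: if a cruise spans more than one year, it appears in several (year, cruise) pairs, and holding it out removes its profiles from every year it touches, so the realized per-year fraction can differ from the requested one. Running the same two checks on the output of sample_feature_values with split_feature='year' might help narrow down whether the custom code is at fault.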