Additional analysis for paper

MetOffice / XBTs_classification

Project for the classification of eXpendable Bathy Thermographs

BSD 3-Clause "New" or "Revised" License


Additional analysis for paper #103

Closed stevehadd closed 3 years ago

stevehadd commented 3 years ago

Based on discussion with Francesco, here is some additional analysis we could add to the paper to make a stronger case for how we have chosen to do things:

calculate permutation importance to show which features are useful
include balanced accuracy and some other metrics that are better suited to an imbalanced class problem than recall
calculate some confusion matrices to shed more light on performance for different classes
do some bootstrapping to get better performance for small classes by oversampling them in the training set
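As a rough illustration of the metrics and confusion matrix items above, here is a minimal sketch using scikit-learn. The labels "A", "B" and "C" are made-up stand-ins for probe types (not taken from the XBT data), with "C" deliberately rare so the gap between plain accuracy and balanced accuracy is visible.

```python
import numpy as np
from sklearn.metrics import (
    balanced_accuracy_score,
    confusion_matrix,
    recall_score,
)

# Hypothetical toy labels; class "C" has only two samples.
y_true = np.array(["A"] * 8 + ["B"] * 4 + ["C"] * 2)
y_pred = np.array(["A"] * 8 + ["B"] * 3 + ["A"] + ["A", "C"])

labels = ["A", "B", "C"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Recall for each class separately, in the order given by `labels`.
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)

# Balanced accuracy is the unweighted mean of the per-class recalls,
# so the rare class counts as much as the common one.
balanced_acc = balanced_accuracy_score(y_true, y_pred)

# Plain accuracy is dominated by the largest class.
accuracy = float((y_true == y_pred).mean())

print(cm)
print(dict(zip(labels, per_class_recall)))
print(f"accuracy={accuracy:.3f} balanced_accuracy={balanced_acc:.3f}")
```

In this toy case the poor recall on "C" pulls balanced accuracy below plain accuracy, which is the kind of effect we would want to surface for the small classes.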

stevehadd commented 3 years ago

scikit-learn bootstrapping as part of cross-validation: https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/modules/generated/sklearn.cross_validation.Bootstrap.html

Alternatively, the pandas sample method could be used for bootstrapping: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html

We could use this to even out class support for a variety of features:

instrument type
country
year
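A minimal sketch of the pandas route, assuming the training data is a DataFrame with one row per profile. The column names (`instrument`, `max_depth`), the values and the helper `oversample_to_balance` are hypothetical, purely for illustration. The linked `sklearn.cross_validation.Bootstrap` page is from an old scikit-learn release and, as far as I know, that class is not available in current versions, so this sketch relies only on `DataFrame.sample`.

```python
import pandas as pd


def oversample_to_balance(df, column, random_state=0):
    """Resample with replacement so every value of `column` has equal support.

    Each group is drawn up to the size of the largest group.
    """
    target = df[column].value_counts().max()
    parts = [
        group.sample(n=target, replace=True, random_state=random_state)
        for _, group in df.groupby(column)
    ]
    return pd.concat(parts, ignore_index=True)


# Hypothetical training set with very uneven instrument support.
train = pd.DataFrame(
    {
        "instrument": ["T4"] * 6 + ["T7"] * 3 + ["DEEP BLUE"] * 1,
        "max_depth": [460, 455, 450, 462, 458, 449, 760, 755, 765, 900],
    }
)

balanced = oversample_to_balance(train, "instrument")
counts = balanced["instrument"].value_counts()
print(counts)
```

The same helper could be pointed at a country or year column. It should only be applied to the training split, otherwise duplicated rows can end up in both train and test and inflate the scores.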

stevehadd commented 3 years ago

Some more ways to look at links between features: https://datascience.stackexchange.com/questions/893/how-to-get-correlation-between-two-categorical-variable-and-a-categorical-variab
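One commonly used measure of association between two categorical features is Cramér's V, which is derived from the chi-squared statistic of their contingency table. The sketch below computes it with pandas and NumPy only. The `country` and `instrument` values are invented, and this is just one possible option, not necessarily what we will end up using.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Cramér's V between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts if the two variables were independent.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    min_dim = min(table.shape) - 1
    if min_dim == 0:
        # Undefined when one of the variables takes a single value.
        return float("nan")
    return float(np.sqrt(chi2 / (n * min_dim)))


# Hypothetical data: instrument fully determined by country...
country = pd.Series(["US", "US", "JP", "JP"])
instrument_linked = pd.Series(["T4", "T4", "T7", "T7"])
# ...versus instrument spread evenly across countries.
instrument_mixed = pd.Series(["T4", "T7", "T4", "T7"])

v_linked = cramers_v(country, instrument_linked)
v_mixed = cramers_v(country, instrument_mixed)
print(v_linked, v_mixed)
```

Note that this plain version is known to be biased upwards for small samples, so the values are better used for ranking feature pairs than as absolute numbers.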

stevehadd commented 3 years ago

various ways to do feature selection more systematically using sklearn or pandas:

stevehadd commented 3 years ago

Permutation feature importance algorithm implementation: https://scikit-learn.org/stable/modules/permutation_importance.html
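A minimal sketch of how `sklearn.inspection.permutation_importance` could be used, on synthetic data rather than the XBT dataset. The random forest, the two feature names and the choice of `balanced_accuracy` as the scorer are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on the first feature.
rng = np.random.default_rng(0)
n_samples = 400
informative = rng.normal(size=n_samples)
noise = rng.normal(size=n_samples)
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)
feature_names = ["informative", "noise"]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Shuffle each feature in turn on held-out data and record the score drop.
result = permutation_importance(
    clf,
    X_test,
    y_test,
    scoring="balanced_accuracy",
    n_repeats=10,
    random_state=0,
)

for name, mean, std in zip(
    feature_names, result.importances_mean, result.importances_std
):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Computing the importances on the test split rather than the training split shows which features the classifier relies on to generalise, not just which ones it has memorised.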

stevehadd commented 3 years ago

Docs for more metrics:

stevehadd commented 3 years ago

This has been implemented in various notebooks and updated for the batch code in PR #111.