MetOffice / XBTs_classification

Project for the classification of eXpendable Bathy Thermographs
BSD 3-Clause "New" or "Revised" License

Try embedded encoding for cruise, platform features #28

Open stevehadd opened 4 years ago

stevehadd commented 4 years ago

The standard way of encoding a categorical feature (with no natural ordering) is to use one-hot encoding. One issue with this is that when there are many possible values, you need one column/feature per possible value. For cruise, there are thousands or tens of thousands of different cruise IDs, and hundreds of platforms and institutes. This makes one-hot encoding impractical. An alternative is embedded encoding, where each category is represented as a vector. https://towardsdatascience.com/deep-embeddings-for-categorical-variables-cat2vec-b05c8ab63ac0 https://towardsdatascience.com/categorical-embedding-and-transfer-learning-dd3c4af6345d

This requires some measure of which values are "close" to one another. There may not be any sensible way of evaluating this for the XBT data, so we might not be able to use this approach.
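
To make the difference in feature width concrete, here is a minimal sketch comparing the two encodings. It is not code from this repository: the cruise IDs, the embedding size of 2, and the random embedding values are all assumptions made up for illustration. In a real model the embedding vectors would typically be learned during training (for example by an embedding layer in a neural network) rather than drawn at random.

```python
import numpy as np

# Hypothetical cruise IDs, invented for illustration only.
cruise_ids = ["cruise_a", "cruise_b", "cruise_c", "cruise_a", "cruise_d"]

# Map each distinct category to an integer index.
categories = sorted(set(cruise_ids))
index_of = {cat: i for i, cat in enumerate(categories)}
indices = np.array([index_of[c] for c in cruise_ids])

# One-hot encoding: one column per distinct category,
# so the width grows with the number of cruises.
one_hot = np.eye(len(categories))[indices]

# Embedded encoding: each category is a row of a small dense matrix,
# so the width is fixed by embedding_dim regardless of category count.
# The values are random placeholders standing in for learned weights.
embedding_dim = 2  # assumed size, chosen only for this example
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(categories), embedding_dim))
embedded = embedding_matrix[indices]

print(one_hot.shape)   # (5, 4)
print(embedded.shape)  # (5, 2)
```

With tens of thousands of cruise IDs the one-hot matrix would have tens of thousands of columns, while the embedded representation keeps whatever width is chosen for `embedding_dim`.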