Open stevehadd opened 4 years ago
The standard way of encoding a categorical feature (with no natural ordering) is to use one hot encoding. One issue with this is when you have many different possible values, you need to have one column/feature per possible value. For cruise, there are thousands or tens of thousands of different cruise ID, and hundred of platforms and institutes. This makes one hot encoding impractical. An alternative is embededd encoding, when the category is represented as a vector. https://towardsdatascience.com/deep-embeddings-for-categorical-variables-cat2vec-b05c8ab63ac0 https://towardsdatascience.com/categorical-embedding-and-transfer-learning-dd3c4af6345d
This requires there being some measure of which values are "close" to one another. There may not be any sensible way of evaluating this for the XBT data, so we might not be able to use this approach.
The standard way of encoding a categorical feature (with no natural ordering) is to use one hot encoding. One issue with this is when you have many different possible values, you need to have one column/feature per possible value. For cruise, there are thousands or tens of thousands of different cruise ID, and hundred of platforms and institutes. This makes one hot encoding impractical. An alternative is embededd encoding, when the category is represented as a vector. https://towardsdatascience.com/deep-embeddings-for-categorical-variables-cat2vec-b05c8ab63ac0 https://towardsdatascience.com/categorical-embedding-and-transfer-learning-dd3c4af6345d
This requires there being some measure of which values are "close" to one another. There may not be any sensible way of evaluating this for the XBT data, so we might not be able to use this approach.