cheetahbright / tsa-decision-trees

Decision tree implementation on a data set from the Transportation Security Administration.

Create embeddings for categorical variables #3

Open malctaylor15 opened 6 years ago

malctaylor15 commented 6 years ago

Create embeddings from some of the one hot encoded variables

Embeddings can be of size min(50, number_cat_vars//2)

Can produce embeddings for the variables -- ['Airport Code', 'Airport Name', 'Airline Name', 'Claim Type', 'Claim Site', 'Item Category', 'Disposition']

Keras has an implementation of embedding layers
https://keras.io/layers/embeddings/
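
For a single column, a Keras layer along those lines could look like the sketch below (the airline count is a made-up placeholder; note the layer takes integer category codes rather than one-hot vectors):

```python
from keras.layers import Input, Embedding, Flatten

n_airlines = 100                       # placeholder: number of distinct airline names
emb_size = min(50, n_airlines // 2)    # sizing rule suggested above

airline_idx = Input(shape=(1,), name="airline_idx")    # integer-coded category
x = Embedding(input_dim=n_airlines, output_dim=emb_size,
              name="airline_embedding")(airline_idx)   # -> (batch, 1, emb_size)
x = Flatten()(x)                                        # -> (batch, emb_size)
```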

I believe we can make some using fastai, but I'm not sure exactly how yet. There are some notes here ... https://github.com/fastai/fastai/blob/master/courses/dl1/lesson3-rossman.ipynb

Options: Maybe start with one somewhat small variable like Claim Type, or reduce the size of the embedding, to see if it yields results. We can train the embedding by itself for one iteration. The next step would be to train several embeddings at the same time.

Goal:

  1. Create embeddings for the categorical variables, trained on close amount; move the neural network architecture from simple to more complicated (1 FC layer, 3 FC layers, etc.?) -- see the sketch after this list

Extra:

  1. Validate the embeddings by checking for correlations with the average close amount and seeing whether the vectors that are close to each other correspond to anything meaningful without the extra variables

  2. Export embeddings matrix for use in other algorithms (and data cleaning)

More Extra:

  1. Use t-SNE to visualize the embeddings in 2D space
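
A rough sketch of the simplest version of Goal 1 (one categorical variable, one FC layer, regressing on close amount). The data below is fake stand-in data; real code would pull the integer-coded column and Close Amount from the cleaned TSA frame, and more Dense layers can be stacked for the "3 FC" variant:

```python
import numpy as np
from keras.layers import Input, Embedding, Flatten, Dense
from keras.models import Model

# Fake stand-in data: integer-coded Claim Type levels and a continuous close amount.
n_levels = 12
codes = np.random.randint(0, n_levels, size=(1000, 1))
close_amount = np.random.gamma(2.0, 100.0, size=1000)

emb_size = min(50, n_levels // 2)

inp = Input(shape=(1,), name="claim_type_idx")
x = Embedding(n_levels, emb_size, name="claim_type_embedding")(inp)
x = Flatten()(x)
x = Dense(16, activation="relu")(x)       # the "1 FC" version; stack more Dense layers later
out = Dense(1, name="close_amount")(x)    # continuous target, plain linear output

model = Model(inp, out)
model.compile(optimizer="adam", loss="mse")
model.fit(codes, close_amount, epochs=1, batch_size=64)   # one pass, as suggested above
```
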
cheetahbright commented 6 years ago

@malctaylor15 Are you referring to this sort of embedding?

malctaylor15 commented 6 years ago

I was referring to the machine learning embedding rather than the Python version in the link. The idea of embeddings comes up more often and is used more commonly with natural language text and word2vec. The training isn't quite the same (there are no context windows in this case), but we want to change the one-hot representation into a rich vector with a size of our choice.

This is a link that is more focused on word2vec. I will keep looking for better videos, but for now ... here

Keep in mind that we are only trying to predict the close amount (which is a continuous variable) and that the raw variables are one-hot encoded in a similar way that words are. What we are trying to extract is the embedding matrix (the thing that starts out as random but gets better as we train it). The size of the matrix is (number of rows = # of one-hot levels) x (number of columns = # of embedding dimensions).
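
A toy numpy illustration of those shapes (the numbers are made up): the matrix has one row per one-hot level and one column per embedding dimension, and looking up a category's embedding is the same as multiplying its one-hot vector by the matrix.

```python
import numpy as np

n_levels = 7        # e.g. 7 claim sites -> 7-dimensional one-hot vectors
emb_size = 3        # embedding size of our choice

# The embedding matrix: starts out random and is updated during training.
# Shape: (# of one-hot levels) x (# of embedding dimensions).
E = np.random.randn(n_levels, emb_size)

one_hot = np.zeros(n_levels)
one_hot[4] = 1.0                  # category index 4 as a one-hot vector

dense = one_hot @ E               # multiplying by the matrix just selects row 4
assert np.allclose(dense, E[4])   # an embedding lookup is exactly this row selection
```

If we build the model in Keras, the trained matrix can be pulled back out with something like `model.get_layer("claim_type_embedding").get_weights()[0]`, which is the array we would export for the other algorithms.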

malctaylor15 commented 6 years ago

In this link here, they begin playing with the vectors. In this case the "words" would be the different levels of the categorical variables. I will be incorporating some of the functions they used; some slight modifications are needed for the change from words to generalized tokens.

For example, we might want to find which airlines are the most similar (and dissimilar) to Jet Blue.

We might also want to see which airlines have similar relationships according to our data. So American Airlines is to Delta as China Southern Airlines is to which airline (China Eastern Airlines?).
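
Both of those questions can be answered with cosine similarity over the embedding rows once we have them. A sketch (the `emb` matrix and `names` list are whatever comes out of the trained airline embedding; these helper functions are placeholders, not the ones from the link):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(name, emb, names, topn=5):
    """Rank the other levels by cosine similarity to `name` (most and least similar)."""
    v = emb[names.index(name)]
    scores = [(other, cosine_sim(v, emb[i]))
              for i, other in enumerate(names) if other != name]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:topn]

def analogy(a, b, c, emb, names, topn=3):
    """a is to b as c is to ? -- word2vec-style vector arithmetic."""
    target = emb[names.index(b)] - emb[names.index(a)] + emb[names.index(c)]
    scores = [(other, cosine_sim(target, emb[i]))
              for i, other in enumerate(names) if other not in (a, b, c)]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:topn]

# e.g. most_similar("Jet Blue", emb, airline_names)
#      analogy("American Airlines", "Delta", "China Southern Airlines", emb, airline_names)
```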

We will also have to keep in mind that this data is about lost claims, so I am not sure what kinds of correlations will appear. These embeddings were trained using the claim amount of lost items and the airline.