TuxML / size-analysis

Analysis of 125+ Linux configurations (this time for predicting/understanding kernel sizes)

Categorical encoding #11

Open FAMILIAR-project opened 5 years ago

FAMILIAR-project commented 5 years ago

Some options are tristate (n, y, m), and we need a strategy to encode their values. As discussed, our current solution (0, 1, 2) has limitations for some learning algorithms: it imposes an artificial ordering and magnitude on values that are really categorical.

We could try some strategies, for instance:

- one-hot encoding: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html and http://contrib.scikit-learn.org/categorical-encoding/index.html
- dummy variables (dummification): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html and https://en.wikiversity.org/wiki/Dummy_variable_(statistics)
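As a minimal sketch of both routes (the option names and size values below are invented toy data, not from our dataset), pandas covers the two encodings directly:

```python
import pandas as pd

# Invented toy data: two tristate options and a measured kernel size (MB).
df = pd.DataFrame({
    "CONFIG_A": ["y", "n", "m", "y"],
    "CONFIG_B": ["m", "m", "n", "y"],
    "size":     [12.3, 10.1, 11.8, 13.0],
})

# One-hot: one 0/1 column per (option, value) pair,
# e.g. CONFIG_A_m, CONFIG_A_n, CONFIG_A_y.
one_hot = pd.get_dummies(df, columns=["CONFIG_A", "CONFIG_B"])

# Dummy coding: drop the first level of each option, so k values give
# k-1 columns and the dropped level becomes the implicit baseline.
dummy = pd.get_dummies(df, columns=["CONFIG_A", "CONFIG_B"], drop_first=True)
```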

There are some subtleties to think about, though: https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn

In fact, we need to think about these subtleties for all kinds of algorithms: the right encoding can differ depending on whether we use linear regression or neural networks. I like the discussion at https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features, which recommends using drop when linear regression is employed, since keeping all k columns per option makes them perfectly collinear with the intercept.
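A sketch of what that could look like end to end, again on the invented toy data from above (OneHotEncoder's `drop='first'` is the real scikit-learn parameter; everything else here is illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Same invented toy data as above.
df = pd.DataFrame({
    "CONFIG_A": ["y", "n", "m", "y"],
    "CONFIG_B": ["m", "m", "n", "y"],
    "size":     [12.3, 10.1, 11.8, 13.0],
})

# drop='first' keeps k-1 columns per option; with all k columns kept, each
# option's columns sum to 1, which an unregularized linear model handles badly.
encoder = ColumnTransformer(
    [("tristate", OneHotEncoder(drop="first"), ["CONFIG_A", "CONFIG_B"])]
)

model = make_pipeline(encoder, LinearRegression())
model.fit(df[["CONFIG_A", "CONFIG_B"]], df["size"])
print(model.predict(df[["CONFIG_A", "CONFIG_B"]]))
```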

Another appealing idea of @llesoil is to consider 'm' as similar to 'n' w.r.t. size: an option built as a module is compiled outside the kernel image, so basically 'm' has no effect on kernel size.
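If that hypothesis holds, the tristate problem reduces to a binary one and no one-hot machinery is needed at all; a minimal sketch (option values invented):

```python
import pandas as pd

# Invented values for a single tristate option.
values = pd.Series(["y", "n", "m", "y", "m"], name="CONFIG_A")

# Collapse 'm' into 'n': under the hypothesis, a module contributes nothing
# to the kernel image size, so the column becomes an ordinary binary feature.
binary = values.replace({"m": "n"}).map({"n": 0, "y": 1})
```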