Palashio / libra

Ergonomic machine learning for everyone.
http://libradocs.org/
MIT License
1.92k stars 109 forks

Implement rough MCA idea for reducing dimensionality of one hot columns #147

Closed: jbofill10 closed this 4 years ago

jbofill10 commented 4 years ago

I left it at 6 because I noticed a flaw in my idea of tracking the categorical columns. Instead, we should scan those columns, unless there are already a large number of them, and check the number of unique values in each one to better understand how the dimensionality will grow after one-hot encoding.

As for the number of components, MCA is not like PCA, where you can select components by explained variance, so really the only way to find the optimal n is a grid search that compares train and validation scores.

If this is something we want to pursue, I can add those features as well. For now this is a rough idea, and I'll close the issue now or later depending on whether we want to add none, some, or all of these ideas.
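To make the two ideas concrete, here is a minimal sketch and not the code in this PR. `one_hot_widths` counts the unique values per categorical column to estimate how wide the one-hot matrix would get, and `pick_n_components` is a plain grid search over candidate component counts. Both function names are hypothetical, and `score_fn` is an assumed callback that would fit the MCA step plus a model for a given n and return `(train_score, val_score)`, higher being better.

```python
def one_hot_widths(rows, categorical_columns):
    """Map each categorical column to its number of unique values.

    rows: list of dicts, one per record (an assumed input layout).
    The sum of the returned values is the one-hot column count.
    """
    return {
        col: len({row[col] for row in rows})
        for col in categorical_columns
    }


def pick_n_components(candidates, score_fn):
    """Grid search over component counts.

    score_fn(n) is a hypothetical callback returning
    (train_score, val_score) for n components.
    Returns the n with the best validation score and all results.
    """
    results = []
    for n in candidates:
        train_score, val_score = score_fn(n)
        results.append((n, train_score, val_score))
    # max() keeps the first best entry, so ties go to the earlier candidate.
    best_n = max(results, key=lambda r: r[2])[0]
    return best_n, results


rows = [
    {"color": "red", "size": "S"},
    {"color": "blue", "size": "M"},
    {"color": "red", "size": "L"},
]
widths = one_hot_widths(rows, ["color", "size"])
print(widths, sum(widths.values()))  # {'color': 2, 'size': 3} 5
```

Keeping the train score next to the validation score in `results` makes it easy to spot candidates that overfit, which is the comparison described above.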

jbofill10 commented 4 years ago

Updated, with the PCA query issues fixed.