I left it at 6 because I noticed a flaw in my idea of tracking the categorical columns. Instead, unless there are already a large number of them, we should go through those columns and check the number of unique values in each to better understand how the dimensions will grow after one-hot encoding. As for the components, this is not like PCA, where you can select the amount of variance to keep, so really the only way to find the optimal n for components is a grid search comparing train and validation scores. If this is something we want to pursue, I can add those features as well. For now this is a rough idea, and I'll close the issue now or later depending on whether we want to add none, some, or all of these ideas.
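To make the two ideas concrete, here is a rough sketch of what I have in mind, not a final implementation. The names `one_hot_width`, `select_n_components`, `max_columns`, and `fit_and_score` are all hypothetical, and the cutoff of 50 columns is an arbitrary placeholder. The width estimate assumes one indicator column per unique value with no level dropped, and `fit_and_score` stands in for whatever fits the model with `n` components and returns a `(train_score, validation_score)` pair.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Sequence, Tuple


def one_hot_width(columns: Dict[str, Sequence[Hashable]],
                  max_columns: int = 50) -> Dict[str, int]:
    """Count the unique values in each categorical column.

    Returns an empty dict when there are more than `max_columns` columns,
    so the scan is skipped if the column count is already large.
    `max_columns` is a hypothetical threshold, not a tuned value.
    """
    if len(columns) > max_columns:
        return {}
    # Assumption: one-hot encoding adds one column per unique value.
    return {name: len(set(values)) for name, values in columns.items()}


def select_n_components(
    candidates: Iterable[int],
    fit_and_score: Callable[[int], Tuple[float, float]],
) -> Tuple[int, List[Tuple[int, float, float]]]:
    """Try each candidate n and keep the one with the best validation score.

    `fit_and_score` is a hypothetical callback that fits with `n` components
    and returns (train_score, validation_score), higher being better.
    """
    # Keep every (n, train, validation) triple so the gap can be inspected.
    results = [(n, *fit_and_score(n)) for n in candidates]
    best_n = max(results, key=lambda row: row[2])[0]
    return best_n, results


if __name__ == "__main__":
    toy_columns = {"color": ["red", "blue", "red"], "size": ["S", "M", "L", "S"]}
    widths = one_hot_width(toy_columns)
    print(widths, "total new columns:", sum(widths.values()))
```

Summing the per-column counts gives an estimate of how many columns one-hot encoding would add. With a pandas DataFrame, `df[col].nunique()` gives the same per-column count. The `results` list keeps the train score next to the validation score, so a large gap between the two at a given n is easy to spot.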