EHWUSF / HS68_2018_Project_1


Feature Proposal of Feature Selection/Extraction #4

Open douglas-yao opened 6 years ago

douglas-yao commented 6 years ago

Feature selection is a very important part of analyzing a dataset. Weeding out unnecessary noise in a dataset will greatly improve the model that is ultimately built on it. I propose a feature focused on feature selection/extraction, beginning with more inclusive methods (so as not to eliminate important features of a dataset) and leading into careful elimination of unnecessary or uncorrelated variables. This will be a series of functions comprising a feature extraction module.

This feature will read in a NumPy array produced by a data-cleaning feature, starting with inclusive/ranking methods such as decision trees, and proceed to more exclusionary methods such as PCA and/or recursive feature elimination. The output will be a list of ranked features to include in the model, as well as a list of excluded features.
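A minimal sketch of what that pipeline could look like with scikit-learn, assuming a cleaned NumPy feature array `X` and target `y` (the synthetic dataset and parameter choices here are placeholders, not part of the proposal):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Stand-in for the cleaned array handed off by the data-cleaning feature.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       random_state=0)

# Inclusive step: rank every feature by tree-based importance
# without dropping anything yet.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]

# Exclusionary step: recursive feature elimination down to k features.
rfe = RFE(RandomForestRegressor(n_estimators=50, random_state=0),
          n_features_to_select=4).fit(X, y)
included = [i for i in range(X.shape[1]) if rfe.support_[i]]
excluded = [i for i in range(X.shape[1]) if not rfe.support_[i]]

print("ranked (best first):", ranking.tolist())
print("included:", included)
print("excluded:", excluded)
```

The two outputs mirror the proposal: a full importance ranking from the inclusive pass, plus included/excluded lists from the exclusionary pass.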

Most tools / functions / modules used will derive from the scikit-learn package.

omidkj commented 6 years ago

How do you want to apply the "more inclusive methods" part of this? How do you want to decide that a feature is important without being able to apply domain knowledge?

rohitchadaram commented 6 years ago

The important challenge in this proposal is the need to understand the type of data: is it mixed data (numerical and categorical), only categorical, or only numerical? How would you ensure that the feature selection works for all these cases?
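One hedged way to handle the mixed-data concern is to split columns by dtype up front so each group can get an appropriate selection method; the helper name here is hypothetical, not something from the proposal:

```python
import numpy as np
import pandas as pd

def split_by_dtype(df: pd.DataFrame):
    """Separate numeric from categorical columns so each group can be
    routed to a suitable method (e.g. importance/correlation for
    numerics, mutual information or chi-squared for categoricals)."""
    numeric = df.select_dtypes(include=[np.number]).columns.tolist()
    categorical = df.select_dtypes(exclude=[np.number]).columns.tolist()
    return numeric, categorical

# Tiny illustrative frame with mixed column types.
df = pd.DataFrame({"age": [25, 32, 47],
                   "city": ["NYC", "LA", "SF"],
                   "income": [50.0, 64.5, 80.2]})
num_cols, cat_cols = split_by_dtype(df)
print(num_cols, cat_cols)  # ['age', 'income'] ['city']
```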

douglas-yao commented 6 years ago

Omid has a good point about involving domain knowledge; I suppose it depends on how automated the feature selection will be. Rohit, I agree with the challenge involving the type of data. I see feature/model selection being based on many conditionals, but this could get a bit verbose.

haleyhowe commented 6 years ago

I agree with Omid; there must be some sort of domain knowledge involved in this. Maybe involve the user in the process? Prompting the user throughout may be helpful, as you could get a more accurate result. I think setting up one main program to do all this would ultimately be awesome; however, there is some crucial decision making throughout the process that should not be ignored.

douglas-yao commented 5 years ago

Originally, this idea was meant for more than just linear regression. Since it looks like many important items directly related to linear regression are already covered, I'm pivoting my feature to give the user an overview of the original dataset, using clustering (to understand potential grouping of the data) and a random forest (to provide a ranking of the features' importance). The more we understand the original dataset, the better we can discern whether our model is potentially over- or underfit. A combination of random forest and clustering is a bit odd, so perhaps I'll stick to clustering, but we'll see how it pans out.
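The pivoted idea could be sketched like this: cluster the raw features to surface grouping structure, and fit a random forest to rank feature importance. The iris dataset and parameter choices below are placeholders for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Overview step 1: cluster the raw features to expose potential
# grouping of the data, independent of any target.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Overview step 2: rank features by random-forest importance
# to show the user which columns carry the most signal.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
order = np.argsort(forest.feature_importances_)[::-1]

print("cluster sizes:", np.bincount(labels).tolist())
print("feature ranking (best first):", order.tolist())
```

Together, the cluster sizes and the importance ranking give the kind of dataset overview described above before any model is committed to.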