Open omidkj opened 6 years ago
You have covered a wide range of features; I am quite interested in the first feature you proposed. Finding a way to figure out the weight of missing values, assessing them right off the bat, and taking care of them before analyzing the data would make the algorithm quite useful and worthwhile.
I like the idea of an automated way to find the right threshold for each dataset. However, some features in our datasets are really important, and we don't want to remove them even with a lot of missing values. This proposed method only lists the features with the most missing values; after reviewing them and applying domain knowledge, we can decide whether we want to keep or remove them.
Finding features with most missing values: how do you plan to arrive at a certain threshold value (e.g., 0.45)? I really like all the features you plan to implement, especially this one: finding any features that have a single unique value.
In my opinion it's hard to come up with the normalized importance value, as the importance values vary from negative to positive in some cases and are only positive in others. So how do you plan to make the normalized value work for both cases?
Doesn't normalization account for negative and positive values, placing all values in the dataset between 0 and 1?
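For example, min-max scaling maps any input range, including negative values, onto [0, 1]. A quick numpy sketch (the function name is mine, just for illustration):

```python
import numpy as np

def min_max_normalize(values):
    """Rescale an array to [0, 1]; works for negative inputs too."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:  # all values identical; avoid division by zero
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

# Importance scores that mix negative and positive values:
scores = np.array([-0.2, 0.0, 0.3, 0.8])
print(min_max_normalize(scores))  # smallest -> 0.0, largest -> 1.0
```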
I think this is a great idea. Maybe instead of setting a threshold, it would simply output the percentage of missing values and let the user choose from there? Or the user could input a threshold based on their knowledge of how their specific dataset would value that percentage of missing values. This could be a simple program that performs all of the output we need in terms of metrics on the variables, simplifying the process of exploring each variable.
@rohitchadaram I believe choosing a certain threshold is a team decision based on the nature of the dataset they're working on. That's why this value is passed as a parameter and not set automatically in the program. As @douglas-yao mentioned, the normalized importance value is between 0 and 1.
@haleyhowe That's exactly what this method is supposed to do: receive a percentage (threshold) and output a list of features whose percentage of missing values is greater than the threshold we set earlier. Then we can decide which feature(s) to eliminate and which to keep by applying domain knowledge. Here we can create another method called 'remove_uf' that receives a list of feature names and the dataset and returns a new dataset with those features removed.
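A minimal sketch of what that 'remove_uf' helper could look like with pandas (the signature is my assumption; the method isn't implemented yet):

```python
import pandas as pd

def remove_uf(data, feature_names):
    """Return a copy of the dataset with the given (unwanted) features dropped."""
    return data.drop(columns=list(feature_names))

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
cleaned = remove_uf(df, ["b"])
print(list(cleaned.columns))  # ['a', 'c']; the original df is untouched
```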
I really like your idea, especially "finding highly correlated features". Could we also use the Spearman method? Pearson only works for linear relationships between variables (predictors).
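Spearman is just Pearson computed on ranks, so it catches any monotonic relationship. A quick comparison with pandas (toy data, just to show the difference):

```python
import pandas as pd

# y grows monotonically but non-linearly with x
df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [1, 8, 27, 64, 125]})  # y = x**3

pearson = df["x"].corr(df["y"], method="pearson")
spearman = df["x"].corr(df["y"], method="spearman")
# Pearson is well below 1 (non-linear), Spearman is 1 (monotonic)
print(round(pearson, 3), round(spearman, 3))
```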
@RoxanneXin That's a good plan!
The first step in the process of finding and selecting the most useful features in a dataset is finding the unimportant features and removing them from the dataset to increase training speed and model interpretability. For this tool we can develop some of the following methods:
Finding features with most missing values: This method receives an ndarray and a specified threshold (between 0 and 1) for missing values. For instance, a threshold of 0.45 means find features with more than 45% missing values.
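This could be sketched with numpy like so (function name and the returned (index, fraction) pairs are my assumptions):

```python
import numpy as np

def features_with_missing(data, threshold=0.45):
    """Return (column index, missing fraction) for columns whose NaN fraction exceeds threshold."""
    missing_frac = np.isnan(data).mean(axis=0)
    return [(col, frac) for col, frac in enumerate(missing_frac) if frac > threshold]

X = np.array([[1.0, np.nan, 3.0],
              [np.nan, np.nan, 6.0],
              [7.0, np.nan, np.nan]])
print(features_with_missing(X, 0.45))  # column 1 is 100% missing -> [(1, 1.0)]
```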
Finding any features that have a single unique value: This method finds any features whose column contains only one unique value.
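A possible implementation with numpy (again, the function name is just a placeholder):

```python
import numpy as np

def single_value_features(data):
    """Return indices of columns that contain only a single unique value."""
    return [col for col in range(data.shape[1])
            if np.unique(data[:, col]).size == 1]

X = np.array([[1, 5, 0],
              [2, 5, 0],
              [3, 5, 0]])
print(single_value_features(X))  # columns 1 and 2 are constant -> [1, 2]
```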
Finding highly correlated features: This method receives an ndarray and a specified threshold for the correlation coefficient as parameters. It uses the Pearson correlation and returns features whose correlation coefficient is greater than the specified value.
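One way this could look with numpy's built-in Pearson correlation (returning index pairs is my assumption; the absolute value catches strong negative correlations too):

```python
import numpy as np

def correlated_features(data, threshold=0.9):
    """Return column index pairs whose absolute Pearson correlation exceeds threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    n = corr.shape[1]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(corr[i, j]) > threshold]

X = np.array([[1.0, 2.0, 5.0],
              [2.0, 4.1, 1.0],
              [3.0, 6.0, 4.0],
              [4.0, 8.2, 2.0]])
print(correlated_features(X, 0.9))  # columns 0 and 1 are nearly collinear -> [(0, 1)]
```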
Finding low-importance features based on random forest importance results: This method receives an ndarray and a specified threshold for the normalized importance value (between 0 and 1). For instance, a threshold of 0.45 means find features whose normalized importance value is below 0.45.
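A sketch of the random forest method using scikit-learn; the helper name and the divide-by-max normalization (so the most important feature scores 1.0) are my assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def low_importance_features(X, y, threshold=0.45, random_state=0):
    """Return indices of features whose normalized (0-1) importance is below threshold."""
    model = RandomForestClassifier(n_estimators=50, random_state=random_state)
    model.fit(X, y)
    imp = model.feature_importances_
    normalized = imp / imp.max()  # assumption: scale so the top feature is 1.0
    return [col for col, v in enumerate(normalized) if v < threshold]

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = np.column_stack([y.astype(float),   # column 0: perfectly informative
                     np.ones(200)])     # column 1: constant, never used for splits
print(low_importance_features(X, y, 0.45))  # the constant column -> [1]
```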