eubr-bigsea / citrus

Apache License 2.0

Add Chi-Square Feature Selection #133

Open alexgcsa opened 5 years ago

alexgcsa commented 5 years ago

Hi,

We need a feature selection method that is not manual. Gisele recommended this one:

https://spark.apache.org/docs/2.2.0/ml-features.html#chisqselector

The issue (mentioned by @waltersf) is that this feature selection method only accepts numerical features.

An alternative is to raise an error if the feature selection method receives categorical features, recommending that they be transformed into numerical features:

"If your features represent words in a text, try the Count term frequency operation (with the type Count term frequency or Map the sequence of terms to their TF using hashing). Otherwise, i.e., if your features do not represent words in a text, try the One-hot encoder operation."
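A rough sketch of how such a check could look, assuming the features arrive as a pandas DataFrame. The helper name `check_numeric_features`, the constant `NON_NUMERIC_MESSAGE`, and the toy columns are hypothetical and not part of the project; `pd.get_dummies` stands in for the One-hot encoder operation.

```python
import pandas as pd

# Message text adapted from the wording proposed in this issue.
NON_NUMERIC_MESSAGE = (
    "Chi-square feature selection only accepts numerical features. "
    "If your features represent words in a text, try the Count term "
    "frequency operation. Otherwise, try the One-hot encoder operation."
)


def check_numeric_features(features: pd.DataFrame) -> None:
    """Raise ValueError if any column is not numeric (hypothetical helper)."""
    non_numeric = features.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise ValueError(f"{NON_NUMERIC_MESSAGE} Non-numeric columns: {non_numeric}")


# Toy data: one numeric column and one categorical column.
toy = pd.DataFrame({"size": [1.0, 2.5, 3.0], "color": ["red", "blue", "red"]})

try:
    check_numeric_features(toy)
except ValueError as error:
    print(error)

# One-hot encoding the categorical column (as integers) makes the check pass.
encoded = pd.get_dummies(toy, columns=["color"], dtype=int)
check_numeric_features(encoded)
```

Listing the offending columns in the error should make it easier for the user to see which inputs need to be transformed first.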

waltersf commented 5 years ago

We need a working example using Spark or scikit-learn.

alexgcsa commented 5 years ago

Hi,

Would the following example be enough?


Example (scikit-learn):

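from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

# Load the digits dataset: 1797 samples with 64 non-negative pixel features.
X, y = load_digits(return_X_y=True)
print(X.shape)  # (1797, 64)

# Keep the 20 features with the highest chi-square scores.
X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
print(X_new.shape)  # (1797, 20)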

Source: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
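For the operation itself it may also help to know which features were kept, not only the reduced matrix. A minimal sketch on the same digits data, using the fitted selector's `get_support` method and `scores_` attribute; the variable names are only illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)

# Fit once, then reuse the fitted selector for inspection and transformation.
selector = SelectKBest(chi2, k=20).fit(X, y)

# Positions of the columns that were kept, and their chi-square scores.
selected_indices = selector.get_support(indices=True)
selected_scores = selector.scores_[selected_indices]

X_new = selector.transform(X)
print(X_new.shape)
print(selected_indices)
```

Fitting and transforming in two steps gives the same reduced matrix as `fit_transform`, while keeping the selector available for reporting.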

alexgcsa commented 5 years ago

I will try to use the scene dataset in my example:

scene.zip

It has 299 attributes, 2407 instances and two classes (binary: 0 for not urban / 1 for urban). It can be found on OpenML as well:

https://www.openml.org/d/312

alexgcsa commented 5 years ago

Example (from scikit-learn) using the scene dataset:


import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Load the scene dataset; the first row holds the column names.
input_file = "scene.csv"
dataset = pd.read_csv(input_file, header=0)

# Separate the features from the "class" target column.
X = dataset.drop(columns="class")
y = dataset["class"]
print(X.shape)

# Keep the 20 features with the highest chi-square scores.
X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
print(X_new.shape)
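Since `scene.csv` is not reproduced here, the sketch below uses a small synthetic DataFrame as a stand-in (the column names `signal_a`, `noise_a`, etc. are made up) to show how the names of the selected columns could be recovered. It also checks for negative values first, because scikit-learn's `chi2` only accepts non-negative features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for scene.csv: two columns that depend on the class
# and two columns of pure noise. All values are non-negative.
rng = np.random.default_rng(0)
n_rows = 200
labels = rng.integers(0, 2, size=n_rows)

dataset = pd.DataFrame({
    "signal_a": 0.8 * labels + rng.uniform(0.0, 0.2, size=n_rows),
    "noise_a": rng.uniform(0.0, 1.0, size=n_rows),
    "signal_b": 0.8 * (1 - labels) + rng.uniform(0.0, 0.2, size=n_rows),
    "noise_b": rng.uniform(0.0, 1.0, size=n_rows),
    "class": labels,
})

X = dataset.drop(columns="class")
y = dataset["class"]

# chi2 rejects negative inputs, so fail early with a clear message.
if (X < 0).any().any():
    raise ValueError("chi2 requires non-negative feature values")

# Fit the selector and map the boolean support mask back to column names.
selector = SelectKBest(chi2, k=2).fit(X, y)
selected_columns = X.columns[selector.get_support()].tolist()
print(selected_columns)
```

With the real dataset, the same `X.columns[selector.get_support()]` lookup should work unchanged after `k` is set to the desired number of features.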

alexgcsa commented 5 years ago

@waltersf @zilton