Open alexgcsa opened 5 years ago
We need a working example using Spark or scikit-learn.
Hi,
Would the following example be enough?
Example (scikit-learn):
import sklearn
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)
X.shape

X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
X_new.shape
Source: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
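If it helps, a small extension of the snippet above shows which columns were kept. This is only a sketch: it keeps a reference to the fitted selector (the variable names `selector` and `selected` are my own) and calls `get_support(indices=True)`, which returns the indices of the selected features.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

# Digits dataset: 1797 samples, 64 non-negative pixel features
X, y = load_digits(return_X_y=True)

# Keep the fitted selector so it can be inspected afterwards
selector = SelectKBest(chi2, k=20)
X_new = selector.fit_transform(X, y)

# Indices of the 20 columns that were kept
selected = selector.get_support(indices=True)

print(X.shape, X_new.shape)
print(selected)
```

Keeping the selector object around, instead of chaining `fit_transform` directly, is what makes it possible to report the chosen features back to the user.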
I will try to use the scene dataset in my example:
It has 299 attributes, 2407 instances and two classes (binary: 0 for not urban / 1 for urban). It can be found on OpenML as well:
Example (from scikit-learn) using the dataset scene:
import numpy as np
import pandas as pd
import sklearn
from sklearn.feature_selection import SelectKBest, chi2

input_file = "scene.csv"
dataset = pd.read_csv(input_file, header=0)

X = dataset.loc[:, dataset.columns != 'class']
y = dataset['class']
X.shape

X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
X_new.shape
@waltersf @zilton
Hi,
We must have a feature selection method that is not manual. Gisele recommended this one:
https://spark.apache.org/docs/2.2.0/ml-features.html#chisqselector
The issue (mentioned by @waltersf) is that this feature selection method only accepts numerical features.
An alternative is to output an error if the feature selection method receives categorical features, recommending that they be transformed into numerical features:
"If your features represent words in a text, try to use the Count term frequency operation (with the type Count term frequency or Map the sequence of terms to their TF using hashing). Otherwise, i.e., if your features do not represent words in a text, try to use the One-hot encoder operation".
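A rough sketch of how that check could look, in plain Python so it does not depend on Spark. Everything here is hypothetical: the function names `find_non_numeric_columns` and `check_numeric_features`, the rows-of-tuples input format, and the shortened message text are assumptions for illustration, not the platform's actual API.

```python
# Hypothetical message, condensed from the wording proposed in this thread
ERROR_MESSAGE = (
    "This feature selection method only accepts numerical features. "
    "If your features represent words in a text, try the Count term "
    "frequency operation. Otherwise, try the One-hot encoder operation."
)


def find_non_numeric_columns(rows, columns):
    """Return the names of columns holding at least one non-numeric value."""
    bad = []
    for index, name in enumerate(columns):
        for row in rows:
            if not isinstance(row[index], (int, float)):
                bad.append(name)
                break  # one offending value is enough for this column
    return bad


def check_numeric_features(rows, columns):
    """Raise ValueError if any feature column is not numerical."""
    bad = find_non_numeric_columns(rows, columns)
    if bad:
        raise ValueError(f"{ERROR_MESSAGE} Offending columns: {', '.join(bad)}")


# Example: "color" is categorical, so the check rejects the input
columns = ["area", "color"]
rows = [(10.5, "red"), (3.0, "blue")]
try:
    check_numeric_features(rows, columns)
except ValueError as err:
    print(err)
```

Running the check before the selector is fitted means the user sees the recommendation instead of a lower-level type error from the selector itself.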