MarcelRobeer / ContrastiveExplanation

Contrastive Explanation (Foil Trees), developed at TNO/Utrecht University
BSD 3-Clause "New" or "Revised" License

lightgbm categorical feature support #8

Closed arsine1996 closed 3 years ago

arsine1996 commented 4 years ago

Does the model support categorical feature types for LightGBM? I got an error when running it with specified categorical features.

MarcelRobeer commented 4 years ago

It should be able to work with categorical features. Do you have a minimal working example to reproduce your error?

arsine1996 commented 4 years ago

Yes, sure, I really appreciate your assistance. Here is a sample of my data:

Age BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
29 Travel_Rarely 592 Research & Development 7 3 Life Sciences 1 1883 4 Male 59 3 1 Laboratory Technician 1 Single 2062 19384 3 Y No 14 3 2 80 0 11 2 3 3 2 1 2
36 Travel_Rarely 884 Sales 1 4 Life Sciences 1 1585 2 Female 73 3 2 Sales Executive 3 Single 6815 21447 6 Y No 13 3 1 80 0 15 5 3 1 0 0 0
34 Travel_Rarely 1326 Sales 3 3 Other 1 1478 4

I ran a simple LightGBM model:

import contrastive_explanation as ce
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

# make data categorical
X[X.select_dtypes(include="object").columns.tolist()] = X.select_dtypes(include="object").astype('category')

X0, X1, Y0, Y1 = train_test_split(X, Y, test_size=0.25, random_state=42)

# fit model
model = LGBMClassifier(random_state=42, max_depth=2, n_estimators=200, boosting_type='dart')
model.fit(X0, Y0)

dm = ce.domain_mappers.DomainMapperTabular(X0.values, feature_names=X0.columns.tolist(), contrast_names=['0','1'], seed=42)
exp = ce.ContrastiveExplanation(dm, verbose=True)

and got the following error:

ValueError Traceback (most recent call last)

in ()
      1 dm = ce.domain_mappers.DomainMapperTabular(X0.values,
      2 feature_names=X0.columns.tolist(),
----> 3 contrast_names=['0','1'], seed=42)
      4 exp = ce.ContrastiveExplanation(dm, verbose=True )

5 frames
/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86
     87

ValueError: could not convert string to float: 'Travel_Rarely'

I would appreciate your help in solving this.

MarcelRobeer commented 3 years ago

ContrastiveExplanation cannot automatically infer which columns in your data are categorical unless the data is a pandas DataFrame. You can either pass the names/indices of the categorical variables to DomainMapperTabular via the categorical_features argument, or replace DomainMapperTabular with DomainMapperPandas (which automatically infers the feature names as well as which of them are categorical).
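To make the reported ValueError concrete, here is a minimal sketch that uses only pandas and NumPy, not ContrastiveExplanation itself. It assumes a toy frame built from three columns of the sample above, and shows that `.values` on a mixed frame yields an object array whose strings cannot be cast to float. It also shows that columns converted with `astype('category')` no longer have the `object` dtype, so both dtypes need to be selected when collecting categorical column names.

```python
import numpy as np
import pandas as pd

# Toy frame using three columns from the sample data (assumption: a reduced
# stand-in for the full data set, for illustration only).
df = pd.DataFrame({
    "Age": [29, 36, 34],
    "BusinessTravel": ["Travel_Rarely", "Travel_Rarely", "Travel_Rarely"],
    "DailyRate": [592, 884, 1326],
})
df["BusinessTravel"] = df["BusinessTravel"].astype("category")

# Mixed numeric/categorical columns collapse into a single object-dtype array.
raw = df.values

# Casting that array to float fails on the first string it meets.
try:
    np.asarray(raw, dtype=float)
    conversion_error = None
except ValueError as err:
    conversion_error = str(err)

# After astype('category') the column is no longer 'object', so select both.
cat_cols = df.select_dtypes(include=["category", "object"]).columns.tolist()
only_object = df.select_dtypes(include="object").columns.tolist()

print(conversion_error)
print(cat_cols, only_object)
```

A list such as `cat_cols` is what would be handed to the `categorical_features` argument mentioned above.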

arsine1996 commented 3 years ago

Thanks for the suggestion. I tried adding the categorical columns, but it still didn't work.

# make data categorical

X[X.select_dtypes(include="object").columns.tolist()] = X.select_dtypes(include="object").astype('category')
X0, X1, Y0, Y1 = train_test_split(X, Y, test_size=0.25, random_state=42)

# columns were converted to 'category' above, so select that dtype as well
cat_name = X.select_dtypes(include=["category", "object"]).columns.tolist()
encoder = category_encoders.OrdinalEncoder(cols=cat_name)
encoder.fit(X0, Y0)

X0_encoded = encoder.transform(X0)
X1_encoded = encoder.transform(X1)

model=LGBMClassifier(random_state=42, boosting_type='dart') 
model.fit(X0_encoded, Y0, categorical_feature=cat_name)

sample = X0_encoded.iloc[0,:]

dm = ce.domain_mappers.DomainMapperTabular(X0_encoded.values,  feature_names=X0_encoded.columns.tolist(), 
                                        contrast_names=['0','1'], seed=42, categorical_features=cat_name)
exp = ce.ContrastiveExplanation(dm, verbose=True,  seed=42)

IndexError Traceback (most recent call last)

in ()
      1 dm = ce.domain_mappers.DomainMapperTabular(X0_encoded.values, feature_names=X0_encoded.columns.tolist(),
----> 2 contrast_names=['0','1'], seed=42, categorical_features=cat_name)
      3 exp = ce.ContrastiveExplanation(dm, verbose=True, seed=42)

1 frames
/content/ContrastiveExplanation/contrastive_explanation/domain_mappers.py in _one_hot_encode(self, data)
    239 self.unique_vals = dict()
    240 for column in self.categorical_features:
--> 241 self.unique_vals[column] = set(data[:, column])
    242
    243 # Create encoders

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

MarcelRobeer commented 3 years ago

The previous version assumed that the categorical features were given as column indices (0, 1, 5, etc.) rather than as feature names ('BusinessTravel', 'Department'). This should be fixed now.

cat_cols = X.select_dtypes(include=['category', 'object']).columns.tolist()
dm = ce.domain_mappers.DomainMapperTabular(X0.values,
                                           feature_names=X0.columns.tolist(),
                                           contrast_names=['0','1'],
                                           seed=42,
                                           categorical_features=cat_cols)

should work as well as

dm = ce.domain_mappers.DomainMapperPandas(X0, contrast_names=['0', '1'], seed=42)

(the latter automatically infers the feature names and which of them are categorical for you)
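For readers wondering why feature names triggered the IndexError in the earlier traceback, the following is a rough sketch in plain NumPy. It is not the library's code: the array, the ordinal codes, and the names `categorical_names` and `categorical_indices` are illustrative assumptions. It shows that a NumPy array cannot be indexed by a column name, and how names could be translated to positional indices for code that expects integers.

```python
import numpy as np

# Illustrative feature names and ordinal-encoded rows (the codes are made up).
feature_names = ["Age", "BusinessTravel", "DailyRate", "Department"]
data = np.array([
    [29, 1, 592, 1],
    [36, 1, 884, 2],
    [34, 1, 1326, 2],
])

categorical_names = ["BusinessTravel", "Department"]

# Indexing a NumPy array with a string reproduces the kind of error reported.
try:
    data[:, "BusinessTravel"]
    index_error = None
except IndexError as err:
    index_error = str(err)

# Translate each feature name to its column position.
categorical_indices = [feature_names.index(name) for name in categorical_names]

# With integer indices, collecting the unique values per column works.
unique_vals = {i: set(data[:, i].tolist()) for i in categorical_indices}

print(index_error)
print(categorical_indices, unique_vals)
```

With the fix described above this translation should no longer be needed by hand, since feature names are accepted directly.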