arsine1996 commented 4 years ago

Does the model support categorical feature types for LightGBM? I got an error when running it with categorical features specified.

MarcelRobeer commented 4 years ago

It should be able to work with categorical features. Do you have a minimal working example to reproduce your error?

arsine1996 commented 4 years ago

Yes, sure, and I really appreciate your assistance. Here is a sample of my data, on which I ran a simple LightGBM model:

` Age	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobRole	JobSatisfaction	MaritalStatus	MonthlyIncome	MonthlyRate	NumCompaniesWorked	Over18	OverTime	PercentSalaryHike	PerformanceRating	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
29	Travel_Rarely	592	Research & Development	7	3	Life Sciences	1	1883	4	Male	59	3	1	Laboratory Technician	1	Single	2062	19384	3	Y	No	14	3	2	80	0	11	2	3	3	2	1	2
36	Travel_Rarely	884	Sales	1	4	Life Sciences	1	1585	2	Female	73	3	2	Sales Executive	3	Single	6815	21447	6	Y	No	13	3	1	80	0	15	5	3	1	0	0	0
34	Travel_Rarely	1326	Sales	3	3	Other	1	1478	4

# make data categorical
X[X.select_dtypes(include="object").columns.tolist()] = X.select_dtypes(include="object").astype('category')

X0, X1, Y0, Y1 = train_test_split(X, Y, test_size=0.25, random_state=42)

# fit model
model = LGBMClassifier(random_state=42, max_depth=2, n_estimators=200, boosting_type='dart')
model.fit(X0, Y0)

dm = ce.domain_mappers.DomainMapperTabular(X0.values, feature_names=X0.columns.tolist(), contrast_names=['0','1'], seed=42)
exp = ce.ContrastiveExplanation(dm, verbose=True)
`

and got the following error:

ValueError Traceback (most recent call last)

in ()
      1 dm = ce.domain_mappers.DomainMapperTabular(X0.values,
      2                                            feature_names=X0.columns.tolist(),
----> 3                                            contrast_names=['0','1'], seed=42)
      4 exp = ce.ContrastiveExplanation(dm, verbose=True )

5 frames
/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86
     87

ValueError: could not convert string to float: 'Travel_Rarely'

I would appreciate your help in solving this.

MarcelRobeer commented 3 years ago

ContrastiveExplanation is unable to automatically infer what the categorical columns in your data are, except when the data is a Pandas Dataframe. You should either specify the names/indices of the categorical variables for a DomainMapperTabular with the categorical_features argument, or you can try replacing the DomainMapperTabular with a DomainMapperPandas (which automatically infers the feature names as well as which of them are categorical).
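As a rough illustration of why the error above occurs, here is a minimal sketch using only NumPy and pandas. It does not reproduce the library's internals, and the two-row frame is made up for the example. It shows that `.values` on a frame with a categorical column still contains strings, that casting it to float raises the reported `ValueError`, and how the names and positions of the categorical columns can be collected.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, loosely modelled on the sample in this thread.
df = pd.DataFrame({
    "Age": [29, 36],
    "BusinessTravel": ["Travel_Rarely", "Travel_Rarely"],
})
df["BusinessTravel"] = df["BusinessTravel"].astype("category")

# .values on mixed numeric/categorical columns gives an object array
# that still holds the raw strings.
raw = df.values
assert raw.dtype == object

# Casting that array to float fails on the first string it meets.
try:
    np.asarray(raw, dtype=float)
    error_message = None
except ValueError as err:
    error_message = str(err)

# Names and positions of the categorical columns, which is the
# information a tabular mapper cannot infer from a bare array.
categorical_names = df.select_dtypes(include="category").columns.tolist()
categorical_indices = [df.columns.get_loc(name) for name in categorical_names]
```

Either list is the kind of information that has to be supplied explicitly once the DataFrame has been reduced to a plain array.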

arsine1996 commented 3 years ago

Thanks for the suggestion. I tried adding the categorical columns, but it still didn't work.

# make data categorical

X[X.select_dtypes(include="object").columns.tolist()] = X.select_dtypes(include="object").astype('category')
X0, X1, Y0, Y1 = train_test_split(X, Y, test_size=0.25, random_state=42)

cat_name = X.select_dtypes(include="object").columns.tolist()
encoder = category_encoders.OrdinalEncoder(cols=cat_name)
encoder.fit(X0, Y0)

X0_encoded = encoder.transform(X0)
X1_encoded = encoder.transform(X1)

model=LGBMClassifier(random_state=42, boosting_type='dart') 
model.fit(X0_encoded, Y0, categorical_feature=cat_name)

sample = X0_encoded.iloc[0,:]

dm = ce.domain_mappers.DomainMapperTabular(X0_encoded.values,  feature_names=X0_encoded.columns.tolist(), 
                                        contrast_names=['0','1'], seed=42, categorical_features=cat_name)
exp = ce.ContrastiveExplanation(dm, verbose=True,  seed=42)

IndexError Traceback (most recent call last)

in ()
      1 dm = ce.domain_mappers.DomainMapperTabular(X0_encoded.values, feature_names=X0_encoded.columns.tolist(),
----> 2                                         contrast_names=['0','1'], seed=42, categorical_features=cat_name)
      3 exp = ce.ContrastiveExplanation(dm, verbose=True, seed=42)

1 frames
/content/ContrastiveExplanation/contrastive_explanation/domain_mappers.py in _one_hot_encode(self, data)
    239         self.unique_vals = dict()
    240         for column in self.categorical_features:
--> 241             self.unique_vals[column] = set(data[:, column])
    242
    243         # Create encoders

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

MarcelRobeer commented 3 years ago

The previous version assumed that the categorical features were given as indices (0, 1, 5, etc.) rather than as feature names ('BusinessTravel', 'Department'). This should be fixed now.

cat_cols = X.select_dtypes(include=['category', 'object']).columns.tolist()
dm = ce.domain_mappers.DomainMapperTabular(X0.values,
                                           feature_names=X0.columns.tolist(),
                                           contrast_names=['0','1'],
                                           seed=42,
                                           categorical_features=cat_cols)

should work as well as

dm = ce.domain_mappers.DomainMapperPandas(X0, contrast_names=['0', '1'], seed=42)

(the latter automatically infers feature names and categorical names for you)
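To make the index-versus-name issue described above concrete, here is a minimal sketch in plain NumPy. It is not the actual fix in the library: `to_indices` is a hypothetical helper, and the small array is invented for the example. It shows that indexing an array column by name raises the same `IndexError` as in the traceback, and one way names could be translated to positions first.

```python
import numpy as np

# Invented example data; column order matches feature_names.
feature_names = ["Age", "BusinessTravel", "Department"]
data = np.array(
    [[29, "Travel_Rarely", "Research & Development"],
     [36, "Travel_Rarely", "Sales"]],
    dtype=object,
)


def to_indices(categorical_features, feature_names):
    """Hypothetical helper: accept names or integer positions, return positions."""
    return [
        feature_names.index(f) if isinstance(f, str) else int(f)
        for f in categorical_features
    ]


# Indexing a NumPy column with a string fails, as in the traceback.
try:
    data[:, "BusinessTravel"]
    indexing_error = None
except IndexError as err:
    indexing_error = err

# After translating names to positions, the per-column lookup works.
cat_idx = to_indices(["BusinessTravel", 2], feature_names)
unique_vals = {i: set(data[:, i]) for i in cat_idx}
```

Accepting both names and positions in one helper keeps older index-based calls working while allowing the name-based form used in this thread.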

MarcelRobeer / ContrastiveExplanation

lightgbm categorical feature support #8
