Closed: raam93 closed this issue 5 years ago.
I have added initial support for multi-valued categorical features (mapping to a one-hot encoder and back in DomainMapperTabular). Typically this is already done as it is required by the predictor, so could you please indicate which package you are using to directly get predictions for categorical features?
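(Not the library's internal code, just a minimal sketch of the general idea, assuming scikit-learn's OneHotEncoder: label-encoded categorical columns are mapped to a one-hot representation for the explanator and mapped back to the original level codes for the textual explanation.)
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_cat = np.array([[0, 2], [1, 0], [2, 1]])   # two label-encoded categorical columns
ohe = OneHotEncoder().fit(X_cat)             # map the levels to one-hot columns ...
X_onehot = ohe.transform(X_cat).toarray()
X_back = ohe.inverse_transform(X_onehot)     # ... and back to the original level codes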
Thanks for the reply!
I worked with your updated library - now I get outputs like this:
"The model predicted '>50k' instead of '<=50k' because '34 > 0.954 and 59 <= 4993.447 and 79 <= 0.046'"
So I removed the 'fnlwgt' and 'education-num' features from the adult income data, label encoded the data, and fed it to your library.
import pandas as pd

df = pd.read_csv('adult_income.csv')
del df['fnlwgt']
del df['education-num']
df_le, label_encoder = label_encode(df, discrete)  # 'discrete' lists the discrete feature names
X = df_le.loc[:, df_le.columns != class_name].values  # class_name is 'class'
y = df_le[class_name].values
'X' looks like this:
array([[39,  6,  9, ...,  0, 40, 38],
       [50,  5,  9, ...,  0, 13, 38],
       [38,  3, 11, ...,  0, 40, 38],
       ...,
       [58,  3, 11, ...,  0, 40, 38],
       [22,  3, 11, ...,  0, 20, 38],
       [52,  4, 11, ...,  0, 40, 38]], dtype=int64)
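(The label_encode helper isn't shown in this thread; a minimal sketch of such a helper, assuming it wraps scikit-learn's LabelEncoder per discrete column and returns the encoded frame together with the fitted encoders:)
from sklearn.preprocessing import LabelEncoder

def label_encode(df, columns):
    # Encode each discrete column to integer codes, keeping the fitted encoders
    # so the codes can be mapped back to the original category labels later.
    df_le = df.copy()
    encoders = {}
    for col in columns:
        le = LabelEncoder()
        df_le[col] = le.fit_transform(df_le[col])
        encoders[col] = le
    return df_le, encoders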
Then, after training, I followed your code:
import numpy as np
import contrastive_explanation as ce  # import assumed from the 'ce' alias used below

sample = x_test[17]

# Create a domain mapper (map the explanation to meaningful labels for explanation)
dm = ce.domain_mappers.DomainMapperTabular(
    x_train,
    feature_names=np.array(['age', 'workclass', 'education', 'marital-status',
                            'occupation', 'relationship', 'race', 'sex',
                            'capital-gain', 'capital-loss', 'hours-per-week',
                            'native-country']),
    contrast_names=np.array(['<=50k', '>50k']),
    categorical_features=np.array([1, 2, 3, 4, 5, 6, 7, 11]))

# Create the contrastive explanation object (default is a Foil Tree explanator)
exp = ce.ContrastiveExplanation(dm)

# Explain the instance (sample) for the given model
exp.explain_instance_domain(model.predict_proba, sample)
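(The classifier behind model.predict_proba is not shown above; for completeness, a minimal sketch of the training step it refers to, assuming a scikit-learn RandomForestClassifier and a plain train/test split:)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split the label-encoded data and fit a probabilistic classifier
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_train, y_train)
With x_train, x_test and model defined this way, the snippet above runs end to end.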
Can you try your code on the adult income dataset or any other dataset with multi-valued categorical features? Thanks in advance!
I added your case as example number 2 to the example notebook.
Does the current implementation support only binary-valued categorical features?
I ask because I tried it with the adult income dataset, which has many multi-valued categorical and continuous features (https://archive.ics.uci.edu/ml/datasets/adult), and got outputs like these:
Here, education and occupation are not binary features; they have many levels.