SeldonIO / alibi

Algorithms for explaining machine learning models
https://docs.seldon.io/projects/alibi/en/stable/

Data with only numeric features - alternative to get_he_preprocessor() to avoid argument category_map #809

Open pranavn91 opened 2 years ago

pranavn91 commented 2 years ago

I am trying the tutorial given at https://docs.seldon.io/projects/alibi/en/stable/examples/cfrl_adult.html. However, my data has only numeric features, so I get stuck at the line below because I have no categorical values to pass for the category_map argument. Please suggest what I can do to solve this error.

heae_preprocessor, heae_inv_preprocessor = get_he_preprocessor(X=X_train,
                                                               feature_names=Xnew30.columns,
                                                               feature_types=feature_types)
TypeError                                 Traceback (most recent call last)
Input In [30], in <cell line: 2>()
      1 # Define data preprocessor and inverse preprocessor. The inverse preprocessor includes datatype conversions.
----> 2 heae_preprocessor, heae_inv_preprocessor = get_he_preprocessor(X=X_train,
      3                                                                feature_names=Xnew30.columns,
      4                                                                feature_types=feature_types)
      6 numerical_ids = np.arange(len(X_train.columns))
      8 # Define trainset

TypeError: get_he_preprocessor() missing 1 required positional argument: 'category_map'
mauicv commented 2 years ago

Hey @pranavn91,

Thanks for opening the issue.

To avoid categorical variables in the heae_preprocessor you can just set category_map={}. Note that you only need to set the feature_types dict if the corresponding features should be cast to a certain type. For instance, if your dataset includes features such as age, you might want them to be ints rather than floats. For the code below, I've dropped the categorical features from the Adult dataset and kept just the numerical ones. I want them all to be ints, however, so I'm using the following:

heae_preprocessor, heae_inv_preprocessor = get_he_preprocessor(X=X_train,
                                                               feature_names=['Age', 'Capital Gain', 'Capital Loss', 'Hours per week'],
                                                               category_map={},
                                                               feature_types={"Age": int, "Capital Gain": int, "Capital Loss": int, "Hours per week": int})

If, however, you're happy with your data just being floats, you can use:

heae_preprocessor, heae_inv_preprocessor = get_he_preprocessor(X=X_train,
                                                               feature_names=['Age', 'Capital Gain', 'Capital Loss', 'Hours per week'],
                                                               category_map={},
                                                               feature_types={})

and in this case the heae_preprocessor will just apply sklearn's StandardScaler to the data. In the future, we will make these parameters optional.
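For intuition, both returned callables are plain numpy transforms: the preprocessor standardizes the numerical columns and the inverse preprocessor undoes the scaling and applies any dtype casts from feature_types. Here is a minimal sketch of the roundtrip (the toy data and column names are made up for illustration):

import pandas as pd
from alibi.explainers.backends.cfrl_tabular import get_he_preprocessor

# Toy all-numerical dataset (illustrative only).
X_train = pd.DataFrame({"Age": [25, 40, 33], "Hours per week": [40, 50, 20]})

preproc, inv_preproc = get_he_preprocessor(X=X_train.to_numpy(),
                                           feature_names=list(X_train.columns),
                                           category_map={},   # no categorical features
                                           feature_types={"Age": int, "Hours per week": int})

Z = preproc(X_train.to_numpy())   # standardized columns (zero mean, unit variance)
X_back = inv_preproc(Z)           # back to the original scale, cast back to int
print(Z.shape, X_back[0])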

You can then define the training set as follows:

import numpy as np
import tensorflow as tf

# Define trainset: the autoencoder reconstructs the 4 preprocessed numerical features.
trainset_input = heae_preprocessor(X_train).astype(np.float32)
trainset_outputs = {"output_1": trainset_input[:, :4]}
trainset = tf.data.Dataset.from_tensor_slices((trainset_input, trainset_outputs))
trainset = trainset.shuffle(1024).batch(128, drop_remainder=True)

The Encoder and Decoder defined in the example are for categorical data. For numerical-only data you can use:

from typing import List

import tensorflow as tf
from tensorflow import keras

class Encoder(keras.Model):
    def __init__(self, hidden_dim: int, latent_dim: int, **kwargs):
        super().__init__(**kwargs)
        self.fc1 = keras.layers.Dense(hidden_dim)
        self.fc2 = keras.layers.Dense(latent_dim)

    def call(self, x: tf.Tensor, **kwargs) -> tf.Tensor:
        x = tf.nn.relu(self.fc1(x))
        x = tf.nn.tanh(self.fc2(x))
        return x

class Decoder(keras.Model):
    def __init__(self, hidden_dim: int, output_dim, **kwargs):
        super().__init__(**kwargs)

        self.fc1 = keras.layers.Dense(hidden_dim)
        self.fc2 = keras.layers.Dense(output_dim)

    def call(self, x: tf.Tensor, **kwargs) -> List[tf.Tensor]:
        x = tf.nn.relu(self.fc1(x))
        return self.fc2(x)

and then:

import os

from tensorflow import keras
from alibi.models.tensorflow import AE

# Define autoencoder path and create dir if it doesn't exist.
ae_path = os.path.join("tensorflow", "autoencoder")
if not os.path.exists(ae_path):
    os.makedirs(ae_path)

# Define constants.
EPOCHS = 50              # epochs to train the autoencoder
HIDDEN_DIM = 3           # hidden dimension of the autoencoder
LATENT_DIM = 2           # latent dimension (must match latent_dim passed to the explainer)

# Define the auto-encoder for the numerical features.
ae = AE(encoder=Encoder(hidden_dim=HIDDEN_DIM, latent_dim=LATENT_DIM),
        decoder=Decoder(hidden_dim=HIDDEN_DIM, output_dim=4))

# Define loss functions.
he_loss = keras.losses.MeanSquaredError()

# Compile model.
ae.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
           loss=he_loss)

if len(os.listdir(ae_path)) == 0:
    # Fit and save autoencoder.
    ae.fit(trainset, epochs=EPOCHS)
    ae.save(ae_path, save_format="tf")
else:
    # Load the model.
    ae = keras.models.load_model(ae_path, compile=False)
mauicv commented 2 years ago

Note also that for datasets with a small number of numerical features you might not even want to use the autoencoder, as it's only there for dimensionality reduction. See this example for how to do that.
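If you do want to skip the dimensionality reduction entirely, one option (an assumption on my part, not taken from that example) is to wire up identity encoder/decoder models so the "latent" space is simply the preprocessed feature space:

from typing import List

import tensorflow as tf
from tensorflow import keras

class IdentityEncoder(keras.Model):
    # Hypothetical pass-through: the "latent" representation is just the preprocessed features.
    def call(self, x: tf.Tensor, **kwargs) -> tf.Tensor:
        return x

class IdentityDecoder(keras.Model):
    # Hypothetical pass-through returning a list, to match the decoder interface used above.
    def call(self, x: tf.Tensor, **kwargs) -> List[tf.Tensor]:
        return [x]

In that case you would pass latent_dim equal to the number of preprocessed features when constructing the explainer.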

pranavn91 commented 2 years ago

Thanks. I used the solution suggested and it is working without errors.

pranavn91 commented 2 years ago

So if no immutable/categorical features and no ranges are given, we can write it as below?

# Define constants
COEFF_SPARSITY = 0.5               # sparsity coefficient
COEFF_CONSISTENCY = 0.5            # consistency coefficient
TRAIN_STEPS = 10000                # number of training steps -> consider increasing the number of steps
BATCH_SIZE = 100                   # batch size

explainer = CounterfactualRLTabular(predictor=predictor,
                                    encoder=ae.encoder,
                                    decoder=ae.decoder,
                                    latent_dim=2,
                                    encoder_preprocessor=heae_preprocessor,
                                    decoder_inv_preprocessor=heae_inv_preprocessor,
                                    coeff_sparsity=COEFF_SPARSITY,
                                    coeff_consistency=COEFF_CONSISTENCY,
                                    category_map={},
                                    feature_names=X.columns,                                    
                                    train_steps=TRAIN_STEPS,
                                    batch_size=BATCH_SIZE,
                                    backend="tensorflow")
mauicv commented 2 years ago

Hey,

Did you try running it? What happened?

I think I've made a minor mistake in the above; the Decoder model should return a list of tensors, like so:

class Decoder(keras.Model):
    def __init__(self, hidden_dim: int, output_dim, **kwargs):
        super().__init__(**kwargs)

        self.fc1 = keras.layers.Dense(hidden_dim)
        self.fc2 = keras.layers.Dense(output_dim)

    def call(self, x: tf.Tensor, **kwargs) -> List[tf.Tensor]:
        x = tf.nn.relu(self.fc1(x))
        return [self.fc2(x)]

this also means that when you train the autoencoder you'll need to use a list of losses:

# Compile model.
ae.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
           loss=[he_loss])
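Once the explainer compiles, the remaining steps are the same as in the linked Adult tutorial: fit on the training data, then request counterfactuals for a target class. A rough sketch (X_test and the target label here are illustrative; check the tutorial for the exact explain arguments):

import numpy as np

# Fit the explainer (trains the CFRL actor on the training data).
explainer = explainer.fit(X=X_train)

# Request counterfactuals that flip the prediction to class 1 (illustrative target).
Y_t = np.array([1])
explanation = explainer.explain(X_test, Y_t, batch_size=100)
print(explanation.data["cf"]["X"])   # counterfactual instances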
pranavn91 commented 2 years ago

Yes, it is working, thanks.

From my understanding of the example given at https://docs.seldon.io/projects/alibi/en/stable/examples/cfrl_adult.html:

  1. The quality of the counterfactuals generated depends on the black-box model - in the above link the random forest should be as accurate as possible

  2. And the encoder-decoder loss should be as low as possible

Am I correct?

mauicv commented 2 years ago

Hey @pranavn91,

The quality of the counterfactuals generated depends on the black-box model - in the above link the random forest should be as accurate as possible

It depends on what you're using the counterfactuals for. If you're using the counterfactual to debug the model, then it doesn't matter how good the model is; the counterfactual can still be useful for understanding how the model is failing (although if you're doing this, be careful).

Alternatively, if you're using the counterfactual to add functionality on top of the model, then you'd want the model to be as accurate as possible. As an example, maybe you have a model that predicts the risk of some disease from a set of features (things like age or hours of exercise per week). One use of a counterfactual might be to advise a user how to change their features to get a better outcome: if I'm a user and the model says I'm likely to get this disease, the counterfactual would tell me how to change my behaviour to avoid it. This depends on the model being accurate, however, so in this case accuracy is key.

And the encoder-decoder loss should be as low as possible

The autoencoder is a dimensionality-reduction step that makes the DDPG algorithm the method is based on faster to train. We train the actor in the latent space, so yes: because we have to reconstruct the data using the decoder, it's important that the autoencoder is well trained.
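A quick way to sanity-check this is to look at the reconstruction error of the trained autoencoder on the preprocessed data before training the explainer; a minimal sketch using the objects defined earlier in this thread (ae, heae_preprocessor, X_train):

import numpy as np

# Reconstruction check on the preprocessed training data.
X_proc = heae_preprocessor(X_train).astype(np.float32)
X_rec = ae(X_proc)
X_rec = X_rec[0] if isinstance(X_rec, list) else X_rec   # the decoder may return a list of tensors
mse = float(np.mean((X_proc - np.asarray(X_rec)) ** 2))
print(f"mean reconstruction MSE: {mse:.4f}")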

Berlyli866 commented 1 week ago

Hi, if I have binary (only values 0, 1) and numerical features and I want to use get_he_preprocessor, do I need to provide the binary features in category_map, or, since they are already one-hot encoded, should I leave category_map={}? Should I use

heae_preprocessor, heae_inv_preprocessor = get_he_preprocessor(X=X_train,
                                                               feature_names=['has_image', 'num_images', 'has_video', 'price'],
                                                               category_map={'has_image':[0,1],'has_video':[0,1]},
                                                               feature_types={'has_image':int, 'num_images':int, 'has_video':int})

or

heae_preprocessor, heae_inv_preprocessor = get_he_preprocessor(X=X_train,
                                                               feature_names=['has_image', 'num_images', 'has_video', 'price'],
                                                               category_map={},
                                                               feature_types={'has_image':int, 'num_images':int, 'has_video':int})

For the encoder and decoder for the binary features, can I follow how the Encoder and Decoder were defined in the example (Adult Census) for categorical data, since binary features can be considered categorical features?

Thanks for your help and insights!