SeldonIO / alibi

Algorithms for explaining machine learning models
https://docs.seldon.io/projects/alibi/en/stable/

Using Anchor Tabular with Raw Data #584

Open HeyItsBethany3 opened 2 years ago

HeyItsBethany3 commented 2 years ago

Hi all,

I'm trying to use Anchor Tabular on a new model, but I cannot fit the anchor. The model works with missing data, so initially I've been using missing data to train the explainer. I've also tried replacing all the NaN values with valid data, but this still gives the same error. I've used your category map function to create the map. Note the data hasn't been ordinally/one-hot encoded (for categorical data) or standardised (for numerical data), as the model to explain uses raw data.

Thanks so much for your help!

Code:

explainer = AnchorTabular(predict_fn, feature_names=features, categorical_names=category_map, seed=1)
explainer.fit(data.xTrain, disc_perc=[25, 50, 75])
Traceback (most recent call last):
 File "pipeline.py", line 20, in <module>
    explainer.fit(data.xTrain, disc_perc=[25, 50, 75])
 File "....environment/lib/python3.8/site-packages/alibi/explainers/anchor_tabular.py", line 769, in fit
    disc = Discretizer(train_data, self.numerical_features, self.feature_names, percentiles=disc_perc)
File "....environment/lib/python3.8/site-packages/alibi/utils/discretizer.py", line 30, in __init__
    bins = self.bins(data)
  File "....environment/lib/python3.8/site-packages/alibi/utils/discretizer.py", line 88, in bins
    qts = np.array(np.percentile(data[:, feature], self.percentiles))
  File "....environment/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "....environment/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 75, in pandas._libs.index.IndexEngine.get_loc
TypeError: '(slice(None, None, None), 0)' is an invalid key
mauicv commented 2 years ago

Hey @HeyItsBethany3, thanks for opening the issue. Can you share more in the way of code/data? If the data or code is sensitive, is it possible to create a minimal example with fake data? Thanks!

mauicv commented 2 years ago

FYI: I think it's almost certainly related to this issue. It sounds like the problem is in the data format. The categorical variables have to be either one-hot encoded or label/ordinal-encoded, otherwise it won't work. I'd need more info to be sure though.

HeyItsBethany3 commented 2 years ago

Hi @mauicv, Thank you, this was the issue - I was using a dataframe for xTrain instead of a numpy array. I am still getting this error below, probably because the categorical variables are not encoded.

The dataframe has numerical and some categorical variables. It also has an index of IDs which I've had to treat as a model variable, even though it's immutable.

Do the variables definitely need to be encoded? I have a black-box model that I can't modify. It takes in the variables as they are, without any processing, and must encode them under the hood. I'm concerned that if I encode the variables I will get very different results from the model, or the model won't work at all. Is there any way to modify the alibi code so this could work? My prediction function itself works fine, taking in the raw data and returning the prediction.

explainer.fit(numpy_xTrain, disc_perc=[25, 50, 75])
  File ".../lib/python3.8/site-packages/alibi/explainers/anchor_tabular.py", line 769, in fit
    disc = Discretizer(train_data, self.numerical_features, self.feature_names, percentiles=disc_perc)
  File "..../lib/python3.8/site-packages/alibi/utils/discretizer.py", line 30, in __init__
    bins = self.bins(data)
  File "..../lib/python3.8/site-packages/alibi/utils/discretizer.py", line 88, in bins
    qts = np.array(np.percentile(data[:, feature], self.percentiles))
  File "<__array_function__ internals>", line 5, in percentile
  File ".../lib/python3.8/site-packages/numpy/lib/function_base.py", line 3867, in percentile
    return _quantile_unchecked(
  File "..../lib/python3.8/site-packages/numpy/lib/function_base.py", line 3986, in _quantile_unchecked
    r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
  File "..../lib/python3.8/site-packages/numpy/lib/function_base.py", line 3564, in _ureduce
    r = func(a, **kwargs)
  File ".../lib/python3.8/site-packages/numpy/lib/function_base.py", line 4112, in _quantile_ureduce_func
    r = _lerp(x_below, x_above, weights_above, out=out)
  File "..../lib/python3.8/site-packages/numpy/lib/function_base.py", line 4009, in _lerp
    diff_b_a = subtract(b, a)
TypeError: unsupported operand type(s) for -: 'str' and 'str'
mauicv commented 2 years ago

Hey @HeyItsBethany3, Is it possible to define a map between the alibi-compliant format and your model-compliant format? If so then you can do the following:

import numpy as np

# f takes the data format your model uses and encodes it into the format alibi expects
def f(X: np.ndarray, **kwargs) -> np.ndarray:  # use **kwargs for any other information needed to do the conversion
    Z_num = extract_numeric(X, **kwargs)  # extract columns like 49.5; Z_num is now a homogeneous array of numbers
    Z_cat = extract_cat(X, **kwargs)  # take columns like 'Male' and convert to 0; Z_cat is now a homogeneous array of numbers
    Z = combine(Z_num, Z_cat, **kwargs)  # concatenate the columns in the right order
    return Z

# the f inverse function takes the encoded data alibi needs and decodes it into the format your model uses
def f_inv(Z: np.ndarray, **kwargs) -> np.ndarray:
    X = ...  # do similar operations as above, in reverse
    return X

This allows you to define a wrapped model as:

def M_hat(Z: np.ndarray) -> np.ndarray: # Z here is the encoded data that doesn't work with your model
    X = f_inv(Z) # X is now the data that does work with your model
    pred = M(X)
    return pred

You can then pass M_hat into the AnchorTabular explainer; you will also need to convert the dataset passed to the fit method into the alibi-compliant format:

explainer.fit(f(model_compliant_data), disc_perc=[25, 50, 75])

This way the data passed to the explainer is always in the correct format, and you don't have to modify the model itself.
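For concreteness, a rough end-to-end sketch of the above (reusing the hypothetical f, M_hat, features, category_map and model_compliant_data names from above, so only the structure is meant literally):

from alibi.explainers import AnchorTabular

# the explainer only ever sees the wrapped model and the alibi-compliant (encoded) data
explainer = AnchorTabular(M_hat, feature_names=features, categorical_names=category_map, seed=1)
explainer.fit(f(model_compliant_data), disc_perc=[25, 50, 75])

# explain a single instance by encoding it the same way as the training data
encoded_instance = f(model_compliant_data)[0]
explanation = explainer.explain(encoded_instance, threshold=0.95)
print(explanation.anchor)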

HeyItsBethany3 commented 2 years ago

Hi @mauicv. Thank you for this explanation! I'll implement this and let you know how it goes.

I don't know the exact mapping of the variables but I know they are likely ordinally encoded.

Please could you explain why alibi needs to know the exact mappings? I understand that it needs variables in an encoded format for compatibility and for the code to work. But how do the encodings affect the way explanations are generated? What is the effect of getting this mapping slightly wrong, compared to the underlying model?

Thank you!

HeyItsBethany3 commented 2 years ago

Hi @mauicv, I hope you are well! The wrapper model worked, thank you so much for your help!

The main problem I'm now having is computation time. It takes around 20-30 minutes to generate one explanation for an instance. I've implemented the DistributedAnchorTabular explainer, but this does not speed up the explanation very much. Should I annotate my functions, e.g. predict_fn, with ray decorators like @ray.remote, or is it all taken care of within this explainer? Which parts of the anchor generation process are being parallelised?

Thank you

HeyItsBethany3 commented 2 years ago

I'm also getting a strange error when fitting the explainer:

predict_fn = lambda x: predictor.score(x)

encoded_xTrain = encoder.encode_data(data.xTrain.reset_index())

category_map = encoder.get_category_map(encoded_xTrain)
features = encoder.get_features()

explainer = AnchorTabular(predict_fn, feature_names=features, categorical_names=category_map, seed=1)
explainer.fit(encoded_xTrain, disc_perc=[25, 50, 75])

data.xTrain is a pandas dataframe, and encode_data converts it into a numpy array, so encoded_xTrain is of the form [[var1 var2 ..... ] [var1 var2 ....]]. The encoder converts categorical variables into ordinally encoded variables. The feature names are of the form [cat1, .... ,cat10, num1, .... ,num50], which is also the order in which the features appear in the encoded data. The category map is of the form {cat1: [0, 1, 2], cat2: [0,1,2,3]....}. The strange thing is that predict_fn(encoded_xTrain) works completely fine, but fitting the explainer doesn't.

The error I'm getting is this:

File "test_model.py", line 35, in <module>
    explainer.fit(encoded_xTrain, disc_perc=[25, 50, 75])
  File "env/lib/python3.8/site-packages/alibi/explainers/anchor_tabular.py", line 782, in fit
    self.samplers = [sampler.deferred_init(train_data, d_train_data)]
  File "env/lib/python3.8/site-packages/alibi/explainers/anchor_tabular.py", line 92, in deferred_init
    self.val2idx = self._get_data_index()
  File "env/collie_env/lib/python3.8/site-packages/alibi/explainers/anchor_tabular.py", line 170, in _get_data_index
    val2idx[feat][value] = (self.d_train_data[:, feat] == value).nonzero()[0]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Thanks so much for your help

HeyItsBethany3 commented 2 years ago

@mauicv I've fixed this error by using a different category map. I use the string values of the categories, e.g. {cat1: ["a", "b"]}. Why does this work when I am passing in the encoded data?

Also, [batch_size](https://docs.seldon.io/projects/alibi/en/latest/api/alibi.explainers.anchor_tabular.html) for AnchorTabular is 100 by default. Does this mean that only 100 instances of the training data are ever used, or does the whole set get used? Thanks so much, really appreciate your help.

mauicv commented 2 years ago

Hey @HeyItsBethany3,

Sorry, I've been quite busy this last week. I'll try to answer some of your questions:


The main problem I'm now having is computation time. It takes around 20-30 minutes to generate one explanation for an instance.

This seems strange. Unless your tabular data has a very large number of features, it shouldn't take that long. Can you describe your data a little? What's the total number of features? How many categorical features do you have, and how many categories does each have? Perhaps you could try running a profiler to see where the majority of the computation is occurring; if you send the output I might be better able to help. What batch_size are you using in the explain method?

In terms of DistributedAnchorTabular you don't need to annotate your functions. DistributedAnchorTabular distributes the sampling process for anchors. So in order to create an anchor, you have to be able to evaluate the performance of sets of candidate anchors in order to choose the best. This involves drawing samples that are both in the anchor and in the dataset and then computing the model prediction on these samples. It's this process that is parallelised in DistributedAnchorTabular.


w.r.t.

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I've tried to guess what you are doing here. I was able to reproduce the error with:

import numpy as np
import random
from alibi.explainers import AnchorTabular

pred_fn = lambda x: np.array([1])
cat_map = {'cat_feat': ['a', 'b']}
feat_names = ['cat_feat', 'num_feat']
explainer = AnchorTabular(pred_fn, feature_names=feat_names, categorical_names=cat_map)

print(f'{explainer.numerical_features=}')
print(f'{explainer.categorical_features=}')

data = []
for _ in range(1000):
    data.append([random.choice([0, 1]), random.randint(0, 100)])
data = np.array(data, dtype=int)

print(data[0])

explainer.fit(data, disc_perc=[25, 50, 75])

If you run the above you get:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

The issue in the above is that the cat_map variable should be a dictionary with column indices as keys and a list of category values as values (see here for an example). In the above, if I change cat_map = {'cat_feat': ['a', 'b']} to cat_map = {0: ['a', 'b']}, the error resolves! It's not completely clear to me if this is the error you've got, as I'm not sure what cat1 is in {cat1: ["a", "b"]}.
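For reference, this is the format alibi expects: integer column indices as keys and the list of category values for that column as values. A minimal sketch using alibi's gen_category_map helper (the column names and values here are made up; depending on your alibi version the import may be from alibi.utils.data instead):

import pandas as pd
from alibi.utils import gen_category_map  # in some versions: from alibi.utils.data import gen_category_map

df = pd.DataFrame({
    'cat_feat': ['a', 'b', 'a', 'b'],   # column index 0 -> categorical
    'num_feat': [1.0, 2.5, 3.0, 4.2],   # column index 1 -> numerical
})

# keys are integer column positions, values are the category names for that column
category_map = gen_category_map(df, categorical_columns=[0])
print(category_map)  # expected to be something like {0: ['a', 'b']}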


__EDIT: The following description of batch_size isn't completely correct. See following comment for clarification.__

w.r.t. your question about batch_size. From here:

batch_size: the batch size used for sampling. A bigger batch size gives more confidence in the anchor, again at the expense of computation time since it involves more model prediction calls. The default value is 100.

So, similar to the explanation above, when we're evaluating a candidate anchor we need to draw samples from it. batch_size here is the number of samples we draw per candidate anchor. During the process of obtaining an anchor we need to consider many candidate anchors, but for each candidate anchor only 100 samples from the dataset will be considered, unless you change batch_size.

HeyItsBethany3 commented 2 years ago

Hey @mauicv, thank you for your reply! I'll definitely have a look into a profiler. I'm using a large dataset with 120 variables, 10 of which are categorical. Most of the categorical variables have fewer than 6 categories (except for one, which has 9). I'm using the default batch_size of 100. There are 15,000 rows in my training data and 5,000 in my test data.

Thank you for your explanation of batch_size. I'm still a bit confused: should I increase the batch size to 15,000 to use the whole of the dataset, or does the training data get used in other ways?

Thanks, yes this was the error! My category map now looks like category_map = {0: ['a', 'b', 'c', 'd', 'e', 'f']}, which works, but I was confused as I thought the category labels 'a', 'b', 'c' should have been the encoded versions instead, i.e. [0, 1, 2].

mauicv commented 2 years ago

Hey @HeyItsBethany3, Sorry, I've realised I've misled you slightly w.r.t. the batch_size explanation:

The way we build anchors is we:

  1. Start with an initial empty candidate anchor, i.e. the whole dataset
  2. Generate a set of candidate anchors from the initial candidate anchor
  3. Compute the best candidate anchor of the set
  4. Use this anchor to generate the next set
  5. Repeat steps 2-4 until we have a good enough anchor.

Part 3 of this algorithm requires choosing the best anchor from a set of anchors. To do this we draw batch_size samples from each anchor to estimate their performance. We keep doing this until we are confident that one of them is the best. We then use that anchor to generate the next set, and so on. The original paper contains a better description of this.
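As a very rough sketch of step 3 only (not alibi's implementation, which uses a bandit procedure, KL-LUCB in the paper, to decide how many batches it needs), assuming hypothetical sample_from_anchor and predict_fn helpers:

import numpy as np

def best_candidate(candidates, sample_from_anchor, predict_fn, target_pred, batch_size=100):
    # estimate each candidate anchor's precision from batch_size samples
    # and keep the candidate with the highest estimate
    precisions = []
    for anchor in candidates:
        samples = sample_from_anchor(anchor, batch_size)  # hypothetical sampler
        preds = predict_fn(samples)
        precisions.append(np.mean(preds == target_pred))
    best = int(np.argmax(precisions))
    return candidates[best], precisions[best]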

It's possible that the algorithm spends a long time sampling in order to be confident enough in the best anchor, but I think it's more likely to be the size of your dataset. 110 numerical features and 10 categorical features could be big enough to cause the computation to take a while. I think the best way to tell is to use a profiler.

HeyItsBethany3 commented 2 years ago

Thanks so much for all your help with this and your explanation. I have a few questions on the technical details of precision and coverage:

Thanks so much.

mauicv commented 2 years ago

Hey @HeyItsBethany3, good questions! I'll try my best to answer them. One request: could you open a new issue for the Counterfactual question? We try to keep issues scoped to one topic, as it makes it easier for other users of alibi to find answers to questions they might have.

1. Sampling process

How are instances sampled? (For instance when building the anchors like above or when computing precision/coverage?) Are the coverage samples local to the instance to explain or global?

The instance sampling process for points in the anchor is very dependent on the data modality. It's also quite difficult to do and is one of the issues with this method. In the case of tabular data, we sample from an anchor by first drawing N samples from the entire dataset. We then find all the rows in the data that satisfy the anchor condition and replace the features in the original N samples so that they now satisfy the anchor condition. It may be easiest to see with an example. Suppose you have the dataset:

| age | education-level |
|-----|-----------------|
| 23  | HS-grad         |
| 20  | HS-grad         |
| 19  | HS-grad         |
| 23  | Bachelors       |
| 29  | Masters         |
| 29  | Doctorate       |

Suppose N=3 and the anchor is [age>=20, age<=23]. To sample from the above, we first randomly sample 3 instances from the entire dataset; perhaps we obtain (23, HS-grad), (29, Doctorate) and (29, Masters). Next, we randomly sample from the anchor itself; perhaps we get (23, HS-grad), (20, HS-grad), (23, Bachelors). We then splice the age feature into the original samples so that they now sit in the anchor. Doing so gives (23, HS-grad), (20, Doctorate), (23, Masters). These are now the samples that we consider as drawn from the anchor and which we use to compute the precision.
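A small numpy sketch of that splicing step (purely illustrative, not alibi's code; education-level is ordinally encoded as 0=HS-grad, 1=Bachelors, 2=Masters, 3=Doctorate):

import numpy as np

rng = np.random.default_rng(0)

# the toy dataset above: columns are [age, education-level]
data = np.array([[23, 0], [20, 0], [19, 0], [23, 1], [29, 2], [29, 3]])

satisfies_anchor = (data[:, 0] >= 20) & (data[:, 0] <= 23)  # anchor: 20 <= age <= 23

N = 3
base = data[rng.choice(len(data), size=N)]                                    # N samples from the entire dataset
donors = data[satisfies_anchor][rng.choice(satisfies_anchor.sum(), size=N)]   # N samples from the anchor

base[:, 0] = donors[:, 0]  # splice the anchored feature (age) into the base samples
print(base)                # every row now satisfies the anchor condition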

Note: There are some subtleties as to what happens if we don't have enough samples in the anchor to do the above. In those cases, we sample from weaker anchors (anchors with fewer conditions) and replace features that don't satisfy the full anchor feature conditions with feature values from the dataset that do. I can go into more detail here if you like?

In order to calculate the coverage, we sample from the whole dataset and calculate the fraction of instances sampled that sit in the anchor. You can set the number of samples to draw using the coverage_samples parameter.
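Continuing the sketch above (reusing the data array and rng from the sampling sketch), the coverage estimate is then just the fraction of dataset samples that land in the anchor:

coverage_sample = data[rng.choice(len(data), size=1000)]                   # sample from the whole dataset
in_anchor = (coverage_sample[:, 0] >= 20) & (coverage_sample[:, 0] <= 23)  # does the anchor apply?
print(in_anchor.mean())                                                    # estimated coverage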

The docs give more detail on sampling for other data modalities.

2. Precision

Is precision calculated over the whole data set or a subset? How many samples are used?

The precision of a given anchor is the fraction of samples within that anchor that obtain the same prediction under the model as the instance you're explaining. So in the above example we've drawn the three samples (23, HS-grad), (20, Doctorate), (23, Masters) from the anchor. We then put each of these samples through the model to see what prediction it gives: perhaps <17000, <17000 and >17000. Then, if the original instance being explained had a prediction of <17000, the overall estimated precision is 2/3. So, to your question, the samples are drawn from the subset of the data corresponding to the anchor and not from the whole dataset.

In terms of how many samples we use to compute the precision of an anchor, it depends on how confident we are that it's better than the other anchors. Recall that we're building these anchors up from an initial anchor that is the whole dataset. At each stage of this process we have a set of candidate anchors and we want to find the highest-precision candidate, which we'll then use to create the next generation of candidate anchors. Sometimes you might have two candidate anchors that are pretty evenly matched in terms of precision, and in this case we might need to draw many samples before we can confidently say that one is better than the other. In other cases, one anchor might be obviously better and we only need to draw a small number of samples to be sure of it. We do this in an iterative process, drawing batch_size samples each time. So we use at least batch_size samples to compute the precision.

3. Coverage:

Coverage is the fraction of times when the anchor applies to other instances. What does 'applies to other instances' mean? For instance, with an anchor 'If age=23 then income<17000', does this anchor apply to other instances when the condition holds (age=23) or when both the condition and the prediction hold?

I'm not sure what you are quoting here but I think it refers to the fact that the anchor contains by construction the instance we're explaining. In this sense the coverage is the proportion of other instances in the dataset that the anchor applies to? I'm not sure though.

In your example, the anchor applies to instances where the condition holds, not where the prediction holds. The proportion of the dataset for which the anchor holds is the coverage. The idea is that we construct the anchor so that any instance in the anchor should obtain the same prediction, income<17000, with high probability. So instances in the anchor should be likely to match the prediction by construction.

4. Instance Explanations

The best anchor explanation for one instance might apply to other instances but might not be the best explanation for them. How do we know that the anchor explanation is a suitable explanation for the other instances used to compute coverage?

I'm not sure what you're quoting here so I'm not exactly sure how to interpret it? Can you link me to where you read this?

In general, anchors derived from different instances will be different, because those anchors are constructed using the feature values of those different instances. Perhaps a feature value present in one instance will turn out to be more significant for the model's prediction on that instance than the value of the same feature is on a different instance. Those two instances could still be in each other's anchors, however. Anchors apply to instances in the sense that if an instance is in an anchor, then it's a strong indication that it'll have the same model prediction as the instance the anchor explains.


btw this book: interpretable-ml is a great resource that explains the above better than I do.

HeyItsBethany3 commented 2 years ago

@mauicv Thank you so much for your detailed response! This makes a lot of sense.

Thank you!

mauicv commented 2 years ago

Hey @HeyItsBethany3,

When N instances are sampled from the dataset, are they sampled randomly?

Yes, you can see this in the implementation here.


I'm interested in why you use this splicing method instead of taking real data points when you perform the sampling. Do you find this results in unrealistic data instances, especially when features are correlated?

We intervene in the distribution in this way to break any correlation between the features the anchor has predicates for and those it doesn't. We do this because the model may be using those correlated features to make predictions. So if an anchor has predicates on features that are correlated with features it does not have predicates for, then if we naively sample from the anchor, the model might be making predictions based on those correlated features rather than the ones the anchor is actually selecting on.

As an example, suppose you have a dataset of two numerical features and they're strongly correlated. Because they're so strongly correlated, the model could end up learning to base its predictions on just one of them rather than the other. If we obtain an anchor for this model, we'd hope that it would give us an explanation in terms of the feature the model is actually using rather than the one it's not. However, if we only sample those instances in the dataset that are in the anchor, then because of this correlation we'd get a high precision even for anchors based on the feature the model doesn't actually care about. If we break the correlation by sampling the feature the anchor doesn't have a predicate for from the dataset as a whole, then the precision will reflect this.
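A tiny simulation of this point (purely illustrative, not alibi code): two strongly correlated features, a model that only uses feature 0, and a candidate anchor with a predicate only on feature 1. Naively sampling real rows inside the anchor gives a misleadingly high precision, while the interventional splicing reveals that the anchored feature is not what the model uses:

import numpy as np

rng = np.random.default_rng(0)

x0 = rng.normal(size=10_000)
x1 = x0 + rng.normal(scale=0.1, size=10_000)         # x1 strongly correlated with x0
data = np.column_stack([x0, x1])

model = lambda X: (X[:, 0] > 0).astype(int)           # the model only looks at feature 0
target_pred = 1                                       # prediction for the instance being explained

anchor_rows = data[data[:, 1] > 0]                    # candidate anchor: x1 > 0

# naive sampling: real rows satisfying the anchor, so the correlated x0 leaks in
naive_precision = np.mean(model(anchor_rows) == target_pred)           # close to 1, misleading

# interventional sampling: random rows with only the anchored feature overwritten
base = data[rng.choice(len(data), size=5_000)]
base[:, 1] = anchor_rows[rng.choice(len(anchor_rows), size=5_000), 1]
interventional_precision = np.mean(model(base) == target_pred)         # close to 0.5

print(naive_precision, interventional_precision)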


What level of coverage do you generally view as desirable? My coverage tends to be around 20%, which I thought was reasonable, but I'm interested in your thoughts.

It really depends on the model and the dataset. The main factor that impacts coverage is the proximity of the instance being explained to a decision boundary. If it's close, then it's likely to have lower coverage than anchors away from the decision boundary.


Do you think it's possible to precompute anchors to cover the whole of the feature space, so that we essentially have a look-up table for anchors? Or do they have to be built on a per-sample basis?

Can you explain precisely what you mean here? I don't think I understand. In the paper they simulate a user study in which they generate a set of anchors that cover as much of the feature space as possible and then use them in lieu of the model. This sounds like what you're describing, but I'm not sure.