jklaise opened this issue 4 years ago
@jklaise just my thought: instead of the wrapper (`IGText`) dealing with a tokenizer in its `__init__()`, you could just pass the `word_to_index` dictionary, since most Python-based tokenizers implement a dictionary to convert tokens to token ids. This would avoid the problem of dealing with multiple frameworks (TensorFlow, PyTorch, etc.).
I think this is a good idea: given the `word_to_index` dictionary we could use it to output meaningful explanations. However, unless I'm missing something, this is not enough for the method to run on raw text; the tokenizer would still need to do its job to produce the tokens.
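For illustration, a hedged sketch of the display side (the `token_ids` and `attrs` variables and the `<UNK>` fallback are assumptions): given `word_to_index`, inverting the dictionary lets us render per-token attributions back in word space.

```python
# invert the vocabulary so token ids can be rendered as words
index_to_word = {idx: word for word, idx in word_to_index.items()}

# token_ids: the integer-encoded input; attrs: per-token attribution
# scores (e.g. IG attributions summed over the embedding axis)
for token_id, score in zip(token_ids, attrs):
    print(f"{index_to_word.get(token_id, '<UNK>'):>15s}  {score:+.4f}")
```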
@jklaise ah, yeah, you are right about that part. If the string processing is being taken care of by the tokenizer's methods then this will be an issue. One way might be to have a function `preprocessing()` in `IGText` which does the splitting, ascii -> unicode conversion (or vice versa), punctuation removal, etc., and its output could be mapped to an integer sequence using `word_to_index`, but this might be inconsistent with how the base tokenizers process the strings.
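For concreteness, a minimal sketch of such a heuristic, assuming a regex-based cleanup and a hypothetical `<UNK>` id of 1 (none of this is anything alibi ships):

```python
import re
import numpy as np

def preprocessing(raw_string: str) -> list:
    """Naive lowercase/punctuation/whitespace heuristic; a real model's
    tokenizer may split the text very differently."""
    text = raw_string.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # strip punctuation
    return text.split()

def text_to_int(raw_string: str, word_to_index: dict) -> np.ndarray:
    # unknown tokens fall back to a hypothetical <UNK> id of 1
    return np.array([word_to_index.get(tok, 1)
                     for tok in preprocessing(raw_string)])
```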
@abhishek-niranjan yes, that's exactly the issue: the tokenizer is model-specific, so we shouldn't assume that a simple heuristic will work, as we may be generating invalid tokens with respect to the model.
@jklaise right, so the idea you initially suggested is the ideal way to go ahead.

> What needs to be explored is whether the proposal would work with all kinds of tokenizers and model backends (TensorFlow, and PyTorch when we have IG for PyTorch).

This will need to be thought over then. Are you looking to add this functionality once alibi is migrated to TF2.0?
@abhishek-niranjan this is likely to be a high-priority task, especially if people are interested in the functionality. Alibi does actually work on TF2.x now, although we are still working on converting the code base to idiomatic TF2.x code, together with the necessary abstractions to enable development of PyTorch methods in the future (the `interop-refactor` branch).
@jklaise what do you think about the following:
```python
class IGText(IntegratedGradients):
    def __init__(self, model, layer, method, word_to_index,
                 preprocessing_callable, n_steps, internal_batch_size):
        ...

    def text_to_int(self, raw_string):
        # tokenize with the user-supplied callable, then look up token ids
        return [self.word_to_index[token]
                for token in self.preprocessing_callable(raw_string)]
```
The idea is to take a callable argument in the `IGText` constructor. This will give the user the flexibility of choosing the preprocessing step from a package (spacy, etc.) or a user-defined function, whichever they used in the tokenization process when creating the training/validation dataset.
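For example, a usage sketch under the signature above (the spacy pipeline, the trained `model`, and the `word_to_index` vocabulary are assumed to exist; the argument values are illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the spacy model is installed

def spacy_tokenize(raw_string):
    # the same tokenization the user applied when building the dataset
    return [tok.text.lower() for tok in nlp(raw_string)]

ig_text = IGText(model=model,                  # a trained Keras model
                 layer=model.layers[0],        # e.g. the embedding layer
                 method="gausslegendre",
                 word_to_index=word_to_index,  # vocab built during training
                 preprocessing_callable=spacy_tokenize,
                 n_steps=50,
                 internal_batch_size=100)
```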
@abhishek-niranjan yes, this could be an option, similar to what's discussed in #244. This would be very flexible, although we should then define exactly what these callbacks should take and return. Also, we're looking into what's required to serialize explainer objects safely; custom callbacks could make this more challenging (although it's exactly the scenario we have now with black-box predictors).
@jklaise right, though I don't have much of a clue about serializing custom callbacks within objects, I can see how it could be challenging. As for the callback specific to this problem, the simpler it is the better. I mean, the end task of this function is to return a NumPy array given a `raw_string` and the `word_to_index` dictionary, right? Everything else will be taken care of by the base `explain()`.
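A minimal sketch of that contract (the type alias, the signature, and the `<UNK>` fallback are assumptions about how such a callback could be specified, not a settled API):

```python
from typing import Callable, Dict
import numpy as np

# hypothetical contract: raw string + vocabulary in, integer ids out
TextCallback = Callable[[str, Dict[str, int]], np.ndarray]

def default_callback(raw_string: str,
                     word_to_index: Dict[str, int]) -> np.ndarray:
    # naive whitespace split; unknown tokens map to a hypothetical <UNK> id
    unk = word_to_index.get("<UNK>", 0)
    return np.array([word_to_index.get(tok, unk)
                     for tok in raw_string.split()])
```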
@abhishek-niranjan yes, you're correct. In the coming days I will start looking at how adding custom pre- and post-processing callbacks could work with Alibi; this is also important for deploying explainers in complex inference graph setups with Seldon Core and KFServing.
This would be useful mostly from the point of view of application developers who wouldn't want to deal with fetching internal embedding layers and configuring those. We should think carefully about the design and what's possible to make this work.
Whilst `AnchorText` works directly on raw text, `IntegratedGradients` works on the token level. One reason for this is that `IntegratedGradients` is use-case agnostic: tabular data, images and text are handled in the same way, at the cost of the user having to do pre- and post-processing for text to display explanations in the raw text space (this point is related to #244).

This would amount to having a wrapper like `IGText`, a text-specific `IntegratedGradients` that works on arrays of strings. Fundamentally this means coupling the tokenizer together with the model. What needs to be explored is whether the proposal would work with all kinds of tokenizers and model backends (TensorFlow, and PyTorch when we have IG for PyTorch).
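As a rough, independent illustration of that coupling, a sketch of what the wrapper's `explain()` could do (the `tokenize` helper, the padding scheme, and the pass-through to the base class are all assumptions):

```python
import numpy as np
from alibi.explainers import IntegratedGradients

class IGText(IntegratedGradients):
    """Text-specific IG sketch that accepts arrays of raw strings."""

    def __init__(self, model, tokenize, max_len=100, **kwargs):
        super().__init__(model, **kwargs)
        self.tokenize = tokenize  # str -> list of token ids
        self.max_len = max_len

    def explain(self, texts, **kwargs):
        # couple tokenizer and model: raw strings -> padded id arrays
        ids = np.zeros((len(texts), self.max_len), dtype=np.int64)
        for i, text in enumerate(texts):
            toks = self.tokenize(text)[: self.max_len]
            ids[i, : len(toks)] = toks
        return super().explain(ids, **kwargs)
```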