@APJansen, thanks for the comment. Would you like to submit a pull request?
In RISE there are 3 aspects of tokenization that are all handled differently:

- `model_or_function`
- `mask_string` is an argument defaulting to `"UNKWORDZ"` (this should match the actual mask token the model was trained with, right?)

How about we unite these into a `Tokenizer` class, with methods `tokenize` and `detokenize`, and an attribute `mask_token`?
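A rough sketch of what that could look like (the class, method, and attribute names are the ones proposed above; the word-level default is a hypothetical illustration, not DIANNA's actual code):

```python
from abc import ABC, abstractmethod
from typing import List


class Tokenizer(ABC):
    """Bundles masking and (de)tokenization in one place."""

    def __init__(self, mask_token: str):
        # should match the mask token the model was actually trained with
        self.mask_token = mask_token

    @abstractmethod
    def tokenize(self, sentence: str) -> List[str]:
        """Split a sentence into a list of token strings."""

    @abstractmethod
    def detokenize(self, tokens: List[str]) -> str:
        """Reassemble a list of token strings into a sentence."""


class WordTokenizer(Tokenizer):
    """Word-level default: split on whitespace, join with spaces."""

    def __init__(self, mask_token: str = "UNKWORDZ"):
        super().__init__(mask_token)

    def tokenize(self, sentence: str) -> List[str]:
        return sentence.split()

    def detokenize(self, tokens: List[str]) -> str:
        return " ".join(tokens)
```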
Using `mask_token` instead of `mask_string` makes it compatible with the transformers library, which has such a class with this attribute and a `tokenize` method.
It doesn't have a `detokenize` method or anything similar.
But do we really need to detokenize? Right now it does this:

1. take the input sentence and tokenize it into a list of token strings
2. replace the tokens selected for masking with the mask string
3. detokenize each masked token list back into a sentence
4. the model tokenizes those sentences again before running on them
So clearly here steps 3 and 4 could be skipped, at the cost of requiring the model to be able to run on a list of token strings rather than on sentences. I have no idea whether this is an issue for the models you think this will be used on. In the transformers library, models run on lists of token ids rather than token strings, so there this would be solved simply with the `convert_tokens_to_ids` method already present in their tokenizer class.
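For illustration (`convert_tokens_to_ids` and `mask_token` are the real transformers API; the checkpoint name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
masked_tokens = ["play", tokenizer.mask_token, "games"]
# token ids the model can consume directly, no detokenization needed
ids = tokenizer.convert_tokens_to_ids(masked_tokens)
```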
The reason I'm thinking about this is that I worry about how to make sure that step 4 above actually gives the same result as step 2. If tokens are words, then just pasting them back with spaces in between should work. For SMILES, just pasting them back without anything in between should probably work. For tokenizers using subwords, as I mentioned above, you can also cook up something: paste with a space in between, unless the second token starts with `##`, in which case remove the `##` and paste without a space (a sketch of this is below). But I don't know if that covers everything, and not detokenizing at all seems much cleaner and more likely to extend to other tokenizers we don't know about. (So in this case we'd replace the `detokenize` method with the `convert_tokens_to_ids` method, as in the example above.)
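For reference, that `##` pasting heuristic could look something like this (a sketch that handles WordPiece-style suffixes only, which is exactly why it feels fragile):

```python
def detokenize_wordpiece(tokens):
    """Join tokens with spaces, except '##' suffixes, which attach directly."""
    sentence = ""
    for token in tokens:
        if token.startswith("##"):
            sentence += token[2:]  # suffix: append without a space
        elif sentence:
            sentence += " " + token
        else:
            sentence = token
    return sentence


detokenize_wordpiece(["play", "##ing", "games"])  # -> 'playing games'
```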
Other than this, is there any need to detokenize? I thought you'd need it for the saliency map, but I don't yet understand what that's doing; it only uses the predictions, which is probably only possible when sticking to word-level tokens.
And ideally we'd use the same `Tokenizer` class also for the LIME method, which currently has yet another syntax: the same `mask_string`, but I guess `char_level` and `split_expression` control the (de)tokenization.
Sorry for the rambling, but it would be good to discuss this before I go implement something that's not suitable.
Correction: the transformers library's `Tokenizer` does have a method to detokenize; it's called `convert_tokens_to_string`.
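For example (real transformers API; the checkpoint name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
tokens = tokenizer.tokenize("playing games")   # ['play', '##ing', 'games']
tokenizer.convert_tokens_to_string(tokens)     # 'playing games'
```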
Hi @APJansen, let's label it a standup issue.
Ok, can you invite me for the standups?
I've added a basic tokenizer class with a word based one that can be used as default, along with a notebook doing some tests on them. Before integrating it into RISE I need to fix #300 so that I can test what I'm doing.
The tests on the tokenizers reveal an issue with the huggingface tokenizer for language (though I think not with the one for chemistry): for a list of tokens with some masked, de-tokenizing it and tokenizing again (which is what RISE currently does) does not give back the same tokens. In particular, a subword token that is a suffix, such as `##ing`, gets mapped to `ing` (without the `##` indicating that it's a suffix). Since this doesn't occur in the chemistry tokenizer, I will leave this for later.
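A minimal sketch of the round trip that goes wrong (assuming a BERT-style WordPiece tokenizer; the exact output depends on the checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
tokens = tokenizer.tokenize("playing")                # ['play', '##ing']
masked = [tokenizer.mask_token, "##ing"]              # mask the first sub-word
sentence = tokenizer.convert_tokens_to_string(masked)
retokenized = tokenizer.tokenize(sentence)
# retokenized is not ['[MASK]', '##ing']: the '##' suffix marker is lost,
# so 'ing' comes back as a stand-alone token rather than a suffix
```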
Currently, it seems sentences are rebuilt from the token strings by simply joining them with spaces in between, and there is no clean way to customize this. Often this is not what you want: I'm now trying to apply it to a model that works on string representations of molecules ("SMILES"), and there you don't want the spaces. Granted, this is not necessarily what it was intended for, but many BERT-based models also use sub-word tokenization, for example tokenizing "playing" as `["play", "##ing"]`, with the `##` indicating there is no space. This would get expanded into "play ##ing" instead of just "playing".
Perhaps `_create_masked_sentences` in RISE can be split into something like `_mask_tokens`, which outputs a list, and `_reconstruct_sentences`, which can be set with an argument that defaults to the current behavior? (I'm looking at the transformers library; there, tokenizers come equipped with a `convert_tokens_to_string` method, which does exactly this.)
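A sketch of that split (`_mask_tokens` and `_reconstruct_sentences` are the hypothetical names from above, not existing code; the default join mirrors the current space-joining behavior):

```python
import numpy as np


def _mask_tokens(tokens, masks, mask_string="UNKWORDZ"):
    """Return one masked token list per row of the boolean `masks` array."""
    return [
        [token if keep else mask_string for token, keep in zip(tokens, mask_row)]
        for mask_row in masks
    ]


def _reconstruct_sentences(masked_token_lists, join_tokens=" ".join):
    """Rebuild sentences; `join_tokens` defaults to the current space join but
    could be swapped for e.g. transformers' convert_tokens_to_string."""
    return [join_tokens(tokens) for tokens in masked_token_lists]


tokens = ["play", "##ing", "games"]
masks = np.array([[True, False, True]])
_reconstruct_sentences(_mask_tokens(tokens, masks))
# -> ['play UNKWORDZ games']
```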