dianna-ai / dianna

Deep Insight And Neural Network Analysis
https://dianna.readthedocs.io
Apache License 2.0

customizing reconstruction of sentences from tokens #242

Closed APJansen closed 2 years ago

APJansen commented 2 years ago

Currently, sentences seem to be rebuilt from the token strings by simply joining them with spaces in between, and there is no clean way to customize this. Often that is not what you want: I'm currently trying to apply it to a model that works on string representations of molecules ("SMILES"), where you don't want the spaces. Granted, that is not necessarily what this was intended for, but many BERT-based models also use sub-word tokenization, for example tokenizing "playing" as ["play", "##ing"], with the ## indicating that there is no space. This would get expanded into "play ##ing" instead of just "playing".
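
For concreteness (the token lists below are just the examples from above, hard-coded for illustration, not the output of any particular tokenizer):

```python
# Joining tokens with spaces breaks tokenizations where the tokens are not
# whitespace-separated words. Both token lists are hard-coded examples.

smiles_tokens = ["C", "C", "(", "=", "O", ")", "O"]  # hypothetical tokenization of acetic acid, "CC(=O)O"
bert_tokens = ["play", "##ing"]                      # sub-word tokenization of "playing"

print(" ".join(smiles_tokens))  # "C C ( = O ) O" -- the spaces corrupt the SMILES string
print(" ".join(bert_tokens))    # "play ##ing"    -- instead of "playing"
```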

Perhaps _create_masked_sentences in RISE can be split into something like _mask_tokens, which outputs a list, and _reconstruct_sentences, which can be configured with an argument that defaults to the current behavior? (I'm looking at the transformers library; there, tokenizers come equipped with a convert_tokens_to_string method, which does exactly this.)
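
Just a sketch of what that split could look like; the bodies are my guesses at the current behavior, not dianna's actual code, and the default reconstruction only mimics the current space-joining:

```python
# Illustrative only: a possible split of _create_masked_sentences.

def _mask_tokens(tokens, masks, mask_string):
    """Return one token list per mask, with masked positions replaced by mask_string."""
    return [
        [token if keep else mask_string for token, keep in zip(tokens, mask)]
        for mask in masks
    ]


def _reconstruct_sentences(masked_token_lists, reconstruct=" ".join):
    """Join each token list back into a sentence; the default reproduces the current
    space-joining, but reconstruct could be e.g. a transformers tokenizer's
    convert_tokens_to_string."""
    return [reconstruct(tokens) for tokens in masked_token_lists]
```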

elboyran commented 2 years ago

@APJansen , thanks for the comment. Would you like to submit a pull request?

APJansen commented 2 years ago

In RISE there are 3 aspects of tokenization that are all handled differently: how the input sentence is split into tokens, how the (masked) tokens are joined back into a sentence, and the mask_string used to replace masked tokens.

How about we unite these into a Tokenizer class, with methods tokenize and detokenize and an attribute mask_token? Using mask_token instead of mask_string makes it compatible with the transformers library, which has such a class with this attribute and a tokenize method. It doesn't have a detokenize method or anything similar.
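
Roughly something like this (just a sketch of the interface, with a word-level default that only reproduces the current space-based behavior):

```python
from abc import ABC, abstractmethod


class Tokenizer(ABC):
    """Sketch of the proposed interface; attribute and method names follow
    the transformers library where possible."""

    def __init__(self, mask_token):
        self.mask_token = mask_token

    @abstractmethod
    def tokenize(self, sentence):
        """Split a sentence into a list of token strings."""

    @abstractmethod
    def detokenize(self, tokens):
        """Join a list of token strings back into a sentence."""


class WordTokenizer(Tokenizer):
    """Word-level default reproducing the current behavior:
    split on whitespace, join with spaces."""

    def tokenize(self, sentence):
        return sentence.split()

    def detokenize(self, tokens):
        return " ".join(tokens)
```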

But do we really need to detokenize? Right now it does this:

  1. tokenize the input string (sentence)
  2. replace some tokens with the mask tokens
  3. detokenize back into a single string (sentence)
  4. (within the model itself) tokenize again, recovering the tokens from step 2
  5. apply the model on the tokens

So clearly steps 3 and 4 could be skipped, at the cost of requiring the model to run on a list of token strings rather than on sentences. I have no idea whether that is an issue for the models you expect this to be used on. In the transformers library, models run on lists of token ids rather than token strings, so there the issue would be solved simply with the convert_tokens_to_ids method already present in their tokenizer class.
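
For the transformers case, skipping steps 3 and 4 would look roughly like this (the model name and the random masking are just placeholders):

```python
# Rough illustration of stopping at token ids instead of rebuilding a string.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("the cat was playing outside")                 # step 1
keep = np.random.default_rng(0).random(len(tokens)) > 0.5                  # which tokens to keep
masked = [t if k else tokenizer.mask_token for t, k in zip(tokens, keep)]  # step 2
input_ids = tokenizer.convert_tokens_to_ids(masked)                        # replaces steps 3 and 4
# input_ids (plus special tokens) can now be fed to the model directly     # step 5
```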

The reason I'm thinking about this is that I worry about how to make sure that step 4 above actually gives the same result as step 2. If tokens are words, then just pasting them back with spaces in between should work. For SMILES, just pasting them back with nothing in between should probably work. For tokenizers using subwords, as I mentioned above, you can also cook up something, like pasting with a space in between unless the next token starts with ##, in which case remove the ## and paste without a space (see the sketch below). But I don't know whether that covers everything, and not doing this at all seems much cleaner and more likely to extend to other tokenizers we don't know about. (So in this case we'd replace the detokenize method with the convert_tokens_to_ids method.)
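
To illustrate, that subword heuristic would be something like this (only a sketch; it certainly doesn't cover every tokenizer out there):

```python
def detokenize_wordpiece(tokens):
    """Heuristic sketch: paste with a space in between, unless the token starts
    with ##, in which case drop the ## and paste without a space."""
    sentence = ""
    for token in tokens:
        if token.startswith("##"):
            sentence += token[2:]
        else:
            sentence += (" " if sentence else "") + token
    return sentence


detokenize_wordpiece(["play", "##ing", "outside"])  # -> "playing outside"
```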

Other than this, is there any need to detokenize? I thought you'd need it for the saliency map, but I don't yet understand what that's doing; it's only using the predictions, but that's probably only possible when sticking to word-level tokens.

And ideally we'd use the same Tokenizer class for the LIME method as well, which currently has yet another syntax: the same mask_string, but I guess char_level and split_expression control the (de)tokenization.

Sorry for the rambling, but it would be good to discuss this before I go implement something that's not suitable.

Correction: the transformers library's Tokenizer does have a method to detokenize; it's called convert_tokens_to_string.

elboyran commented 2 years ago

Hi @APJansen, let's label it a standup issue.

APJansen commented 2 years ago

OK, can you invite me to the standups?

APJansen commented 2 years ago

I've added a basic tokenizer class along with a word-based one that can be used as the default, plus a notebook running some tests on them. Before integrating it into RISE I need to fix #300 so that I can test what I'm doing.

The tests on the tokenizers reveal an issue with the huggingface tokenizer for language (though I think not with the one for chemistry): for a list of tokens with some of them masked, de-tokenizing and then tokenizing again (which is what RISE currently does) does not give back the same tokens. In particular, a subword token that is a suffix, such as ##ing, gets mapped to ing (without the ## indicating that it's a suffix). Since this doesn't occur in the chemistry tokenizer, I will leave it for later.
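
For reference, the round-trip check looks roughly like this (the model name and the masking pattern are arbitrary; the point is only that the two token lists need not match):

```python
# Reproduce the detokenize-then-retokenize round trip done in RISE.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("the cats were playing outside")
masked = [tokenizer.mask_token if i % 2 == 0 else t for i, t in enumerate(tokens)]

sentence = tokenizer.convert_tokens_to_string(masked)  # de-tokenize (step 3)
retokenized = tokenizer.tokenize(sentence)             # tokenize again (step 4)

print(masked)
print(retokenized)  # can differ, e.g. a ##-prefixed suffix coming back without its ##
```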