LIME tokenizer for SentencePiece (or other tokenizer)

PAIR-code / lit

The Learning Interpretability Tool: Interactively analyze ML models to understand their behavior in an extensible and framework agnostic interface.

https://pair-code.github.io/lit

Apache License 2.0


LIME tokenizer for SentencePiece (or other tokenizer) #361

Open knok opened 3 years ago

knok commented 3 years ago

Currently, LIME seems to just use str.split, so it doesn't work with non-space-segmented languages like Japanese.

https://github.com/PAIR-code/lit/blob/3eb824b01e0f72a5486124b16056bf912465debc/lit_nlp/components/citrus/lime.py#L85

I tried to use it with a SentencePiece-based model (japanese-ALBERT), but it handles the whole input sentence as a single word. I think it would be good to use model._model.tokenizer.tokenize instead of str.split.
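A minimal illustration of the problem, using only plain Python (the example sentences are made up for this sketch and are not taken from LIT): str.split breaks on whitespace, so an English sentence yields one token per word while a Japanese sentence comes back as a single token.

```python
# Whitespace splitting works for space-segmented text...
english = "the cat sat on the mat"
english_tokens = english.split()

# ...but a Japanese sentence contains no spaces, so the whole
# sentence is returned as one "word".
japanese = "私は猫が好きです"
japanese_tokens = japanese.split()

print(len(english_tokens), english_tokens)
print(len(japanese_tokens), japanese_tokens)
```

With only one token, an ablation-style method has nothing to remove except the entire input, so it cannot attribute importance to any part of the sentence.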

knok commented 3 years ago

Unfortunately, the change didn't work well.

jameswex commented 3 years ago

For LIME (and other ablation-style techniques), we want to tokenize on full words, not on the word pieces that a model tokenizer might produce. Is there a simple way to do word-based tokenization for non-space-segmented languages?
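To make the words-versus-pieces distinction concrete, here is a rough sketch of merging SentencePiece-style pieces back into whole words. It assumes the usual SentencePiece convention that "▁" (U+2581) marks the start of a word; the helper name pieces_to_words and the piece lists are hypothetical and are not part of LIT or SentencePiece.

```python
from typing import List

WORD_START = "\u2581"  # "▁", assumed word-start marker in the pieces


def pieces_to_words(pieces: List[str]) -> List[str]:
    """Merge subword pieces into words using the word-start marker."""
    words: List[str] = []
    for piece in pieces:
        if piece.startswith(WORD_START) or not words:
            # A marked piece (or the very first piece) starts a new word.
            words.append(piece.lstrip(WORD_START))
        else:
            # An unmarked piece continues the current word.
            words[-1] += piece
    # Drop empty strings left behind by a bare "▁" piece.
    return [w for w in words if w]


# Hand-written piece lists, for illustration only.
english_words = pieces_to_words(["▁Hello", "▁wor", "ld"])
japanese_words = pieces_to_words(["▁", "私は", "猫", "が好き"])
print(english_words)
print(japanese_words)
```

This recovers words for space-segmented text, but for text without spaces the marker only appears at the start, so everything merges back into one "word". That is the same limitation as str.split.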

knok commented 3 years ago

I don't know about other non-space-segmented languages (maybe Chinese, Thai, ...), but at least in Japanese, "word" is a somewhat ambiguous concept. To segment text into words, you need a morphological analyzer like MeCab, together with dictionaries.

The Japanese BERT tokenizer in Transformers uses MeCab and SentencePiece, but the ALBERT tokenizer does not.
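One possible direction, sketched under stated assumptions: make the ablation step take the tokenizer as a parameter, so a morphological analyzer can be plugged in for Japanese. The names script_run_tokenize and leave_one_out are hypothetical and do not exist in LIT. The segmenter below is a crude stand-in that splits on changes of script (kanji, hiragana, katakana, Latin); it is not a morphological analyzer and will mis-segment many words.

```python
import re
from typing import Callable, List

# Crude stand-in for a real segmenter such as MeCab: one token per run
# of characters from the same script.
_SCRIPT_RUNS = re.compile(
    r"[\u4e00-\u9fff]+"   # kanji (CJK Unified Ideographs)
    r"|[\u3040-\u309f]+"  # hiragana
    r"|[\u30a0-\u30ff]+"  # katakana
    r"|[A-Za-z0-9]+"      # Latin letters and digits
    r"|\S"                # any other non-space character on its own
)


def script_run_tokenize(text: str) -> List[str]:
    """Split text into runs of same-script characters (heuristic only)."""
    return _SCRIPT_RUNS.findall(text)


def leave_one_out(
    text: str,
    tokenizer: Callable[[str], List[str]],
    joiner: str = "",
) -> List[str]:
    """Return one perturbed text per token, with that token removed."""
    tokens = tokenizer(text)
    return [
        joiner.join(tokens[:i] + tokens[i + 1:])
        for i in range(len(tokens))
    ]


sentence = "私は猫が好きです。"
tokens = script_run_tokenize(sentence)
perturbed = leave_one_out(sentence, script_run_tokenize)
print(tokens)
print(perturbed[0])
```

Passing str.split with joiner=" " reproduces the current whitespace behavior, while a MeCab-backed callable could be swapped in where accurate word boundaries matter. Note that the heuristic splits 好きです into 好 and きです, which shows why a dictionary-based analyzer is needed for real use.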