PAIR-code / lit

The Learning Interpretability Tool: Interactively analyze ML models to understand their behavior in an extensible and framework agnostic interface.
Apache License 2.0

LIME tokenizer for SentencePiece (or other tokenizer) #361

Open knok opened 3 years ago

knok commented 3 years ago

Currently, the LIME tokenizer seems to just use str.split, so it doesn't work with non-space-segmented languages like Japanese.

I tried to use it with a SentencePiece-based model (japanese-ALBERT), but it handles the input sentence as a single word. I think it would be good to use model._model.tokenizer.tokenize instead of str.split.
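
A minimal sketch of the problem described above, using plain Python only. It is not LIT's actual LIME code; the `ablations` helper and the example sentences are hypothetical and only illustrate why whitespace splitting leaves an ablation-style method with nothing useful to remove in Japanese text.

```python
# Whitespace splitting works for English but not for Japanese,
# because Japanese text contains no spaces between words.
english = "the cat sat on the mat"
japanese = "吾輩は猫である"  # "I am a cat", written without spaces

english_tokens = english.split()
japanese_tokens = japanese.split()


def ablations(tokens):
    """Return one variant per position, each with that single token removed.

    Hypothetical helper: a simplified stand-in for the perturbations that
    ablation-style methods such as LIME generate.
    """
    return [tokens[:i] + tokens[i + 1:] for i in range(len(tokens))]


english_variants = ablations(english_tokens)
japanese_variants = ablations(japanese_tokens)

print(len(english_tokens), len(english_variants))
print(len(japanese_tokens), japanese_variants)
```

With whitespace splitting, the Japanese sentence is a single token, so the only possible ablation is the empty input and no per-word attribution can be computed.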

knok commented 3 years ago

Unfortunately, the change didn't work well.

jameswex commented 3 years ago

For LIME (and other ablation-style techniques), we want to tokenize on full words, not on the word pieces that the model tokenizer might produce. Is there a simple way to do word-based tokenization for non-space-segmented languages?

knok commented 3 years ago

I don't know about other non-space-segmented languages (maybe Chinese, Thai, ...), but at least in Japanese, "word" is a somewhat ambiguous concept. To segment text into words, you need to use a morphological analyzer like MeCab, together with dictionaries.

The Japanese BERT tokenizer in Transformers uses MeCab and SentencePiece, but ALBERT does not.
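
To illustrate the idea of dictionary-based word segmentation mentioned above, here is a rough sketch using greedy longest-match against a tiny word list. This is a toy stand-in, not MeCab and not how LIT works: `longest_match_segment`, `TOY_VOCAB`, and `ablate_words` are all hypothetical names, and a real morphological analyzer uses a full dictionary and a statistical model rather than greedy matching. Rejoining the remaining words without spaces is also an assumption that suits Japanese.

```python
def longest_match_segment(text, vocab):
    """Greedy forward longest-match segmentation.

    At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing matches.
    """
    max_len = max((len(w) for w in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def ablate_words(words):
    """Return one sentence per word, with that word removed.

    Words are rejoined without spaces (an assumption for Japanese text).
    """
    return ["".join(words[:i] + words[i + 1:]) for i in range(len(words))]


# Toy dictionary, for illustration only.
TOY_VOCAB = {"吾輩", "は", "猫", "で", "ある"}

words = longest_match_segment("吾輩は猫である", TOY_VOCAB)
variants = ablate_words(words)

print(words)
print(variants)
```

Any function that maps a string to a list of words could be swapped in for `longest_match_segment`, which is why a pluggable tokenizer hook would cover MeCab and similar analyzers.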