knok opened this issue 3 years ago
Unfortunately, the change didn't work well.
For LIME (and other ablation-style techniques), we want to tokenize on full words and not word pieces, which the model tokenizer might do. Is there a simple way to do word-based tokenization for non-space segmented languages?
I don't know much about other non-space-segmented languages (maybe Chinese, Thai, ...), but in Japanese at least, "word" is a somewhat ambiguous concept. To segment text into words, you need a morphological analyzer like MeCab together with its dictionaries.
The Japanese BERT tokenizer in Transformers uses MeCab and SentencePiece, but the ALBERT tokenizer does not.
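For example, word segmentation with MeCab looks like this (a minimal sketch; it assumes the fugashi MeCab wrapper and a UniDic dictionary such as unidic-lite are installed, and the example sentence and expected output are mine):

```python
import fugashi  # thin MeCab wrapper; e.g. pip install "fugashi[unidic-lite]"

tagger = fugashi.Tagger()

sentence = "これはテストです。"  # "This is a test."
words = [token.surface for token in tagger(sentence)]
print(words)  # e.g. ['これ', 'は', 'テスト', 'です', '。']
```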
Currently, it seems to just use `str.split`, so it doesn't work with non-space-segmented languages like Japanese: https://github.com/PAIR-code/lit/blob/3eb824b01e0f72a5486124b16056bf912465debc/lit_nlp/components/citrus/lime.py#L85
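To illustrate the failure mode (hypothetical example sentence):

```python
# Japanese text has no spaces between words, so str.split returns the whole
# sentence as a single "word", while an English sentence splits as expected.
print("これはテストです。".split())  # ['これはテストです。']
print("this is a test".split())      # ['this', 'is', 'a', 'test']
```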
I tried to use it with a SentencePiece-based model (japanese-ALBERT), but it handled the input sentence as a single word. I think it would be good to use `model._model.tokenizer.tokenize` instead of `str.split`.
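A possible middle ground, given the concern above about full words vs. word pieces, would be a word-level tokenizer with the same interface as `str.split` (takes a string, returns a list of strings). This is only a sketch: `ja_word_tokenize` is a hypothetical helper, and I'm not claiming the citrus LIME code already accepts a pluggable tokenizer.

```python
import fugashi  # MeCab wrapper; assumes a UniDic dictionary such as unidic-lite is installed

_tagger = fugashi.Tagger()

def ja_word_tokenize(text: str) -> list:
    """Hypothetical drop-in for str.split: returns full words, not word pieces."""
    return [token.surface for token in _tagger(text)]

# Same call shape as str.split, so it could stand in wherever the sentence is
# currently split into words before masking:
print(ja_word_tokenize("これはテストです。"))  # e.g. ['これ', 'は', 'テスト', 'です', '。']
```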