deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

Divergence between BertFullTokenizer and HuggingFace BertTokenizer. #2212

Open BigSquirrel2000 opened 1 year ago

BigSquirrel2000 commented 1 year ago

Description

BertFullTokenizer (more precisely WordPieceTokenizer.java) in DJL does not handle subword cases that are frequent in languages like German. The problem is not present in DJL's HuggingFaceTokenizer, but that tokenizer cannot be used on Android (https://github.com/deepjavalibrary/djl/issues/2170).

Expected Behavior

For a word that consists of subwords (e.g. "wochenendtagenwecker"), and assuming that "wecker" (without ##) is present in the vocab, "wochenendtagenwecker" should be tokenized as {"wo", "##chen", "##end", "##tagen", "wecker"}. At least, this is the standard expected behavior in modern implementations of the BERT tokenizer (both BertTokenizer and BertTokenizerFast in HuggingFace).

How to Reproduce?

import ai.djl.modality.nlp.DefaultVocabulary;
import ai.djl.modality.nlp.bert.BertFullTokenizer;
import java.util.Arrays;
import java.util.List;

  List<String> vocab = Arrays.asList("wo", "##chen", "##end", "##tagen", "wecker", "radiowecker");
  DefaultVocabulary vocabulary = new DefaultVocabulary(vocab);
  BertFullTokenizer tokenizer = new BertFullTokenizer(vocabulary, true);

  String a = "wochenendtagenwecker radiowecker";
  String[] expected = {"wo", "##chen", "##end", "##tagen", "wecker", "radiowecker"};
  // Actual output is ["[UNK]", "radiowecker"], not the expected tokens above
  List<String> tokenW = tokenizer.tokenize(a);

Steps to reproduce

Add DJL to your Gradle dependencies and run the snippet above ;)

What have you tried to solve it?

  1. From what I read, a trie or Aho-Corasick automaton in WordPieceTokenizer is probably needed to fix this (see the sketch below).
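
To sketch the idea (VocabTrie below is hypothetical, not an existing DJL class), a trie could return the longest matching vocab entry at a given position in a single forward pass instead of probing ever-shorter substrings:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not an existing DJL class: a trie that finds the
// longest vocab entry starting at a given position in one forward pass.
class VocabTrie {

    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isToken;
    }

    private final Node root = new Node();

    void add(String token) {
        Node node = root;
        for (char c : token.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isToken = true;
    }

    // Returns the length of the longest vocab token starting at start, or 0 if none.
    int longestMatch(String text, int start) {
        Node node = root;
        int best = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.isToken) {
                best = i - start + 1;
            }
        }
        return best;
    }
}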

Environment Info

The bug is not related to the environment.

siddvenk commented 1 year ago

This behavior seems in line with what I observe using the HuggingFace BertTokenizer with the same vocabulary as in the Java example you provided.

Here's the vocab.txt I'm using to instantiate the tokenizer

wo
##chen
##end
##tagen
wecker
radiowecker

Here's the Python code I'm running:

from transformers import BertTokenizer
tokenizer = BertTokenizer('vocab.txt')
tokens = tokenizer.tokenize("wochenendtagenwecker radiowecker")
print(tokens)

The above yields ['[UNK]', 'radiowecker'], which is the same result as the Java code you shared. In your test with HuggingFace in Python, did you use the same vocab as in your example, or the vocab from a pretrained tokenizer?

The solutions you shared (trie, Aho-Corasick) seem aimed at speeding up tokenization rather than changing its behavior (though I haven't studied those techniques in detail to determine whether they would also change which tokens are produced).

I found this implementation of WordpieceTokenizer in Java for tflite-android, which does pretty much the same thing we do with respect to handling subwords: https://github.com/huggingface/tflite-android-transformers/blob/master/bert/src/main/java/co/huggingface/android_transformers/bertqa/tokenization/WordpieceTokenizer.java
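
To make that concrete, here is a simplified sketch of the greedy longest-match-first WordPiece loop that both implementations roughly follow (not DJL's actual code). It shows why "wochenendtagenwecker" collapses to [UNK]: after "wo", "##chen", "##end", "##tagen" are consumed, no "##"-prefixed entry matches the remaining "wecker", and a single unmatched piece turns the whole word into the unknown token.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordPieceSketch {

    // Simplified greedy longest-match-first WordPiece, not DJL's actual code.
    static List<String> wordPiece(String word, Set<String> vocab, String unknown) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Try the longest remaining substring first, shrinking from the right.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate; // mid-word pieces must carry the ## prefix
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // One unmatched piece turns the whole word into the unknown token.
                return Arrays.asList(unknown);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(
                Arrays.asList("wo", "##chen", "##end", "##tagen", "wecker", "radiowecker"));
        // "wecker" without ## cannot match mid-word, so the first word collapses to [UNK].
        System.out.println(wordPiece("wochenendtagenwecker", vocab, "[UNK]")); // [[UNK]]
        System.out.println(wordPiece("radiowecker", vocab, "[UNK]"));          // [radiowecker]
    }
}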

BigSquirrel2000 commented 1 year ago

Hi, to replicate this behavior in Python, you need to use the add_tokens function.

vocab.txt can either include the subwords of "wecker":

wo
##chen
##end
##tagen
##we
##cker

or omit them:

wo
##chen
##end
##tagen

and then:

from transformers import BertTokenizer
tokenizer = BertTokenizer('vocab.txt')
tokenizer.add_tokens(["wecker", "radiowecker"])
print(tokenizer.tokenize("wochenendtagenwecker radiowecker"))
#['wo', '##chen', '##end', '##tagen', 'wecker', 'radiowecker']
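
On the DJL side, one way to approximate what add_tokens does is to split the added tokens out of each word before handing the rest to BertFullTokenizer. The helper below is only a hypothetical sketch of that idea, not an existing DJL API:

import ai.djl.modality.nlp.DefaultVocabulary;
import ai.djl.modality.nlp.bert.BertFullTokenizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: emulate HuggingFace add_tokens by cutting the added
// tokens out of each whitespace-separated word before calling BertFullTokenizer.
public class AddedTokensSketch {

    static List<String> tokenizeWithAddedTokens(
            BertFullTokenizer tokenizer, List<String> addedTokens, String text) {
        List<String> out = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            tokenizeWord(tokenizer, addedTokens, word, out);
        }
        return out;
    }

    private static void tokenizeWord(
            BertFullTokenizer tokenizer, List<String> addedTokens, String word, List<String> out) {
        if (word.isEmpty()) {
            return;
        }
        for (String added : addedTokens) {
            int idx = word.indexOf(added);
            if (idx >= 0) {
                // Emit the added token whole; recurse on what comes before and after it.
                tokenizeWord(tokenizer, addedTokens, word.substring(0, idx), out);
                out.add(added);
                tokenizeWord(tokenizer, addedTokens, word.substring(idx + added.length()), out);
                return;
            }
        }
        out.addAll(tokenizer.tokenize(word));
    }

    public static void main(String[] args) {
        List<String> vocab =
                Arrays.asList("wo", "##chen", "##end", "##tagen", "wecker", "radiowecker");
        BertFullTokenizer tokenizer = new BertFullTokenizer(new DefaultVocabulary(vocab), true);
        List<String> addedTokens = Arrays.asList("radiowecker", "wecker"); // check longer tokens first

        System.out.println(tokenizeWithAddedTokens(
                tokenizer, addedTokens, "wochenendtagenwecker radiowecker"));
        // [wo, ##chen, ##end, ##tagen, wecker, radiowecker]
    }
}

Checking the longer added token ("radiowecker") before the shorter one ("wecker") matters, so that a word containing the longer token is not split prematurely.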