@mehmetcalikus Which tokenizer are you using? Are you using SentencePiece or HuggingFace?
I'm trying to use HuggingFace's tokenizer. The WordpieceTokenizer tokenize method I'm having problems with is at link (1) below. Before that method runs, the TextCleaner (2) step removes every non-English character from the text I give as input, which breaks my input. That's why I am not calling that function. But then the problem moves to the WordpieceTokenizer class, where it does the string comparisons.
1: https://github.com/deepjavalibrary/djl/blob/master/api/src/main/java/ai/djl/modality/nlp/bert/WordpieceTokenizer.java
2: https://github.com/deepjavalibrary/djl/blob/05ad89af6a6c8e184b1222cc47ec3a3513b762ad/api/src/main/java/ai/djl/modality/nlp/preprocess/TextCleaner.java
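For what it's worth, here is a minimal sketch (plain JDK only, no DJL classes) of what appears to be happening to the Turkish characters: NFD normalization decomposes Ş into a base S plus a combining cedilla, and stripping combining marks then leaves only the base letter.

import java.text.Normalizer;

String word = "Şehir";  // Turkish word containing Ş (U+015E)
// NFD decomposition splits Ş into S (U+0053) followed by a combining cedilla (U+0327)
String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
// Removing combining marks (Unicode category Mn) leaves only the plain ASCII letter
String stripped = decomposed.replaceAll("\\p{Mn}", "");
System.out.println(stripped);  // prints "Sehir"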
If you are using the HuggingFace tokenizer, you don't really need WordpieceTokenizer.java.
See: https://github.com/deepjavalibrary/djl/blob/master/extensions/tokenizers/src/test/java/ai/djl/huggingface/tokenizers/HuggingFaceTokenizerTest.java#L34-L35
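For reference, a minimal sketch of what that test does (the model id "bert-base-cased" and the sample sentence are just placeholders; this assumes the DJL tokenizers extension, ai.djl.huggingface:tokenizers, is on the classpath):

import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("bert-base-cased");
Encoding encoding = tokenizer.encode("Hello, how are you?");
String[] tokens = encoding.getTokens();  // WordPiece tokens, including special tokens
long[] ids = encoding.getIds();          // corresponding vocabulary ids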
By the way, did you try this in Python and confirm that it works there?
Yes, I tried the method you mentioned. It works for the bert-base-cased tokenizer. But when I try it like this for a tokenizer used for another language:
HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("dbmdz/bert-base-turkish-cased");
It gives an error:
Exception in thread "main" java.lang.RuntimeException: Model "dbmdz/bert-base-turkish-cased" on the Hub doesn't have a tokenizer
    at ai.djl.huggingface.tokenizers.jni.TokenizersLibrary.createTokenizer(Native Method)
    at ai.djl.huggingface.tokenizers.HuggingFaceTokenizer.newInstance(HuggingFaceTokenizer.java:61)
    at ai.djl.huggingface.tokenizers.HuggingFaceTokenizer.newInstance(HuggingFaceTokenizer.java:48)
    at MyTokenizer.main(MyTokenizer.java:16)
But this tokenizer actually belongs to a model on Hugging Face, as you can see in the link below. Any idea why it's not working? https://huggingface.co/dbmdz/bert-base-turkish-cased
By the way, I normally use these kinds of tokenizers and models from Python, but I am trying to migrate the model and tokenizer to C++ or Java for the production side. That's why I'm not running Python; it works fine in Python.
The DJL HuggingFace tokenizer uses the fast-tokenizer Rust API, which requires a tokenizer.json file. If that file is not found, the Python library falls back to the old pure-Python tokenizer implementation.
You should be able to manually download the vocab.txt file and use BertFullTokenizer.java; see the example code: https://github.com/deepjavalibrary/djl/blob/master/examples/src/main/java/ai/djl/examples/inference/BertClassification.java#L105-L110
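A rough sketch of that approach (the vocab.txt path is a placeholder, and the boolean argument to BertFullTokenizer is the lowercase flag, so false for a cased model; check the linked example for the exact setup):

import ai.djl.modality.nlp.DefaultVocabulary;
import ai.djl.modality.nlp.bert.BertFullTokenizer;
import java.nio.file.Paths;
import java.util.List;

// Build a vocabulary from the downloaded vocab.txt, then wrap it in BertFullTokenizer
DefaultVocabulary vocabulary = DefaultVocabulary.builder()
        .addFromTextFile(Paths.get("/path/to/vocab.txt"))
        .optUnknownToken("[UNK]")
        .build();
BertFullTokenizer tokenizer = new BertFullTokenizer(vocabulary, false);
List<String> tokens = tokenizer.tokenize("Merhaba, nasılsınız?");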
I guess that puts me back where I was having trouble in the first place. I had already tried that tokenizer at the beginning, by downloading the vocab.txt file and passing its path, but in that case I hit the problem I described in my first message.
@mehmetcalikus
Would you mind creating a test case in Java and in Python? I can take a look at the gap between the Python and Java implementations.
Sorry for the late reply. It works when I give the HuggingFaceTokenizer class the local path to tokenizer.json. When I download the tokenizer from Hugging Face with a Python script, the tokenizer.json file is created just as it is for "bert-base-cased"; I don't understand why it doesn't show up on the Hugging Face site.
Regarding the library's HuggingFaceTokenizer.java class, it is not as full-featured as the Python version; I think features such as padding and truncation are missing, but it does let me use the tokenizer I want.
Thanks for your advice and help.
@mehmetcalikus
Would you mind sharing your Python script? I'm interested in how the Python code generates the tokenizer.json file. Please also provide the Turkish input, so I can create a test case to cover this issue.
I downloaded the tokenizer with the following script. The generated folder contains tokenizer.json.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
tokenizer.save_pretrained("tr_bert_tokenizer")  # writes tokenizer.json into the folder
Then I used the downloaded json file in Java as follows.
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

Map<String, String> options = new ConcurrentHashMap<>();
options.put("addSpecialTokens", "true");
String inputs = "Merhaba, hoşgeldiniz nasılsınız? Nerede yaşıyorsunuz?";
Path path = Paths.get("/path/to/tokenizer.json");
HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance(path, options);
Encoding encoding = tokenizer.encode(inputs);
String[] tokens = encoding.getTokens();
long[] ids = encoding.getIds();
long[] attentionMask = encoding.getAttentionMask();
long[] tokenTypeIds = encoding.getTypeIds();
You can produce the same output by running the following Python script.
text = "Merhaba, hoşgeldiniz nasılsınız? Nerede yaşıyorsunuz?"
tokenized_text = tokenizer.tokenize(text)
encoded_text = tokenizer.encode_plus(text, add_special_tokens=True)
But as I mentioned before, features such as truncation and padding that exist in Python are not available in DJL's HuggingFace tokenizer (as far as I know). Maybe if these features are added, it will be possible to use non-English tokenizers in Java as fully as in Python.
@mehmetcalikus
DJL does support the add_special_tokens feature:
Map<String, String> options = new HashMap<>();
options.put("addSpecialTokens", "true");
Path path = Paths.get(...);
HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance(path, options);
@mehmetcalikus we have recently enhanced the DJL HuggingFace Tokenizer implementation with lots of new features, including padding and truncation, as well as an updated model zoo.
The updated tokenizer should be better suited for working with non-English tokenizers now. I would recommend trying out the updated HF Tokenizer from DJL, which can be found in version 0.19.0 or 0.20.0-SNAPSHOT. We plan to keep enhancing the HF Tokenizer with additional capabilities.
Beyond that, is there anything else we can help with here?
@mehmetcalikus
Since DJL 0.19.0, you can use HuggingFaceTokenizer.builder() to build your tokenizer,
see: https://github.com/deepjavalibrary/djl/blob/master/extensions/tokenizers/src/test/java/ai/djl/huggingface/tokenizers/HuggingFaceTokenizerTest.java#L287-L294
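A hedged sketch of the builder-based setup, including padding and truncation (the optPadding, optTruncation and optMaxLength option names are my best recollection of the 0.19.0 builder API; please verify them against the linked test, and the tokenizer.json path is a placeholder):

HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.builder()
        .optTokenizerPath(Paths.get("/path/to/tokenizer.json"))
        .optAddSpecialTokens(true)
        .optPadding(true)      // pad shorter inputs in a batch
        .optTruncation(true)   // truncate inputs longer than maxLength
        .optMaxLength(128)
        .build();
Encoding encoding = tokenizer.encode("Merhaba, hoşgeldiniz nasılsınız?");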
Feel free to reopen this issue if you have further questions
Hello, I am trying to use the tokenizer of a BERT model for a different language. The language I use has some letters with cedillas and other diacritics, for example Ş, ş, Ç, ç, Ö, ö. However, the WordPiece encoder treats these letters as two separate characters when comparing substrings. For example, when it encounters the character Ş, it tokenizes it as S and ##̧; in other words, the cedilla below the S is treated as a separate token.
What do I need to do to fix this?