@mehmetcalikus Which tokenizer are you using? Are you using SentencePiece or HuggingFace?
I'm trying to use HuggingFace's tokenizer. The WordpieceTokenizer tokenize method I'm having problems with is at link (1) below. Before that method runs, the TextCleaner (2) step removes every non-English character from the text I give as input, which breaks my input. That's why I am not calling that function. But then the problem moves to the WordpieceTokenizer class, where it does the string comparisons.
1: https://github.com/deepjavalibrary/djl/blob/master/api/src/main/java/ai/djl/modality/nlp/bert/WordpieceTokenizer.java
2: https://github.com/deepjavalibrary/djl/blob/05ad89af6a6c8e184b1222cc47ec3a3513b762ad/api/src/main/java/ai/djl/modality/nlp/preprocess/TextCleaner.java
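For what it's worth, here is a minimal sketch (plain JDK only, no DJL classes) of what appears to be happening to the Turkish characters: NFD normalization decomposes Ş into a base S plus a combining cedilla, and stripping combining marks then leaves only the base letter.

import java.text.Normalizer;

String word = "Şehir";  // Turkish word containing Ş (U+015E)
// NFD decomposition splits Ş into S (U+0053) followed by a combining cedilla (U+0327)
String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
// Removing combining marks (Unicode category Mn) leaves only the plain ASCII letter
String stripped = decomposed.replaceAll("\\p{Mn}", "");
System.out.println(stripped);  // prints "Sehir"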
If you are using the HuggingFace tokenizer, you don't really need WordpieceTokenizer.java.
See: https://github.com/deepjavalibrary/djl/blob/master/extensions/tokenizers/src/test/java/ai/djl/huggingface/tokenizers/HuggingFaceTokenizerTest.java#L34-L35
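For reference, a minimal sketch of what that test does (the model id "bert-base-cased" and the sample sentence are just placeholders; this assumes the DJL tokenizers extension, ai.djl.huggingface:tokenizers, is on the classpath):

import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("bert-base-cased");
Encoding encoding = tokenizer.encode("Hello, how are you?");
String[] tokens = encoding.getTokens();  // WordPiece tokens, including special tokens
long[] ids = encoding.getIds();          // corresponding vocabulary ids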
By the way, did you try this in Python and confirm that it works there?
Yes, I tried the method you mentioned. It works for the bert-base-cased tokenizer. But when I try it like this for a tokenizer used for another language:
HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("dbmdz/bert-base-turkish-cased");
It gives an error:
Exception in thread "main" java.lang.RuntimeException: Model "dbmdz/bert-base-turkish-cased" on the Hub doesn't have a tokenizer
    at ai.djl.huggingface.tokenizers.jni.TokenizersLibrary.createTokenizer(Native Method)
    at ai.djl.huggingface.tokenizers.HuggingFaceTokenizer.newInstance(HuggingFaceTokenizer.java:61)
    at ai.djl.huggingface.tokenizers.HuggingFaceTokenizer.newInstance(HuggingFaceTokenizer.java:48)
    at MyTokenizer.main(MyTokenizer.java:16)
But this tokenizer actually belongs to a model on Hugging Face, as you can see in the link below. Any idea why it's not working? https://huggingface.co/dbmdz/bert-base-turkish-cased
By the way, I normally use these kinds of tokenizers and models from Python, but I am trying to migrate the model and tokenizer to C++ or Java for the production side. That's why I'm not running Python; it works fine in Python.
The DJL HuggingFace tokenizer uses the fast-tokenizer Rust API, which requires a tokenizer.json file. If that file is not found, the Python library falls back to the old pure-Python tokenizer implementation.
You should be able to manually download the vocab.txt file and use BertFullTokenizer.java; see the example code: https://github.com/deepjavalibrary/djl/blob/master/examples/src/main/java/ai/djl/examples/inference/BertClassification.java#L105-L110
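A rough sketch of that approach (the vocab.txt path is a placeholder, and the boolean argument to BertFullTokenizer is the lowercase flag, so false for a cased model; check the linked example for the exact setup):

import ai.djl.modality.nlp.DefaultVocabulary;
import ai.djl.modality.nlp.bert.BertFullTokenizer;
import java.nio.file.Paths;
import java.util.List;

// Build a vocabulary from the downloaded vocab.txt, then wrap it in BertFullTokenizer
DefaultVocabulary vocabulary = DefaultVocabulary.builder()
        .addFromTextFile(Paths.get("/path/to/vocab.txt"))
        .optUnknownToken("[UNK]")
        .build();
BertFullTokenizer tokenizer = new BertFullTokenizer(vocabulary, false);
List<String> tokens = tokenizer.tokenize("Merhaba, nasılsınız?");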
I guess that puts me back where I was having trouble in the first place. I had already tried that tokenizer at the beginning, by downloading the vocab.txt file and passing its path, but in that case I hit the problem I described in my first message.
@mehmetcalikus
Would you mind creating a test case in Java and in Python? I can take a look at the gap between the Python and Java implementations.
Sorry for the late reply. It works when I give the HuggingFaceTokenizer class the local path to tokenizer.json. When I download the tokenizer from Hugging Face with a Python script, the tokenizer.json file is created just as it is for "bert-base-cased"; I don't understand why it doesn't show up on the Hugging Face site.
Regarding the library's HuggingFaceTokenizer.java class, it is not as full-featured as the Python version; I think features such as padding and truncation are missing, but it does let me use the tokenizer I want.
Thanks for your advice and help.
@mehmetcalikus
Would you mind sharing your Python script? I'm interested in how the Python code generates the tokenizer.json file. Please also provide the Turkish input, so I can create a test case to cover this issue.
I downloaded the tokenizer with the following script. The generated folder contains tokenizer.json.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
tokenizer.save_pretrained("tr_bert_tokenizer")  # writes tokenizer.json into the folder
Then I used the downloaded json file in Java as follows.
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

Map<String, String> options = new ConcurrentHashMap<>();
options.put("addSpecialTokens", "true");
String inputs = "Merhaba, hoşgeldiniz nasılsınız? Nerede yaşıyorsunuz?";
Path path = Paths.get("/path/to/tokenizer.json");
HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance(path, options);
Encoding encoding = tokenizer.encode(inputs);
String[] tokens = encoding.getTokens();
long[] ids = encoding.getIds();
long[] attentionMask = encoding.getAttentionMask();
long[] tokenTypeIds = encoding.getTypeIds();
You can produce the same output by running the following Python script.
text = "Merhaba, hoşgeldiniz nasılsınız? Nerede yaşıyorsunuz?"
tokenized_text = tokenizer.tokenize(text)
encoded_text = tokenizer.encode_plus(text, add_special_tokens=True)
But as I mentioned before, features such as truncation and padding that exist in Python are not available in DJL's HuggingFace tokenizer (as far as I know). Maybe if these features are added, it will be possible to use non-English tokenizers in Java as fully as in Python.
@mehmetcalikus
DJL does support the add_special_tokens feature:
Map<String, String> options = new HashMap<>();
options.put("addSpecialTokens", "true");
Path path = Paths.get(...);
HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance(path, options);
@mehmetcalikus we have recently enhanced the DJL HuggingFace Tokenizer implementation with lots of new features, including padding and truncation, as well as an updated model zoo.
The updated tokenizer should be better suited for working with non-English tokenizers now. I would recommend trying out the updated HF Tokenizer from DJL, which can be found in version 0.19.0 or 0.20.0-SNAPSHOT. We plan to keep enhancing the HF Tokenizer with additional capabilities.
Beyond that, is there anything else we can help with here?
@mehmetcalikus
Since DJL 0.19.0, you can use HuggingFaceTokenizer.builder() to build your tokenizer,
see: https://github.com/deepjavalibrary/djl/blob/master/extensions/tokenizers/src/test/java/ai/djl/huggingface/tokenizers/HuggingFaceTokenizerTest.java#L287-L294
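A hedged sketch of the builder-based setup, including padding and truncation (the optPadding, optTruncation and optMaxLength option names are my best recollection of the 0.19.0 builder API; please verify them against the linked test, and the tokenizer.json path is a placeholder):

HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.builder()
        .optTokenizerPath(Paths.get("/path/to/tokenizer.json"))
        .optAddSpecialTokens(true)
        .optPadding(true)      // pad shorter inputs in a batch
        .optTruncation(true)   // truncate inputs longer than maxLength
        .optMaxLength(128)
        .build();
Encoding encoding = tokenizer.encode("Merhaba, hoşgeldiniz nasılsınız?");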
Feel free to reopen this issue if you have further questions
Hello, I am trying to use the tokenizer of a BERT model for a different language. The language I use has some letters with cedillas and other diacritics, for example Ş, ş, Ç, ç, Ö, ö. However, the WordPiece encoder treats these letters as two separate characters when comparing substrings. For example, when it encounters the character Ş, it tokenizes it as S and ##̧; in other words, the cedilla below the S is treated as a separate token.
What do I need to do to fix this?