dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

nlp.models.bert.get_pretrained_bert provides slow tokenizer #1538

Closed · leezu closed this issue 3 years ago

leezu commented 3 years ago

`nlp.models.bert.get_pretrained_bert` returns a `LegacyHuggingFaceTokenizer` instead of a fast tokenizer built on https://github.com/huggingface/tokenizers.
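
For context, here is a minimal sketch of how to inspect the returned tokenizer; the model name and the layout of the returned tuple are assumptions based on the GluonNLP 1.x API and may differ in your version:

```python
# A minimal sketch, assuming the GluonNLP 1.x API for get_pretrained_bert;
# the model name and return-tuple layout here are assumptions.
from gluonnlp.models.bert import get_pretrained_bert

cfg, tokenizer, backbone_params_path, mlm_params_path = get_pretrained_bert(
    'google_en_uncased_bert_base')
print(type(tokenizer))  # reported here as a legacy (slow) wrapper, not a fast tokenizer
```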

sxjscience commented 3 years ago

I think both use `tokenizers`, but the legacy one follows the API of an older version of HF tokenizers.

leezu commented 3 years ago

`transformers.PreTrainedTokenizer` is the "base class for all slow tokenizers," whereas `transformers.PreTrainedTokenizerFast` is the "base class for all fast tokenizers (wrapping HuggingFace tokenizers library)," so only the latter uses HF tokenizers.
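
To illustrate the slow/fast split in `transformers` (the model name `bert-base-uncased` is just an example, not something the issue mentions):

```python
# A sketch of the slow vs. fast tokenizer classes in HuggingFace transformers.
from transformers import BertTokenizer, BertTokenizerFast

slow = BertTokenizer.from_pretrained("bert-base-uncased")      # subclass of PreTrainedTokenizer (pure Python)
fast = BertTokenizerFast.from_pretrained("bert-base-uncased")  # subclass of PreTrainedTokenizerFast

print(slow.is_fast)                  # False
print(fast.is_fast)                  # True
print(type(fast.backend_tokenizer))  # tokenizers.Tokenizer from the Rust `tokenizers` library
```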

sxjscience commented 3 years ago

We are calling the tokenizers package directly in the implementation.
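
For reference, calling the `tokenizers` package directly looks roughly like the sketch below; the vocabulary file path is hypothetical and not something shipped with GluonNLP:

```python
# A sketch of using the `tokenizers` package directly; "vocab.txt" is a
# hypothetical WordPiece vocabulary file used only for illustration.
from tokenizers import BertWordPieceTokenizer

tok = BertWordPieceTokenizer("vocab.txt", lowercase=True)
enc = tok.encode("Gluon makes NLP easy.")
print(enc.tokens)   # WordPiece tokens, e.g. ['[CLS]', 'gluon', ..., '[SEP]']
print(enc.offsets)  # character offsets produced by the Rust backend
```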