dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

BERT/ELECTRA vocab file and tokenizer model. #1451

Closed: araitats closed this issue 3 years ago

araitats commented 3 years ago

Description

The default `learn_subword` returns special tokens in the angle-bracket style (`<unk>`, `<pad>`, `<bos>`, and `<eos>`). By convention, BERT uses `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, and `[MASK]`. How can we define the `--custom-special-tokens` flag so that the tokenizer model and vocab file use these BERT special tokens? What is the best practice for that?
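(For readers looking for a concrete starting point: one way to obtain BERT-style special tokens is to train the subword model with the `sentencepiece` Python package directly and register the tokens at training time. The sketch below is only a minimal example under assumptions: the corpus path `corpus.txt`, the model prefix `bert_spm`, and the vocabulary size are placeholders, and it sidesteps the `--custom-special-tokens` flag rather than documenting its exact argument format.)

```python
import sentencepiece as spm

# Train a BPE model whose special tokens follow the BERT convention.
# [UNK] and [PAD] replace the default <unk>/<pad> pieces; [CLS], [SEP],
# and [MASK] are registered as control symbols so they are reserved in
# the vocab and never produced from raw text.
spm.SentencePieceTrainer.train(
    input='corpus.txt',          # placeholder corpus path
    model_prefix='bert_spm',     # placeholder output prefix
    model_type='bpe',
    vocab_size=30000,            # placeholder vocab size
    unk_id=0, unk_piece='[UNK]',
    pad_id=1, pad_piece='[PAD]',
    bos_id=-1, eos_id=-1,        # disable <s>/</s>; BERT uses [CLS]/[SEP]
    control_symbols=['[CLS]', '[SEP]', '[MASK]'],
)
```

The resulting `bert_spm.model` and `bert_spm.vocab` then carry the BERT special tokens from the start, instead of the angle-bracket defaults.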


sxjscience commented 3 years ago

@araitats This should have been resolved. I'll close this issue.