dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

BERT/ELECTRA vocab file and tokenizer model. #1451

Closed: araitats closed this issue 3 years ago

araitats commented 3 years ago

Description

The default `learn_subword` returns special tokens in the angle-bracket style (`<unk>`, `<pad>`, `<bos>`, and `<eos>`). By convention, BERT uses `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, and `[MASK]`. How can we define the `--custom-special-tokens` flag so that the tokenizer model and vocab file use these BERT special tokens? What is the best practice for that?
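(For readers looking for a concrete starting point: one way to obtain BERT-style special tokens is to train the subword model with the `sentencepiece` Python package directly and register the tokens at training time. The sketch below is only a minimal example under assumptions: the corpus path `corpus.txt`, the model prefix `bert_spm`, and the vocabulary size are placeholders, and it sidesteps the `--custom-special-tokens` flag rather than documenting its exact argument format.)

```python
import sentencepiece as spm

# Train a BPE model whose special tokens follow the BERT convention.
# [UNK] and [PAD] replace the default <unk>/<pad> pieces; [CLS], [SEP],
# and [MASK] are registered as control symbols so they are reserved in
# the vocab and never produced from raw text.
spm.SentencePieceTrainer.train(
    input='corpus.txt',          # placeholder corpus path
    model_prefix='bert_spm',     # placeholder output prefix
    model_type='bpe',
    vocab_size=30000,            # placeholder vocab size
    unk_id=0, unk_piece='[UNK]',
    pad_id=1, pad_piece='[PAD]',
    bos_id=-1, eos_id=-1,        # disable <s>/</s>; BERT uses [CLS]/[SEP]
    control_symbols=['[CLS]', '[SEP]', '[MASK]'],
)
```

The resulting `bert_spm.model` and `bert_spm.vocab` then carry the BERT special tokens from the start, instead of the angle-bracket defaults.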


sxjscience commented 3 years ago

@araitats This should have been resolved. I'll close this issue.