dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Return tokenizer for pre-trained models #1043

Open eric-haibin-lin opened 4 years ago

eric-haibin-lin commented 4 years ago

Users currently need to write additional code to construct the tokenizer after they get a pre-trained model. The tokenizer is usually the same as the one used during pre-training. Should our get_model API also return the tokenizer?
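For context, a rough sketch of what this could look like (the three-value return is only the proposal, not the current API, and the model/dataset names are just illustrative):

```python
import gluonnlp as nlp

# Today: get_model hands back the model and the vocab; the user still has
# to build the matching tokenizer by hand.
model, vocab = nlp.model.get_model(
    'bert_12_768_12',
    dataset_name='book_corpus_wiki_en_uncased',
    pretrained=True)
tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)

# Proposal (sketch only): also return the tokenizer that was used during
# pre-training, e.g.
# model, vocab, tokenizer = nlp.model.get_model('bert_12_768_12', ...)
```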

leezu commented 4 years ago

Yes, it's commonly needed to access the pre-trained tokenizer. Some APIs in the scripts folder already return the tokenizer as part of get_model. For example:

https://github.com/dmlc/gluon-nlp/blob/aff29217d4a1233d9b5a069366e2b80e30184e30/scripts/language_model/transformer/model.py#L198-L252

But I'm not sure if this is a good API.

eric-haibin-lin commented 4 years ago

I think it's more convenient than the current one. Users are free to discard it if they want to change it. Are there use cases where we want to change the tokenizer?

eric-haibin-lin commented 4 years ago

For reference, fairseq's RoBERTa encoding function takes a string as input. The model object holds the tokenizer and applies it directly to the input string. It might be hard to modify the tokenization method there, but I do see that they couple the pre-trained model and the tokenizer.
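The fairseq pattern referenced above, roughly (following the fairseq RoBERTa README; treat the exact calls as illustrative):

```python
import torch

# The RoBERTa hub object bundles the BPE tokenizer with the model, so
# encode() goes straight from a raw string to subword ids and the user
# never constructs a tokenizer separately.
roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')
roberta.eval()

tokens = roberta.encode('Hello world!')       # tensor of subword ids
features = roberta.extract_features(tokens)   # last-layer hidden states
print(roberta.decode(tokens))                 # back to 'Hello world!'
```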

leezu commented 4 years ago

> I think it's more convenient than the current one. Users are free to discard it if they want to change it. Are there use cases where we want to change the tokenizer?

Yes, but those involve initializing the model randomly and training it.

A better API may be to have separate functions for obtaining each of the pretrained model, the vocab, and the tokenizer.
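One way such a split could look; the function names below are hypothetical, purely to illustrate the separation:

```python
import gluonnlp as nlp

# Hypothetical factored API: each artifact is fetched on its own, keyed by
# the same pre-trained model name, so users can mix and match freely.
model_name = 'bert_12_768_12_book_corpus_wiki_en_uncased'

model = nlp.get_pretrained_model(model_name)   # hypothetical helper
vocab = nlp.get_vocab(model_name)              # hypothetical helper
tokenizer = nlp.get_tokenizer(model_name)      # hypothetical helper

# Someone training a custom tokenizer from scratch would simply skip the
# last call and plug in their own.
```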

sxjscience commented 4 years ago

In terms of the pretrained model, we should bundle the model weights, the vocab, and the tokenizer model together.

This should be the format used in TF Hub: https://tfhub.dev/google/albert_base/2

If the user needs to pretrain from scratch, we will provide another script for that and teach the user how to train their own subword tokenizer and train with their own data.
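For the "train your own subword tokenizer" part, a minimal sketch with the sentencepiece package (assumes a recent sentencepiece version; file names and vocab size are placeholders):

```python
import sentencepiece as spm

# Train a subword model on the user's own corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input='corpus.txt',         # placeholder path to the raw text corpus
    model_prefix='my_subword',  # writes my_subword.model / my_subword.vocab
    vocab_size=8000,
    model_type='unigram')

# Load the freshly trained model and tokenize new text with it.
sp = spm.SentencePieceProcessor(model_file='my_subword.model')
print(sp.encode('Hello world!', out_type=str))
```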