VinAIResearch / PhoBERT

PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)

Can we use PhoBERT-base Tokenizer for PhoBERT-large model and vice versa? #44

Closed: ithieund closed this issue 1 year ago

ithieund commented 1 year ago

Hi @datquocnguyen, both the PhoBERT-base and PhoBERT-large tokenizers have the same vocab size (64001). So the question is: can we use the PhoBERT-base tokenizer for the PhoBERT-large model, and vice versa? In other words, can we tokenize the dataset once with one of them and reuse the prepared tokenized tensor dataset to fine-tune both PhoBERT-base and PhoBERT-large on downstream tasks, to save preparation time? :)

datquocnguyen commented 1 year ago

Yes, both pre-trained PhoBERT-base and PhoBERT-large models use the same tokenizer, `PhobertTokenizer`.
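For reference, here is a minimal sketch of how one could verify this, assuming the Hugging Face `transformers` library and the public `vinai/phobert-base` and `vinai/phobert-large` checkpoints on the Hub; the example sentence is hypothetical and already word-segmented, since PhoBERT expects word-segmented input (e.g. via VnCoreNLP's RDRSegmenter):

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with each checkpoint.
tok_base = AutoTokenizer.from_pretrained("vinai/phobert-base")
tok_large = AutoTokenizer.from_pretrained("vinai/phobert-large")

# Hypothetical word-segmented Vietnamese sentence.
sentence = "Chúng_tôi là những nghiên_cứu_viên ."

ids_base = tok_base(sentence)["input_ids"]
ids_large = tok_large(sentence)["input_ids"]

# Identical token IDs mean a dataset tokenized once can be
# reused to fine-tune both PhoBERT-base and PhoBERT-large.
assert ids_base == ids_large
print(ids_base)
```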

ithieund commented 1 year ago

Thank you!