VinAIResearch / PhoBERT

PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)
MIT License

BPE Tokenizer #38

Closed: Quang-elec44 closed this issue 2 years ago

Quang-elec44 commented 2 years ago

Hi, I'm having a problem with your BPE tokenizer. Given the token 'Trần_văn_thời', I used AutoTokenizer.from_pretrained('vinai/phobert-base', use_fast=False) to convert this token into ids, and the result was [1359, 3, 8915], which is ['Trần@@', '', 'ời']. However, when I changed the token to 'Trần_vănThời', I got [1359, 16398, 4834], which is ['Trần@@', 'văn_@@', 'Thời']. As another example, the token 'Lê_vănTám' is tokenized into [1475, 16398, 6813], which is ['Lê@@', 'văn_@@', 'Tám']. So the result for 'Trần_văn_thời' looks wrong to me. Can you give me an explanation for this? Thank you very much.

datquocnguyen commented 2 years ago

No, it's not. [1359, 3, 8915] is in fact ['Trần_@@', '<unk>', 'ời'], where id 3 is the unknown token. I find no issue with this output, as "văn_th" is an unknown subword.
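
To illustrate the point, here is a minimal sketch of how a BPE piece that is missing from the vocabulary ends up as the unknown id. It does not call the real PhoBERT tokenizer: `toy_vocab`, `pieces_to_ids`, and the segmentation `['Trần_@@', 'văn_th@@', 'ời']` (including the `@@` suffix on the middle piece) are assumptions made for illustration, and only the ids quoted in this thread are used.

```python
# Toy illustration only: not the actual PhoBERT vocabulary or BPE code.
UNK_ID = 3  # id 3 is the one reported in this thread; assumed to be the unknown token

# Hypothetical mini-vocabulary built from the ids quoted in this issue.
toy_vocab = {
    "Trần_@@": 1359,
    "văn_@@": 16398,
    "Thời": 4834,
    "ời": 8915,
}


def pieces_to_ids(pieces, vocab, unk_id=UNK_ID):
    """Look up each BPE piece; pieces absent from the vocabulary map to unk_id."""
    return [vocab.get(piece, unk_id) for piece in pieces]


# Assumed segmentation of 'Trần_văn_thời': the middle piece is not in the vocabulary.
lowercase_pieces = ["Trần_@@", "văn_th@@", "ời"]
# Segmentation reported in the thread for the capitalised variant.
capitalised_pieces = ["Trần_@@", "văn_@@", "Thời"]

lowercase_ids = pieces_to_ids(lowercase_pieces, toy_vocab)
capitalised_ids = pieces_to_ids(capitalised_pieces, toy_vocab)

print(lowercase_ids)
print(capitalised_ids)
```

With the real tokenizer, comparing the output of `tokenizer.tokenize(...)` with `tokenizer.convert_ids_to_tokens(...)` should show the same effect: the first lists the pieces produced by the BPE merges, the second shows which of them survived the vocabulary lookup.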