Hi, I'm having a problem with your BPE tokenizer. Given the token 'Trần_văn_thời', I used AutoTokenizer.from_pretrained('vinai/phobert-base', use_fast=False) to convert it into ids and got [1359, 3, 8915], which corresponds to ['Trần@@', '', 'ời']. However, when I changed the token to 'Trần_vănThời', I got [1359, 16398, 4834], which is ['Trần@@', 'văn_@@', 'Thời']. As another example, the token 'Lê_vănTám' is tokenized to [1475, 16398, 6813], which is ['Lê@@', 'văn_@@', 'Tám']. So the result for 'Trần_văn_thời' looks wrong. Could you explain why this happens? Thank you very much.
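For anyone who wants to reproduce this, here is a minimal sketch of how I compare the ids and the pieces. The helper names `ids_and_pieces` and `reproduce` are my own, not part of the library, and passing `add_special_tokens=False` is an assumption about how to get ids without the start and end markers. It needs the `transformers` package and network access to download the vocabulary.

```python
def ids_and_pieces(tokenizer, text):
    """Return (id, piece) pairs for `text`, leaving out special tokens."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    pieces = tokenizer.convert_ids_to_tokens(ids)
    return list(zip(ids, pieces))


def reproduce():
    """Print the ids and BPE pieces for the three tokens from this report."""
    # Imported here so the helper above can be used without transformers installed.
    from transformers import AutoTokenizer

    # Slow (non-fast) tokenizer, as in the report above.
    tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base", use_fast=False)
    for text in ["Trần_văn_thời", "Trần_vănThời", "Lê_vănTám"]:
        print(text, ids_and_pieces(tokenizer, text))
```

Calling `reproduce()` prints one line per token, so the ids and the pieces can be checked side by side.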

VinAIResearch / PhoBERT

BPE Tokenizer #38