VinAIResearch / PhoBERT

PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)
MIT License

[do not merge] Add PhoBertTokenizer support VnCoreNLP #1

Closed: Luvata closed this 4 years ago

Luvata commented 4 years ago

I've been experimenting with PhoBERT today, and the results on my tasks are pretty good. I'm opening this PR to share a hacky way to run the VnCoreNLP tokenizer with PhoBERT, so that you can get embedding vectors directly through fairseq or transformers. Set vncore=False if your input is already tokenized by VnCoreNLP.

from hacky_phobert_tokenizer import PhoBertTokenizer

# vncore=False: skip VnCoreNLP and treat the input as already word-segmented
tokenizer = PhoBertTokenizer(vncore=False)
sentence = "Tôi là sinh viên trường đại học Công nghệ"

tokens = tokenizer.encode(sentence)  # tensor([   0,  218,    8,  418, 1430,  212, 2919,  222, 3344, 5116,    2])
print(tokenizer.decode(tokens, remove_underscore=False))  #  Tôi là sinh viên trường đại học Công nghệ

tokenizer.vncore = True  # using VnCoreNLP word tokenizer
tokens = tokenizer.encode(sentence)  # tensor([   0,  218,    8,  649,  212,  956, 2413,    2])
print(tokenizer.decode(tokens, remove_underscore=False))  #  Tôi là sinh_viên trường đại_học Công_nghệ
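
To make the difference between the two modes concrete, here is a minimal sketch of what word segmentation does to the text before it reaches the PhoBERT vocabulary. It does not call VnCoreNLP or the tokenizer from this PR: `COMPOUNDS`, `segment` and `remove_underscores` are hypothetical names, and the three-entry lexicon is hand-written for this one sentence only.

```python
# Hypothetical stand-in for a word segmenter. VnCoreNLP is NOT called here;
# this tiny lexicon only covers the example sentence.
COMPOUNDS = {("sinh", "viên"), ("đại", "học"), ("Công", "nghệ")}


def segment(sentence):
    """Join the syllables of known two-syllable words with an underscore."""
    syllables = sentence.split()
    words = []
    i = 0
    while i < len(syllables):
        pair = tuple(syllables[i:i + 2])
        if pair in COMPOUNDS:
            words.append("_".join(pair))
            i += 2
        else:
            words.append(syllables[i])
            i += 1
    return " ".join(words)


def remove_underscores(text):
    """Undo the joining, similar in spirit to decoding with underscores removed."""
    return text.replace("_", " ")


sentence = "Tôi là sinh viên trường đại học Công nghệ"
segmented = segment(sentence)
print(segmented)
print(remove_underscores(segmented))
```

The sentence has nine syllables but only six word-level units after segmentation, which is why the two `encode` calls above return id sequences of different lengths.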