I've been experimenting with PhoBERT today, and the results on my tasks are pretty good.

Opening this PR for a hacky way to run the VnCoreNLP word tokenizer with PhoBERT, so you can then get embedding vectors directly via fairseq or transformers. Set `vncore=False` if your input is already tokenized by VnCoreNLP:
```python
from hacky_phobert_tokenizer import PhoBertTokenizer

tokenizer = PhoBertTokenizer(vncore=False)
sentence = "Tôi là sinh viên trường đại học Công nghệ"

tokens = tokenizer.encode(sentence)
# tensor([   0,  218,    8,  418, 1430,  212, 2919,  222, 3344, 5116,    2])
print(tokenizer.decode(tokens, remove_underscore=False))
# Tôi là sinh viên trường đại học Công nghệ

tokenizer.vncore = True  # switch to the VnCoreNLP word tokenizer
tokens = tokenizer.encode(sentence)
# tensor([   0,  218,    8,  649,  212,  956, 2413,    2])
print(tokenizer.decode(tokens, remove_underscore=False))
# Tôi là sinh_viên trường đại_học Công_nghệ
```
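And if you want the embedding vectors themselves, here's a minimal sketch of feeding the encoded IDs into PhoBERT via transformers. It assumes the public `vinai/phobert-base` checkpoint, which is not part of this PR:

```python
import torch
from transformers import AutoModel

# Assumption: the public "vinai/phobert-base" checkpoint; swap in your own
# weights if you are using a different PhoBERT variant.
phobert = AutoModel.from_pretrained("vinai/phobert-base")

# `tokens` is the tensor produced by tokenizer.encode(...) above.
input_ids = tokens.unsqueeze(0)  # add a batch dimension: (1, seq_len)
with torch.no_grad():
    outputs = phobert(input_ids)

# Depending on your transformers version the output is a tuple or a ModelOutput;
# in both cases the first element holds the per-token embeddings.
embeddings = outputs[0]  # shape: (1, seq_len, 768) for phobert-base
```

The fairseq route is analogous: load the checkpoint with `RobertaModel.from_pretrained` and call `extract_features` on the batched token tensor.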