SKTBrain / KoBERT

Korean BERT pre-trained cased (KoBERT)
Apache License 2.0

Tokenizing result doesn't look good. #56

Closed. yw0nam closed this issue 3 years ago

yw0nam commented 3 years ago

Here is my code.

import torch
import numpy as np
import pandas as pd

from gluonnlp.data import SentencepieceTokenizer
from kobert.pytorch_kobert import get_pytorch_kobert_model
from kobert.utils import get_tokenizer

# Download the KoBERT SentencePiece model and build the tokenizer.
tok_path = get_tokenizer()
sp = SentencepieceTokenizer(tok_path)

# 'train' is a pandas DataFrame loaded elsewhere (not shown).

print(train['중식메뉴_processed'][0])
--output: 쌀밥/잡곡밥 오징어찌개 쇠불고기 계란찜 청포묵무침

print(sp(train['중식메뉴_processed'][0]))
--output: ['▁', '쌀', '밥', '/', '잡', '곡', '밥', '▁오', '징', '어', '찌', '개', '▁', 
'쇠', '불', '고', '기', '▁계', '란', '찜', '▁청', '포', '묵', '무', '침']

It doesn't look great. Is there any way to improve the result?

Thanks.
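
For reference, two quick checks can put this output in context: whether whole menu words are in the released vocabulary at all, and whether the pieces still reconstruct the original sentence. A minimal sketch, assuming sp is the SentencepieceTokenizer from the snippet above and that get_pytorch_kobert_model() returns (model, vocab) with a gluonnlp-style token_to_idx mapping, as in the KoBERT README:

from kobert.pytorch_kobert import get_pytorch_kobert_model

# Assumption: as in the KoBERT README, this returns (model, vocab),
# where vocab exposes a token_to_idx mapping.
_, vocab = get_pytorch_kobert_model()

text = '쌀밥/잡곡밥 오징어찌개 쇠불고기 계란찜 청포묵무침'
pieces = sp(text)

# Whole menu words are probably not in the vocabulary, which is why
# they fall back to syllable-level pieces.
print('쌀밥' in vocab.token_to_idx)  # likely False
print('▁오' in vocab.token_to_idx)   # likely True, since the tokenizer emitted it

# SentencePiece segmentation is lossless: joining the pieces and mapping
# '▁' back to spaces recovers the original sentence.
restored = ''.join(pieces).replace('▁', ' ').strip()
print(restored == text)  # True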

haven-jeon commented 3 years ago

Tokenization quality is hard to judge by eye. The tokenizer was trained on Wikipedia documents with the SentencePiece algorithm and a vocabulary size of 8k, taking word frequency and language-model score into account. First, it seems more appropriate to raise the issue together with a concrete way to measure the tokenizer's performance: numerically, what is the performance of the KoBERT tokenizer?
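
One concrete number sometimes used for this is subword fertility: the average number of pieces the tokenizer produces per whitespace-separated word (lower generally means less fragmentation for a fixed vocabulary size). A minimal sketch, assuming sp is the SentencepieceTokenizer built earlier in the thread and sentences is any list of Korean strings chosen for evaluation:

def subword_fertility(tokenize, sentences):
    # Average number of subword pieces per whitespace-separated word.
    n_words, n_pieces = 0, 0
    for sent in sentences:
        n_words += len(sent.split())
        n_pieces += len(tokenize(sent))
    return n_pieces / max(n_words, 1)

# Hypothetical evaluation set; the menu line above gives 25 pieces / 5 words = 5.0.
sentences = ['쌀밥/잡곡밥 오징어찌개 쇠불고기 계란찜 청포묵무침']
print(subword_fertility(sp, sentences))

Comparing this figure against another tokenizer (for example, a SentencePiece model trained on in-domain text with a larger vocabulary) would give a numeric basis for the comparison.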

haven-jeon commented 3 years ago

Since there has been no reply, I am closing the issue.