SKT-AI / KoBART

Korean BART

Dropping <unk> token #6

haven-jeon closed this issue 3 years ago

haven-jeon commented 3 years ago

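There is a bug where the tokenizer silently drops tokens, as shown below. The characters 헣 and ㉿ are missing from the output.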

>>> from kobart import get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> kobart_tokenizer.tokenize("ab헣㉿cde")
['▁', 'ab', 'c', 'd', 'e']

haven-jeon commented 3 years ago

Fixed in commit 6b6753fbfa0197fa418c3600e3f8aeb43d1675ca. Out-of-vocabulary characters are now emitted as <unk> instead of being dropped:

>>> from kobart import get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
using cached model
>>> kobart_tokenizer.tokenize("ab헣㉿cde")
['▁', 'ab', '<unk>', '<unk>', 'c', 'd', 'e']
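
To guard against this kind of regression, a small check can compare the input text with the token list and report characters that are neither reproduced by a token nor covered by an `<unk>`. The following is only a sketch: `find_dropped_chars` is a hypothetical helper, not part of KoBART, and it assumes that `▁` is the whitespace marker and that an `<unk>` stands for every character up to the next token that can be matched in the text. It uses the two token lists from this thread as literal inputs, so it runs without `kobart` installed.

```python
def find_dropped_chars(text, tokens, unk_token="<unk>", space_marker="▁"):
    """Return non-whitespace characters of `text` that no token accounts for.

    Hypothetical helper (not part of KoBART). Assumes an unk token covers
    every character up to the next token that can be aligned with the text.
    """
    dropped = []
    pos = 0                # index of the next unconsumed character in text
    unk_pending = False    # True if an unk token was seen since the last match

    def collect(gap):
        # Characters skipped without an unk to account for them were dropped.
        if not unk_pending:
            dropped.extend(ch for ch in gap if not ch.isspace())

    for tok in tokens:
        if tok == unk_token:
            unk_pending = True
            continue
        piece = tok.replace(space_marker, "")
        if not piece:
            continue       # token was only a whitespace marker
        idx = text.find(piece, pos)
        if idx == -1:
            continue       # cannot align this token; leave position unchanged
        collect(text[pos:idx])
        pos = idx + len(piece)
        unk_pending = False

    collect(text[pos:])    # characters after the last matched token
    return dropped


text = "ab헣㉿cde"
before_fix = ['▁', 'ab', 'c', 'd', 'e']
after_fix = ['▁', 'ab', '<unk>', '<unk>', 'c', 'd', 'e']

print(find_dropped_chars(text, before_fix))
print(find_dropped_chars(text, after_fix))
```

With the real tokenizer, the same check could be applied to `kobart_tokenizer.tokenize(text)` in a unit test; an empty result means nothing was silently dropped.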