WorksApplications / SudachiDict

A lexicon for Sudachi

Contains many hangeul terms in notcore_lex.csv #36

Open hanya opened 3 years ago

hanya commented 3 years ago

Some Hangeul terms can be found in the notcore_lex.csv file, such as the following:

전범국,4785,4785,22000,전범국,名詞,固有名詞,一般,*,*,*,センパンコク,戦犯国,*,A,*,*,*,*
전지충이,4785,4785,22000,전지충이,名詞,固有名詞,一般,*,*,*,チョンジチュンイ,デンヂムシ,*,A,*,*,*,*
전툴라,4785,4785,22000,전툴라,名詞,固有名詞,一般,*,*,*,チョントゥラ,チョントゥラ,*,A,*,*,*,*

Were they included intentionally?

sakamoto-mi commented 3 years ago

Thank you for your inquiry.

Three types of words are registered in the Sudachi dictionary:

・words from UniDic
・words from NEologd
・words we collected

The Hangeul terms came from NEologd. So far we have not scrutinized the UniDic and NEologd words in particular. Looking at the registered Hangeul terms, most of them are Pokemon names. Since Hangeul terms are written in katakana in Japanese sentences, we are considering removing them.
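
For anyone who wants to check their own copy of the lexicon in the meantime, here is a minimal sketch of how such entries could be separated out. It is not part of the Sudachi tooling. It assumes the surface form is the first CSV column, as in the rows quoted above, and that matching the three Unicode Hangul ranges listed in the code is enough. `split_hangul_rows` is a hypothetical helper name, and the katakana row in the sample is made up for illustration.

```python
import csv
import re

# Hangul Syllables, Hangul Jamo, and Hangul Compatibility Jamo blocks.
HANGUL = re.compile(r"[\uac00-\ud7a3\u1100-\u11ff\u3130-\u318f]")


def split_hangul_rows(lines):
    """Split lexicon rows into (kept, dropped).

    A row is dropped when its surface form (assumed to be column 0)
    contains at least one Hangul character.
    """
    kept, dropped = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if HANGUL.search(row[0]):
            dropped.append(row)
        else:
            kept.append(row)
    return kept, dropped


# The three rows quoted in this issue, plus one hypothetical katakana row.
sample = """\
전범국,4785,4785,22000,전범국,名詞,固有名詞,一般,*,*,*,センパンコク,戦犯国,*,A,*,*,*,*
전지충이,4785,4785,22000,전지충이,名詞,固有名詞,一般,*,*,*,チョンジチュンイ,デンヂムシ,*,A,*,*,*,*
전툴라,4785,4785,22000,전툴라,名詞,固有名詞,一般,*,*,*,チョントゥラ,チョントゥラ,*,A,*,*,*,*
ピカチュウ,4785,4785,22000,ピカチュウ,名詞,固有名詞,一般,*,*,*,ピカチュウ,ピカチュウ,*,A,*,*,*,*
"""

kept, dropped = split_hangul_rows(sample.splitlines())
print(len(dropped), "rows with a Hangul surface form;", len(kept), "rows kept")
```

To run this against the real file, the same function could be given an open file handle instead, for example `open("notcore_lex.csv", encoding="utf-8", newline="")`, assuming the file is UTF-8 encoded.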