WorksApplications / SudachiDict

A lexicon for Sudachi

core_lex.csv and notcore_lex.csv have \u**** characters #35

Closed utuhiro78 closed 3 years ago

utuhiro78 commented 3 years ago

Hello,

core_lex.csv and notcore_lex.csv contain literal \u**** escape sequences. I found them with ripgrep on Arch Linux:

rg '\\u' core_lex.csv > core_lex_broken_entries.txt
rg '\\u' notcore_lex.csv > notcore_lex_broken_entries.txt

Examples:

# core_lex_broken_entries.txt
納付書・領収\u0028納付受託\u0029証書,5133,5146,32767,納付書・領収\u0028納付受託\u0029証書,名詞,普通名詞,一般,*,*,*,ノウフショ・リョウシュウ\u0028ノウフジュタク\u0029ショウショ,納付書・領収\u0028納付受託\u0029証書,*,C,617408/506627/268971/747421/784703/617408/338603/784704/680506,1462768/268971/747421/784703/617408/338603/784704/680506,1462768/268971/747421/784703/617408/338603/784704/680506,*

# notcore_lex_broken_entries.txt
バジルドライ,5144,5671,5157,バジルドライ,名詞,固有名詞,一般,*,*,*,バジルドライ,バジル\u0028ドライ\u0029,*,C,233848/227036,233848/227036,233848/227036,*

Are these entries OK, or are they broken?

Thank you for providing such a large dictionary.

kazuma-t commented 3 years ago

These expressions are decoded when a dictionary is built. https://github.com/WorksApplications/Sudachi/blob/c4a363ad1a092892d79e43475aefcb4105d18d64/src/main/java/com/worksap/nlp/sudachi/dictionary/DictionaryBuilder.java#L403
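
For readers who want to inspect the CSV files before building, here is a minimal Python sketch of this kind of decoding. It is not the Sudachi implementation (the linked Java source is authoritative); the function name `decode_unicode_escapes` is hypothetical, and the sketch assumes only the four-hex-digit form seen in the examples above, where \u0028 and \u0029 are the ASCII parentheses.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits, e.g. \u0028.
# Assumption: only this four-digit form needs handling, as in the thread's examples.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(text: str) -> str:
    # Hypothetical helper: replace each escape with the character it names.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Normalized-form field from the notcore_lex.csv example above.
field = r"バジル\u0028ドライ\u0029"
print(decode_unicode_escapes(field))  # prints バジル(ドライ)
```

Text without escapes passes through unchanged, so the helper can be applied to every field of a row.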

utuhiro78 commented 3 years ago

> These expressions are decoded when a dictionary is built.

Thanks! I didn't know the dictionary needed to be built.