JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Kanji check project #119

Open razasyedh opened 5 months ago

razasyedh commented 5 months ago

A while back I cross-checked the readings of our entries against Daijirin (and later GG5 & Meikyō). That gave me the idea of also checking for possibly incorrect kanji, so I matched on readings that only show up once.

Here are the 923 results where the kanji used for these readings differ, with the JMdict kanji followed by the Daijirin one:

mismatches.txt

As expected, false positives abound. There are also many cases of differing okurigana usage, as well as cases where substitute kanji are used.

I'll be working through the list over time, but I'm mentioning it here for context and in case there are any concerns. The issue can be closed after a short while if nothing major comes up.
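For anyone who wants to picture the matching step, here is a rough sketch of the idea, not the actual script used. It assumes each dictionary has already been reduced to (kanji, reading) pairs and that "only show up once" means the reading maps to a single kanji form within each source. The function names and the toy data are made up for illustration.

```python
from collections import defaultdict

def unique_reading_index(entries):
    """Map each reading that has exactly one kanji form to that form."""
    by_reading = defaultdict(set)
    for kanji, reading in entries:
        by_reading[reading].add(kanji)
    return {r: next(iter(forms)) for r, forms in by_reading.items() if len(forms) == 1}

def kanji_mismatches(ours, theirs):
    """Return (reading, our_kanji, their_kanji) where both sides have a
    unique kanji form for the reading but the forms differ."""
    a = unique_reading_index(ours)
    b = unique_reading_index(theirs)
    shared = a.keys() & b.keys()
    return sorted((r, a[r], b[r]) for r in shared if a[r] != b[r])

# Toy data (hypothetical, not taken from either dictionary dump).
OURS = [("橋", "はし"), ("箸", "はし"), ("取り扱い", "とりあつかい"), ("辞書", "じしょ")]
THEIRS = [("橋", "はし"), ("取扱い", "とりあつかい"), ("辞書", "じしょ")]

for reading, ours_kanji, theirs_kanji in kanji_mismatches(OURS, THEIRS):
    print(reading, ours_kanji, theirs_kanji)
```

In the toy data, はし is skipped because it has two kanji forms on one side, and とりあつかい is reported even though it is only an okurigana difference, which is the kind of false positive mentioned above.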

JMdictProject commented 5 months ago

Interesting, but obviously needs some work. The first 4 were false positives. The fifth was useful.

razasyedh commented 5 months ago

Agreed, there are tons of homonyms, so this naive approach won't be accurate (but we'll see how fruitful it ends up being).

In the future, I want to run our entries through MeCab/UniDic to see what readings it would expect, but I will likely have to somehow account for rendaku, gemination, etc. Using semantic information from it, or even from Kanjidic, would be more precise.
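As a rough illustration of what accounting for rendaku and gemination could look like, here is a minimal sketch. It does not call MeCab at all. It assumes the analyzer's per-component readings are already available as hiragana strings, and it deliberately over-generates candidates rather than modelling when the sound changes actually apply. All names are hypothetical.

```python
# Unvoiced kana -> voiced counterpart (rendaku on the first kana of a component).
RENDAKU = dict(zip("かきくけこさしすせそたちつてとはひふへほ",
                   "がぎぐげござじずぜぞだぢづでどばびぶべぼ"))
# h-row -> p-row, which often follows a geminated っ (e.g. いち + ほん).
HANDAKU = dict(zip("はひふへほ", "ぱぴぷぺぽ"))
# Final kana that may collapse into っ before an unvoiced consonant.
SOKUON_ENDINGS = "つくちき"

def join_variants(left, right):
    """Return plausible joins of two non-empty component readings."""
    out = {left + right}
    head, tail = right[0], right[1:]
    if head in RENDAKU:
        out.add(left + RENDAKU[head] + tail)          # rendaku
        if left[-1] in SOKUON_ENDINGS:
            geminated = left[:-1] + "っ"               # gemination
            out.add(geminated + right)
            if head in HANDAKU:
                out.add(geminated + HANDAKU[head] + tail)
    return out

def reading_variants(parts):
    """Fold join_variants over all components, left to right."""
    variants = {parts[0]}
    for part in parts[1:]:
        variants = {v for left in variants for v in join_variants(left, part)}
    return variants

def reading_is_plausible(actual, parts):
    """True if the entry's reading is among the generated candidates."""
    return actual in reading_variants(parts)

print(reading_is_plausible("てがみ", ["て", "かみ"]))
print(reading_is_plausible("がっこう", ["がく", "こう"]))
```

Because the candidate set is loose, a check like this would only flag readings that cannot be explained by either sound change. It would not confirm that a reading is correct.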