JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Questionable yojijukugo #72

Closed (stephenmk closed 2 years ago)

stephenmk commented 2 years ago

I've put together a list of JMdict entries which contain the [yoji] miscellaneous tag on one or more senses but do not contain any surface forms that can be found in jitenon's yoji dictionary.

The list: https://gist.github.com/stephenmk/0dcb1318ec60bb35045e75b062d74be4
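A minimal sketch of how a list like this could be generated, not necessarily how the one above was produced. It assumes the JMdict XML layout (`entry`, `ent_seq`, `keb`, `reb`, `sense`, `misc`), treats both kanji and reading elements as surface forms, and takes the reference headwords as a plain set. The function name, the sample XML, the `0000001` sequence number, and the literal `yoji` tag text are all illustrative assumptions; the real file may expand the tag entity to different text, hence the `yoji_tag` parameter.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample; "0000001" is a placeholder sequence number.
SAMPLE_XML = """<JMdict>
<entry><ent_seq>1307250</ent_seq>
  <k_ele><keb>四捨五入</keb></k_ele>
  <r_ele><reb>ししゃごにゅう</reb></r_ele>
  <sense><misc>yoji</misc><gloss>rounding</gloss></sense>
</entry>
<entry><ent_seq>0000001</ent_seq>
  <k_ele><keb>一石二鳥</keb></k_ele>
  <r_ele><reb>いっせきにちょう</reb></r_ele>
  <sense><misc>yoji</misc><gloss>killing two birds with one stone</gloss></sense>
</entry>
</JMdict>"""

# Stand-in for the headwords of a reference yoji dictionary.
REFERENCE_HEADWORDS = {"一石二鳥"}


def unmatched_yoji_entries(jmdict_xml, reference_headwords, yoji_tag="yoji"):
    """Return (sequence, surface forms) for yoji-tagged entries with no
    surface form present in the reference headword set."""
    root = ET.fromstring(jmdict_xml)
    unmatched = []
    for entry in root.iter("entry"):
        # An entry qualifies if at least one of its senses carries the tag.
        tags = {misc.text for misc in entry.iter("misc")}
        if yoji_tag not in tags:
            continue
        # Assumption: surface forms are the kanji forms plus the readings.
        forms = [e.text for e in entry.iter("keb")]
        forms += [e.text for e in entry.iter("reb")]
        if not any(form in reference_headwords for form in forms):
            unmatched.append((entry.findtext("ent_seq"), forms))
    return unmatched


for seq, forms in unmatched_yoji_entries(SAMPLE_XML, REFERENCE_HEADWORDS):
    print(seq, forms[0])
```

Keeping the reference headwords in a set makes each lookup constant-time, so the cost is dominated by one pass over the dictionary.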

Goo Jisho also hosts content from Gakken- and Shinmeikai-branded yoji dictionaries. I have some data that was scraped from these online dictionaries about a year ago, and I was also unable to find any of the above surface forms in either dataset. I can't say with certainty that these datasets are comprehensive, however.

The list is unfortunately pretty long. At least 1138 of our 2730 yoji entries do not contain any surface form that can be found in the jitenon dictionary. I'm not sure we'd want to remove the yoji tag from all of them, but there are far too many to review individually. So there may not be much we can do with this information.

There are six entries in the list which have priority tags, so maybe we should at least consider removing those yoji tags:

| Sequence | Surface form |
|---|---|
| 1232400 | 拒絶反応 |
| 1307250 | 四捨五入 |
| 1321540 | 実力行使 |
| 1595050 | 暑中見舞 |
| 1703710 | 専守防衛 |
| 2029860 | 意思疎通 |

JMdictProject commented 2 years ago

I think the best thing is to remove the [yoji] tag from all 1138. I sampled a dozen or so, and as expected the tags were added because the terms were in Kanji Haitani's yoji list. That list turned out to be rather, er, flawed. I can do the tag removal as a bulk-edit process, so I'll add the task to my "get a roundtuit" list.

robinjmdict commented 2 years ago

Thanks for generating the list, Stephen. Good thinking to use the jitenon site.

> I think the best thing is to remove the [yoji] tag from all 1138.

I agree.

I was able to find a few that are included in other yojijukugo dictionaries or online lists (e.g. 才気縦横, 翻然大悟, 文武不岐) but it's clear that the vast majority should not have the tag.

JMdictProject commented 2 years ago

Yes, many thanks to Stephen for that list. I have run the update now (after removing from the list the 3 that Robin mentioned). I'll close the issue for now.
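For illustration only, a bulk removal of this kind could be sketched against an XML export as below. This is not the process actually used for the update, which is not described in this thread. The function name, the sample XML, the `0000002` sequence number, and the literal `yoji` tag text are assumptions; the sketch removes the tag from every targeted entry unless one of its kanji forms is in the keep set.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample; "0000002" is a placeholder sequence number.
TAGGED_XML = """<JMdict>
<entry><ent_seq>1307250</ent_seq>
  <k_ele><keb>四捨五入</keb></k_ele>
  <sense><misc>yoji</misc><gloss>rounding</gloss></sense>
</entry>
<entry><ent_seq>0000002</ent_seq>
  <k_ele><keb>才気縦横</keb></k_ele>
  <sense><misc>yoji</misc><gloss>(placeholder gloss)</gloss></sense>
</entry>
</JMdict>"""

TARGET_SEQS = {"1307250", "0000002"}            # entries on the removal list
KEEP_FORMS = {"才気縦横", "翻然大悟", "文武不岐"}  # the three exceptions named above


def strip_yoji_tags(jmdict_xml, target_seqs, keep_forms, yoji_tag="yoji"):
    """Remove the yoji misc tag from targeted entries, skipping any entry
    that has a kanji form in keep_forms. Returns (new_xml, changed_seqs)."""
    root = ET.fromstring(jmdict_xml)
    changed = []
    for entry in root.iter("entry"):
        seq = entry.findtext("ent_seq")
        if seq not in target_seqs:
            continue
        forms = {keb.text for keb in entry.iter("keb")}
        if forms & keep_forms:
            continue  # explicitly excluded from the bulk edit
        removed = False
        for sense in entry.findall("sense"):
            # findall returns a list, so removing children here is safe.
            for misc in sense.findall("misc"):
                if misc.text == yoji_tag:
                    sense.remove(misc)
                    removed = True
        if removed:
            changed.append(seq)
    return ET.tostring(root, encoding="unicode"), changed


new_xml, changed = strip_yoji_tags(TAGGED_XML, TARGET_SEQS, KEEP_FORMS)
print(changed)
```

Returning the list of changed sequence numbers makes it easy to compare the result against the original list before committing anything.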