JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Words in recently published dictionaries that are not included in JMdict #101

Open stephenmk opened 9 months ago

stephenmk commented 9 months ago

I have compiled three lists of words that are not included[^1] in JMdict.

[^1]: I considered a word to be "not included" in JMdict if none of its readings or surface forms could be found in today's JMdict file, excluding the JMnedict entries.

  1. Words from Iwanami Kokugo Jiten 8th edition (2019): iwakoku8.csv
  2. Words from Sanseido Kokugo Jiten 8th edition (2021): sankoku8.csv (updated 2023/09/23)
  3. Words from Shinmeikai Kokugo Jiten 8th edition (2020): smk8.csv
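
As an illustration only, the "not included" test from footnote 1 could be sketched like this. It assumes `jmdict_forms` is a set of every kanji and reading form already extracted from the JMdict file (JMnedict excluded); the function and parameter names are hypothetical and not taken from the actual script.

```python
def is_missing_from_jmdict(readings, surfaces, jmdict_forms):
    """Return True if none of the readings or surface forms occur in JMdict.

    jmdict_forms is assumed to be a set of all kanji and reading forms
    extracted from the JMdict file beforehand (hypothetical input).
    """
    return jmdict_forms.isdisjoint(readings) and jmdict_forms.isdisjoint(surfaces)


# Tiny made-up stand-in for the real set of JMdict forms
forms = {"時々", "ときどき", "叫喚"}
print(is_missing_from_jmdict(["きょうかんじごく"], ["叫喚地獄"], forms))
print(is_missing_from_jmdict(["ときどき"], ["時時"], forms))
```

Note that a single shared reading is enough to count a word as included, so homophones of existing JMdict entries would be filtered out under this definition.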

Given the focused nature of these three dictionaries (compared to larger dictionaries and encyclopedias) and their recent publication dates, I think these words are of high interest to the JMdict project. Iwakoku is known for being particularly conservative about its selections, and it advertises itself as containing only words that will be relevant for the next 100 years. Sankoku is well known for its focus on contemporary language.

Each list is a CSV file with seven columns.

| col. | name | description |
| --- | --- | --- |
| 1 | eid | unique identifier for the entry |
| 2 | type | entry category (PhraseEntry, KanjiEntry, ChildEntry[^2], etc.) |
| 3 | reading | semicolon-separated list of readings |
| 4 | surface | semicolon-separated list of surface forms |
| 5 | smk_eid | semicolon-separated eids of overlapping[^3] smk8 entries |
| 6 | san_eid | semicolon-separated eids of overlapping sankoku8 entries |
| 7 | iwa_eid | semicolon-separated eids of overlapping iwakoku8 entries |

[^2]: A compound word. For example, iwakoku8 has 叫喚地獄 as a child entry of 叫喚.

[^3]: Entries are considered to "overlap" if they share one or more readings or surface forms. The smk_eid column is empty in the smk8 file, and likewise for the other files. In other words, I didn't allow a dictionary to overlap with itself.
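
A minimal sketch of reading one of these files and testing two entries for overlap, for anyone who wants to work with the data. It assumes the files have no header row, and the eid values in the sample are made up for illustration. The overlap test is one reading of footnote 3 (a shared reading or a shared surface form); the actual script may differ.

```python
import csv
import io

COLUMNS = ["eid", "type", "reading", "surface", "smk_eid", "san_eid", "iwa_eid"]


def split_list(cell):
    """Split a semicolon-separated cell; an empty cell becomes an empty list."""
    return [item for item in cell.split(";") if item]


def read_entries(text):
    """Parse CSV text into dicts, splitting the five list-valued columns."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        record = dict(zip(COLUMNS, row))
        for name in COLUMNS[2:]:
            record[name] = split_list(record[name])
        entries.append(record)
    return entries


def overlaps(a, b):
    """True if two entries share at least one reading or one surface form."""
    shared_reading = not set(a["reading"]).isdisjoint(b["reading"])
    shared_surface = not set(a["surface"]).isdisjoint(b["surface"])
    return shared_reading or shared_surface


# Hypothetical row: the eid formats here are invented for the example
sample = "iwa1,ChildEntry,きょうかんじごく,叫喚地獄,smk7;smk8,,\n"
entry = read_entries(sample)[0]
print(entry["type"], entry["surface"], entry["smk_eid"], entry["iwa_eid"])
```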

Surface form methodology

Not all of the surface forms listed in the files actually appear in their respective dictionaries. For example, iwakoku8 only has "搔い繕う," but I added "掻い繕う" to its list of surface forms by using a small list of kanji variations. These additions help when searching for overlaps with other dictionaries and will also return better results from the n-gram counter.
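
The variant expansion could look roughly like the sketch below. The one-entry `KANJI_VARIANTS` table only covers the example above and stands in for the real list of kanji variations, which isn't reproduced here; the function name is hypothetical.

```python
from itertools import product

# Illustrative only: maps a kanji to a variant form.
# The real list of kanji variations is not shown in this thread.
KANJI_VARIANTS = {"搔": "掻"}


def expand_variants(surface, variants=KANJI_VARIANTS):
    """Return the surface form plus every combination of variant substitutions."""
    options = [(ch, variants[ch]) if ch in variants else (ch,) for ch in surface]
    return ["".join(combo) for combo in product(*options)]


print(expand_variants("搔い繕う"))
```

The original form always comes first in the result, so the dictionary's own spelling stays identifiable.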

Similarly, I added iteration marks (々) to surface forms with repeated kanji. I did this because iwakoku8 doesn't use these marks ("時時"), but smk8 and sankoku8 do use them ("時々"). Adding these marks helps find overlaps, but there are some unnatural additions to watch out for (like "平々坦坦" in the entry for "平平坦坦").
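
As a rough sketch of how such forms can be generated (not necessarily how it was actually done): each kanji that repeats the character before it may optionally be swapped for 々, and every combination is kept. The kanji test below only checks the basic CJK Unified Ideographs block, which is a simplifying assumption.

```python
from itertools import product


def is_kanji(ch):
    """Rough check: basic CJK Unified Ideographs block only (an assumption)."""
    return "\u4e00" <= ch <= "\u9fff"


def iteration_mark_variants(surface):
    """Return the form plus every variant with repeated kanji replaced by 々."""
    options = []
    for i, ch in enumerate(surface):
        if i > 0 and ch == surface[i - 1] and is_kanji(ch):
            options.append((ch, "々"))
        else:
            options.append((ch,))
    return ["".join(combo) for combo in product(*options)]


print(iteration_mark_variants("時時"))
print(iteration_mark_variants("平平坦坦"))
```

Generating every combination is what produces partial forms such as "平々坦坦", so the output may need a manual pass to weed out the unnatural ones.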

Data quality

Not all of these words are suitable for JMdict. In particular, "words" from kanji entries (漢和) are of no use, although they are clearly labeled as such in column 2 (type). Words formed simply by attaching the honorific prefix お are probably not of much interest either. Words tagged [on-mim] (onomatopoeic or mimetic) that end in と can probably also be disregarded. And so on.

If we want to rank these words by n-gram counts, I think it should be straightforward to feed the surface forms (column #4) into a lookup tool. We may also choose to prioritize words with the most overlaps with other dictionaries.
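
One possible way to combine the two ideas, sketched under assumptions: `ngram_counts` is a hypothetical dict mapping surface forms to counts obtained from some lookup tool, entries are dicts with the list-valued columns already split, and n-gram count is used as the primary key with overlap count as the tie-breaker. That ordering is an arbitrary choice for the example.

```python
def rank_entries(entries, ngram_counts):
    """Sort entries by best n-gram count, then by number of overlapping dictionaries."""

    def sort_key(entry):
        # Highest count among the entry's surface forms (0 if none were looked up)
        best = max((ngram_counts.get(s, 0) for s in entry["surface"]), default=0)
        # Number of other dictionaries with at least one overlapping entry
        n_dicts = sum(1 for col in ("smk_eid", "san_eid", "iwa_eid") if entry[col])
        return (best, n_dicts)

    return sorted(entries, key=sort_key, reverse=True)


# Made-up entries and counts, purely for illustration
entries = [
    {"eid": "a", "surface": ["X"], "smk_eid": [], "san_eid": [], "iwa_eid": []},
    {"eid": "b", "surface": ["Y", "Z"], "smk_eid": ["s1"], "san_eid": [], "iwa_eid": []},
    {"eid": "c", "surface": ["W"], "smk_eid": ["s2"], "san_eid": ["t1"], "iwa_eid": []},
]
counts = {"X": 10, "Z": 500, "W": 10}
print([e["eid"] for e in rank_entries(entries, counts)])
```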

Statistics

| file name | entry count |
| --- | --- |
| iwakoku8.csv | 3130 |
| sankoku8.csv | 6264 |
| smk8.csv | 5042 |

See Also

Related discussion in an old issue: https://github.com/JMdictProject/JMdictIssues/issues/52

Updates

2023/09/23: Fixed a dozen or so surface forms in the sankoku8 list that included erroneous "=" characters.

stephenmk commented 9 months ago

https://stephenmk.github.io/jmdict-reiwa/expansion.html (warning: it's a big page -- 5.4 MiB -- that may not load well on mobile devices)

I added more detailed part-of-speech info for each entry as well as n-gram counts for about 10,000 surface forms. The information is displayed in three tables (one for each dictionary) in the above link. It's pretty bare at the moment, but I intend to make it at least a little prettier sometime soon.

The entries are grouped by their first part-of-speech tag and sorted by n-gram count in descending order.

I'm planning to set up a job to update the page content daily. If it works correctly, links to JMdictDB should appear in the "JMdict" column as new entries are added.

I didn't gather n-gram counts for kanji entries, long phrase entries, or verbs. For verbs (about 1000 forms), I'm thinking we could use the sum of the n-gram counts of all their inflections (or maybe that's overkill?). I could set up a program to automatically fetch those numbers from the web server, but that's not a high priority at the moment.


Edit: I updated it today (2023/09/29) and it's picking up the newly added entries as expected.


Marcusjmdict commented 9 months ago

God's work, Stephen!

stephenmk commented 8 months ago

I made a Google Sheets version of the data in case people would like to collaborate that way: https://docs.google.com/spreadsheets/d/1ie3ReEV1znjwhPM3Df5rnjiDcOt99nrTnq78PHjvBGs/edit?usp=sharing

Everyone in the edict-jmdict Google Group should have permission to leave comments on cells in the spreadsheet.

I also went ahead and added n-gram counts for the remaining ~2500 forms (verbs and expressions) that I didn't look up earlier.

JMdictProject commented 6 months ago

Just had a quick look at this. Very interesting. If I get a chance I'll run a match on the list of GG5 entries (about 130k). Might get some useful entries.

I see it has:

I'm sure there'll be others like that.

JMdictProject commented 6 months ago

I matched the forms in Stephen's file with the headwords in GG5. There were ~3,300 matches from the ~14,400 lines in the file. I'll see if I can merge them somehow.

I see it picked up several of them multiple times as it appears there are some duplicates, e.g.

In fact, there were only about 2,200 unique matches.

stephenmk commented 6 months ago

> I see it picked up several of them multiple times as it appears there are some duplicates

The full list is a combination of matches from three different dictionaries. So 脇挟む is found three times in the data because it is recorded in all three dictionaries, 話調 matched twice because it is in sankoku and shinmeikai (but not iwakoku), etc. The three-letter prefix of each row ID indicates which dictionary the record belongs to.
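
For collapsing the matches down to unique forms while keeping track of the source dictionaries, something like the sketch below might do. The row IDs are invented; the prefixes "iwa", "san" and "smk" are assumed here from the column names and may not match the real IDs exactly.

```python
from collections import defaultdict


def group_matches(rows):
    """Map each matched form to the set of dictionary prefixes it came from.

    rows is an iterable of (row_id, form) pairs; the first three letters
    of row_id are taken as the dictionary prefix.
    """
    by_form = defaultdict(set)
    for row_id, form in rows:
        by_form[form].add(row_id[:3])
    return dict(by_form)


# Invented row IDs for illustration
matches = [
    ("iwa0001", "脇挟む"),
    ("san0002", "脇挟む"),
    ("smk0003", "脇挟む"),
    ("san0004", "話調"),
    ("smk0005", "話調"),
]
grouped = group_matches(matches)
print(len(matches), "matches,", len(grouped), "unique forms")
```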