ENAMDICT/JMnedict clutter

stephenmk commented 2 years ago

The current version of the JMnedict dictionary for yomichan is somewhat notorious for cluttering users' workspaces with terms. For example, a search for ひろこ pulls up over 30 term-reading pairs.

I'm wondering if everyone would be on-board with an update to this dictionary in which many of these personal name terms are consolidated.

For example, a search for ろはん would bring up a single term listing all of the relevant kanji forms in its glossary (24 of them) instead of 24 individual term-reading pairs. A search for 紗子 would bring up a single term with all possible readings (6 of them) in its glossary instead of 6 different terms.

If this sounds good, I can begin working on updating yomichan-import to produce this new version of the dictionary.

(This issue is technically with yomichan-import, but I'm posting here because it's the more active repo.)

Here's a list of codes currently used in JMnedict. I'm thinking "fem", "given", "masc", "surname", and "unclass" are the relevant categories that should be consolidated, and possibly also "person" and "oth" depending on how they look after I do some research.

So if a particular name belongs to more than one of those categories, then the consolidated term would have one "sense" for each category (with the appropriate tag), and the sense would contain a gloss with a semicolon delimited list of the relevant readings or kanji forms of the name.
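To make the "one sense per category" idea concrete, here is a rough Go sketch. The `nameEntry` and `sense` types and the `sensesForReading` function are hypothetical, the tag assignments in the sample data are invented for illustration, and the real yomichan term format has more fields than shown here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// nameEntry is a hypothetical, simplified view of one JMnedict entry:
// a kanji form, a reading, and its name-type codes (e.g. "fem", "surname").
type nameEntry struct {
	Kanji   string
	Reading string
	Tags    []string
}

// sense is a hypothetical consolidated sense: one tag plus one gloss.
type sense struct {
	Tag   string
	Gloss string
}

// sensesForReading builds one sense per name-type code for the given
// reading. Each gloss is a semicolon-delimited list of kanji forms.
func sensesForReading(entries []nameEntry, reading string) []sense {
	formsByTag := make(map[string][]string)
	for _, e := range entries {
		if e.Reading != reading {
			continue
		}
		for _, tag := range e.Tags {
			formsByTag[tag] = append(formsByTag[tag], e.Kanji)
		}
	}
	tags := make([]string, 0, len(formsByTag))
	for tag := range formsByTag {
		tags = append(tags, tag)
	}
	sort.Strings(tags) // map iteration is random, so sort for stable output
	senses := make([]sense, 0, len(tags))
	for _, tag := range tags {
		senses = append(senses, sense{Tag: tag, Gloss: strings.Join(formsByTag[tag], "; ")})
	}
	return senses
}

func main() {
	// Illustrative sample data; the tags are not taken from JMnedict.
	entries := []nameEntry{
		{"春香", "はるか", []string{"fem"}},
		{"遥", "はるか", []string{"fem", "given"}},
		{"遥香", "はるか", []string{"fem"}},
	}
	for _, s := range sensesForReading(entries, "はるか") {
		fmt.Printf("[%s] %s\n", s.Tag, s.Gloss)
	}
}
```

The same function with `Kanji` and `Reading` swapped would produce the kanji-to-kana direction.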

JMnedict code table

| code | description |
|---|---|
| char | "character" |
| company | "company name" |
| creat | "creature" |
| dei | "deity" |
| doc | "document" |
| ev | "event" |
| fem | "female given name or forename" |
| fict | "fiction" |
| given | "given name or forename, gender not specified" |
| group | "group" |
| leg | "legend" |
| masc | "male given name or forename" |
| myth | "mythology" |
| obj | "object" |
| organization | "organization name" |
| oth | "other" |
| person | "full name of a particular person" |
| place | "place name" |
| product | "product name" |
| relig | "religion" |
| serv | "service" |
| station | "railway station" |
| surname | "family or surname" |
| unclass | "unclassified name" |
| work | "work of art, literature, music, etc. name" |

Thermospore commented 2 years ago

Personally I put jmnedict in a different profile and assigned it a different hotkey, so it wouldn't clutter up my main dictionaries

stephenmk commented 2 years ago

I have also disabled it in my main profile, but I wish I didn't have to. So it's my hope that this update would solve that problem.

stephenmk commented 2 years ago

Here are some mockups of what I'm imagining.

Query for 伊勢原八幡台石器時代住居跡

![ise2](https://user-images.githubusercontent.com/8003332/162091841-ffa359a7-09c5-49ba-a7ba-934c793b616a.png)

Query for いせはらはちまんだいせっきじだいじゅうきょあと

![ise](https://user-images.githubusercontent.com/8003332/162091908-0afb3b94-966a-40ca-ba26-d19a56ff5df9.png)

(The only definition for 伊勢原八幡台石器時代住居跡 in JMdictDB is "Iseharahachimandaisekkijidaijuukyoato")

Query for はるか

![haruka_kana](https://user-images.githubusercontent.com/8003332/162091987-932202f9-1413-4f2c-9d65-154d1f7d7c4d.png)

Query for 春香

![haruka_kanji](https://user-images.githubusercontent.com/8003332/162092040-92b9f2e8-46b1-4dfc-a6d4-9143e8a85eec.png)

What I'm discovering is that JMnedict contains two kinds of entries: those with glosses that merely transcribe the name into Latin characters (generally generic name entries: given names, surnames, etc.), and those that have more details (specific people, famous places, brands, etc.). The former category represents the overwhelming majority of entries. I want to consolidate those entries (as pictured above) while leaving the other entries in the same format as they are in the current yomichan dictionary.

It might also be worthwhile to split these two categories into two dictionary files. I imagine more people would be interested in a lightweight dictionary file with these more specific entries.

So the only challenge here is devising a way to determine whether or not an entry's gloss is merely a transcription of the corresponding kana. I tried using this golang library, but it doesn't seem robust enough to handle many situations (characters with macrons like ō, ん written as n', アイ written as "ay", etc). So I'd need to implement a new comparison tool.
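As a starting point for such a comparison tool, here is a rough Go sketch. It is not the library mentioned above and not the code behind the test dictionaries. It romanizes the kana with a small built-in table, then loosens both sides before comparing: macrons and apostrophes are stripped, "ou" becomes "o", and repeated vowels are collapsed. All names (`romanize`, `normalize`, `isTranscription`) are hypothetical, and cases like アイ written as "ay" are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// Hepburn-style romaji for basic hiragana (this sketch's own table).
var kana = map[string]string{
	"あ": "a", "い": "i", "う": "u", "え": "e", "お": "o",
	"か": "ka", "き": "ki", "く": "ku", "け": "ke", "こ": "ko",
	"さ": "sa", "し": "shi", "す": "su", "せ": "se", "そ": "so",
	"た": "ta", "ち": "chi", "つ": "tsu", "て": "te", "と": "to",
	"な": "na", "に": "ni", "ぬ": "nu", "ね": "ne", "の": "no",
	"は": "ha", "ひ": "hi", "ふ": "fu", "へ": "he", "ほ": "ho",
	"ま": "ma", "み": "mi", "む": "mu", "め": "me", "も": "mo",
	"や": "ya", "ゆ": "yu", "よ": "yo", "わ": "wa", "を": "o", "ん": "n",
	"ら": "ra", "り": "ri", "る": "ru", "れ": "re", "ろ": "ro",
	"が": "ga", "ぎ": "gi", "ぐ": "gu", "げ": "ge", "ご": "go",
	"ざ": "za", "じ": "ji", "ず": "zu", "ぜ": "ze", "ぞ": "zo",
	"だ": "da", "ぢ": "ji", "づ": "zu", "で": "de", "ど": "do",
	"ば": "ba", "び": "bi", "ぶ": "bu", "べ": "be", "ぼ": "bo",
	"ぱ": "pa", "ぴ": "pi", "ぷ": "pu", "ぺ": "pe", "ぽ": "po",
}

// Vowel contributed by a small ゃ/ゅ/ょ following an i-column kana.
var small = map[rune]string{'ゃ': "a", 'ゅ': "u", 'ょ': "o"}

// romanize converts hiragana or katakana to rough Hepburn romaji.
// Characters missing from the table pass through unchanged, so the
// later comparison fails instead of silently matching.
func romanize(s string) string {
	var out []string
	double := false // set after っ: double the next consonant
	for _, r := range s {
		if r >= 'ァ' && r <= 'ヶ' {
			r -= 0x60 // katakana to hiragana
		}
		if r == 'っ' {
			double = true
			continue
		}
		if v, ok := small[r]; ok && len(out) > 0 {
			prev := strings.TrimSuffix(out[len(out)-1], "i")
			if !strings.HasSuffix(prev, "sh") && !strings.HasSuffix(prev, "ch") && !strings.HasSuffix(prev, "j") {
				prev += "y" // ki + ゃ = kya, but shi + ゃ = sha
			}
			out[len(out)-1] = prev + v
			continue
		}
		syl, ok := kana[string(r)]
		switch {
		case !ok && r == 'ー':
			continue // long vowel mark; long vowels are collapsed anyway
		case !ok:
			syl = string(r)
		case double && strings.HasPrefix(syl, "ch"):
			syl = "t" + syl // っち = tchi
		case double:
			syl = syl[:1] + syl
		}
		double = false
		out = append(out, syl)
	}
	return strings.Join(out, "")
}

var loose = strings.NewReplacer(
	"ā", "a", "ī", "i", "ū", "u", "ē", "e", "ō", "o",
	"'", "", "-", "", " ", "",
)

// normalize lowercases, strips macrons and separators, rewrites "ou"
// as "o", and collapses repeated vowels.
func normalize(s string) string {
	s = loose.Replace(strings.ToLower(s))
	s = strings.ReplaceAll(s, "ou", "o")
	var b strings.Builder
	var last rune
	for _, r := range s {
		if r == last && strings.ContainsRune("aiueo", r) {
			continue
		}
		b.WriteRune(r)
		last = r
	}
	return b.String()
}

// isTranscription reports whether gloss looks like a plain romanization
// of the kana reading.
func isTranscription(reading, gloss string) bool {
	return normalize(romanize(reading)) == normalize(gloss)
}

func main() {
	fmt.Println(isTranscription("ようこ", "Yōko"))
	fmt.Println(isTranscription("しんいち", "Shin'ichi"))
	fmt.Println(isTranscription("はるか", "Haruka Station"))
}
```

The loose normalization trades precision for recall: it can treat distinct names such as おの and おおの as equivalent, which seems acceptable when the only question is whether a gloss carries information beyond the reading.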

Thoughts? Does this sound interesting to anyone?

MarvNC commented 2 years ago

I've been interested in this dictionary for a while. Might you have a testing version or something available to try? I wouldn't really mind having some random Latin character names, as it seems it would still be a huge improvement in the amount of clutter.

stephenmk commented 2 years ago

The code I made for this is quite a mess, so I haven't published it anywhere.

As I explained in my post above, my first prototype contained three kinds of entries:

1. The same sort of normal entries that you can find in the current version of JMnedict for Yomichan. I.e., kanji headwords, kana readings, and English-language glosses. These are usually entries for specific people, companies, and organizations.
2. Kanji-to-kana lookups for generic names.
3. Kana-to-kanji lookups for generic names.

I'm not so sure about how useful this third category is. Most of these entries look like a giant mess of kanji.

Example: よしたけ

![yoshitake](https://user-images.githubusercontent.com/8003332/180377976-a731f08b-5401-4b39-a320-41f15477726f.png)

I've uploaded two versions of the test dictionary: one which contains these kana-to-kanji lookups, and one which does not. The former is about 50% larger than the latter, but it doesn't take too much longer to import into Yomichan (in a clean environment with no other dictionaries installed, anyway). So maybe it's not so bad. Let me know what you think.

Full version with kana-to-kanji lookups: jmnedict_2022_07_22_with_kana.zip

Smaller version without the kana lookups: jmnedict_2022_07_22.zip

MarvNC commented 2 years ago

I've been using the full version for a few days, and it works great. No complaints really; it reduces clutter by a lot. I'm not sure if the kana lookups help, but they don't hurt to have. Thanks for creating these dictionaries!

FooSoft / yomichan

ENAMDICT/JMnedict clutter #2111