FooSoft / yomichan-import

External dictionary importer for Yomichan.
https://foosoft.net/projects/yomichan-import/
MIT License

New version of JMnedict (the proper name dictionary) #41

Closed stephenmk closed 1 year ago

stephenmk commented 1 year ago

This pull request is to redesign the format of the JMnedict dictionary for Yomichan. It also includes a fix for a part-of-speech tag problem in non-English versions of JMdict.

New version of JMnedict

Related issue: https://github.com/FooSoft/yomichan/issues/2111

Unlike the new version of JMdict, this redesign does not add new information or use any of Yomichan's new structured content features. It simply redesigns how the information is presented to users.

JMnedict contains a daunting number of entries that surpasses even JMdict. There are generally two types of entries in the file: (1) specific names of people, companies, events, etc., and (2) generic names such as given names and surnames. The latter category far outnumbers the former.

While the entries for specific names often provide useful information and context for a given term, the entries for generic names do not. The glossaries for generic names simply transliterate the term into Latin characters. So for example, the JMnedict entry for おおたに【大谷】 simply contains the gloss "Ootani" along with "place" and "surname" tags.

The problem is that JMnedict contains 44 generic name entries for the kanji 大. This means that any time a Yomichan user searches for a word beginning with 大, Yomichan will also retrieve all 44 generic name entries for 大. This clutters the search results with a large amount of low-quality information.

My suggestion is that we discard all glosses in generic entries with kanji forms. This way we can merge all generic entries sharing the same kanji form into single Yomichan entries.

Example: 尚三郎 (readings are moved to the glossaries for generic kanji terms) ![尚三郎](https://user-images.githubusercontent.com/8003332/216415941-c09f88f4-ffea-4fe2-af0f-2c0425277bed.png)
Example: 山海経 (specific name entries retain glosses) ![山海経](https://user-images.githubusercontent.com/8003332/216412552-e27381a4-44b3-4605-bf6b-23830097a4a5.png)
Example: 大谷海岸駅 (all 44 generic 大 entries merged into one) ![大谷海岸駅](https://user-images.githubusercontent.com/8003332/216411564-16246f1b-4117-40f8-8fc7-5eb3b7bbd3b2.png)
Example: 林佳樹 (gloss is technically a transliteration but is retained because it has a space) ![林佳樹](https://user-images.githubusercontent.com/8003332/216412217-b90947c0-0903-4dea-a120-ae9dd82675d3.png)
Example: じゅりあん (glosses are retained because they are not all transliterations) ![julian](https://user-images.githubusercontent.com/8003332/216412933-5c3f87c2-c3f9-495c-a4a1-8e646b6e8036.png)
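
The merging rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the `NameEntry` type, the precomputed `romaji` field, and the `is_generic` and `merge_generic` helpers are all hypothetical names, and the transliteration check is a simplified stand-in for whatever comparison the importer actually performs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class NameEntry:
    kanji: str           # empty string for kana-only entries
    reading: str
    romaji: str          # assumed to be a precomputed romanization of the reading
    glosses: list[str]


def is_generic(entry: NameEntry) -> bool:
    """Treat an entry as generic when every gloss is a plain transliteration.

    A gloss containing a space (e.g. a full personal name) is retained,
    mirroring the 林佳樹 example.
    """
    return all(
        " " not in gloss and gloss.lower() == entry.romaji.lower()
        for gloss in entry.glosses
    )


def merge_generic(entries: list[NameEntry]):
    """Merge generic entries that share a kanji form; keep all others as-is."""
    readings_by_kanji: dict[str, list[str]] = defaultdict(list)
    kept: list[NameEntry] = []
    for entry in entries:
        if entry.kanji and is_generic(entry):
            # Glosses are discarded; only the reading survives in the merged entry.
            if entry.reading not in readings_by_kanji[entry.kanji]:
                readings_by_kanji[entry.kanji].append(entry.reading)
        else:
            kept.append(entry)
    return dict(readings_by_kanji), kept


sample = [
    NameEntry("大谷", "おおたに", "ootani", ["Ootani"]),
    NameEntry("大谷", "おおや", "ooya", ["Ooya"]),
    NameEntry("林佳樹", "はやしよしき", "hayashiyoshiki", ["Hayashi Yoshiki"]),
    NameEntry("", "じゅりあん", "jurian", ["Julian", "Julien"]),
]
merged, kept = merge_generic(sample)
# merged == {"大谷": ["おおたに", "おおや"]}; the other two entries are kept whole.
```

The sample data is illustrative only. The point of the design is that the number of Yomichan entries for a kanji form no longer grows with the number of generic readings attached to it.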

JMdict: missing part-of-speech tags

I noticed that non-English versions of the new JMdict dictionaries did not have part-of-speech tags, unlike the old versions.

Only English-language senses in JMdict contain part-of-speech tags. The old version of Yomichan-Import took the PoS tags from the final sense in the English version of an entry and applied them to every sense of every other language. For example, 川・かわ has two senses in English JMdict: a noun sense and a suffix sense. Therefore every sense of 川・かわ in every other language was tagged as a suffix.

Instead, I suggest gathering all distinct part-of-speech tags from each English entry and applying them all to each non-English sense. Every non-English sense of 川・かわ will therefore be tagged as both a noun and a suffix. This still isn't ideal, but I think it is at least an improvement on the previous setup.
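
To make the difference between the two behaviors concrete, here is a small sketch. The function names and the list-of-lists representation of senses are assumptions made for illustration; the actual importer is structured differently.

```python
def last_sense_tags(english_senses: list[list[str]]) -> list[str]:
    """Old behavior as described above: only the final English sense's tags."""
    return list(english_senses[-1]) if english_senses else []


def all_distinct_tags(english_senses: list[list[str]]) -> list[str]:
    """Proposed behavior: every distinct tag, in order of first appearance."""
    seen: list[str] = []
    for sense_tags in english_senses:
        for tag in sense_tags:
            if tag not in seen:
                seen.append(tag)
    return seen


# 川・かわ: one noun sense and one suffix sense in the English data.
kawa_senses = [["n"], ["suf"]]

old_tags = last_sense_tags(kawa_senses)    # ["suf"]
new_tags = all_distinct_tags(kawa_senses)  # ["n", "suf"]
```

Under this sketch, `new_tags` would be attached to every non-English sense of the entry, so no sense is mislabeled as suffix-only.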

Test Dictionary Builds

Thermospore commented 1 year ago

nice! yea currently I have jmnedict in its own profile with a different key to trigger it, cos it clutters things up. I'll have to try this out

one potential problem I see is that you can't do a kana -> kanji search for some entries. e.g. if you heard "おおやかいがん" and looked it up, this entry wouldn't show up [image]

hopefully your ime or even just google could help you out in cases like this, but it is a bit of a regression

stephenmk commented 1 year ago

That is doable, but it's a tradeoff between utility and bloat. Adding kana-to-kanji lookups doubles the size of the term database, and I'm not sure if that functionality is actually useful.

I made a version like this last year if you'd like to try installing it and see for yourself: https://github.com/FooSoft/yomichan/issues/2111#issuecomment-1192238540

Example: よしたけ ![yoshitake](https://user-images.githubusercontent.com/8003332/180377976-a731f08b-5401-4b39-a320-41f15477726f.png)

I've been using the version without the kana-to-kanji terms for about six months now and never found myself wishing for that functionality.
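
For reference, the kana-to-kanji lookup being discussed amounts to a reverse index over the merged entries. The sketch below is only an illustration under assumed names (`build_kana_index` and the sample data are hypothetical, not taken from the dictionary build):

```python
from collections import defaultdict


def build_kana_index(merged: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each reading to the kanji forms that can be read that way."""
    index: dict[str, list[str]] = defaultdict(list)
    for kanji, readings in merged.items():
        for reading in readings:
            if kanji not in index[reading]:
                index[reading].append(kanji)
    return dict(index)


# Illustrative sample: two kanji forms sharing one reading.
merged_names = {
    "吉武": ["よしたけ"],
    "吉竹": ["よしたけ", "よしだけ"],
}
kana_index = build_kana_index(merged_names)
extra_terms = len(kana_index)  # one additional kana-keyed term per distinct reading
```

Every key in `kana_index` would have to be stored as an extra term alongside the kanji-keyed entries, which is where the added database size comes from.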

Thermospore commented 1 year ago

another issue I just noticed is if the reading is removed, freq dicts with readings (e.g. bccwj, B長 in my screenshot) wouldn't function anymore [image]

maybe a yomichan change could allow for clean/compacted jmnedict entries while still allowing for kana searches and freq dicts with readings. (might even be some overlap with the changes described in this thread to allow for cleaner / more compact viewing of kanji/kana combinations)

tangentially related: I keep forgetting that modes other than group term-reading pairs exist... is there any reason not to use it? It might be better to just remove the other modes from yomichan and focus on improving grouped mode, instead of trying to finagle grouped mode-esque functionality into the other modes from the dictionary creation end

removing everything but grouped mode would also streamline development / testing / troubleshooting, since you'd have 1 less dimension of modes to worry about. maybe this could use its own thread on the yomichan repo...

thanks for reading, let me know your thoughts on this!

FooSoft commented 1 year ago

Looking good!

stephenmk commented 1 year ago

@FooSoft, thanks again for your time.

@Thermospore, it is indeed an issue that JMnedict contains no frequency information. For example, 若槻 might be read 「わかつき」 the vast majority of the time, but this isn't evident by looking at JMnedict. I actually mentioned this to the JMdict editors last year, although I didn't have any good solutions at the time. You made a good point that the BCCWJ frequency list could be used for this purpose. I just proposed this idea to the editors, and Dr. Breen agrees that it sounds promising.

If and when this frequency information is adapted and added to JMnedict, I can update the Yomichan dictionary to include standard expression + reading terms for names that are included in the BCCWJ list. This will allow frequency lists, pitch accent lists, flashcards, etc., to function normally.

Thermospore commented 1 year ago

> If and when this frequency information is adapted and added to JMnedict, I can update the Yomichan dictionary to include standard expression + reading terms for names that are included in the BCCWJ list. This will allow frequency lists, pitch accent lists, flashcards, etc., to function normally.

thanks for the response, sure that sounds like a good stopgap

next week when I have time, I'll make a thread on the yomichan repo about grouping modes, which would address the core of the issue

basically, I think grouped mode should be default (and various improvements / changes made), and have the other modes just be discontinued / hidden in advanced settings

probably 99% of people using a non grouped mode are just using it because it is default, or because of a feature it has which could just be implemented in grouped mode

the other modes are just holding things back, I think. I don't think grouped mode functionality should have to be finagled into all the modes from the dictionary end: [image]

it should all just be one mode