cessen / kobo_jp_dict

A Japanese-English dictionary builder for Kobo e-readers.
Apache License 2.0

Yomichan support #2

Closed cessen closed 3 years ago

cessen commented 3 years ago

Add support for Yomichan dictionaries, including:

Some basic filtering of the definition text is attempted, to prevent:

  1. Redundant headers from appearing in the definitions (we already have our own entry headers).
  2. Japanese -> English entries from Japanese-Japanese dictionaries (yes, this exists), since that's likely not what people want from a Japanese-Japanese dictionary. If the user wants Japanese-English entries, they can include an actual Japanese-English dictionary.

Rudo2204 commented 3 years ago

Hello, thanks for working on Yomichan dictionary conversion support. There is one feature I want to request before you merge this. Yomichan actually supports special pitch accent dictionaries. I have found a good one with data extracted from the NHK accent dictionary + 大辞林3.0, which you can download from this link. Since this dictionary has significantly more entries than the provided accents.tsv file, I would like to be able to use it as a reference when converting dictionaries with the --pitch_accent flag. Could you please take a look at this? Thank you.

cessen commented 3 years ago

@Rudo2204 I appreciate the request, but I don't think I'm going to implement support for that.

My main reason is that the included accent.tsv file already has over 120,000 entries. (And pitch accent dictionaries aren't like other dictionaries: you're not going to get useful alternative definitions from a different dictionary; it's all pretty much the same data.) For comparison, native adult speakers of most languages have a passive vocabulary of between 20,000 and 50,000 words (depending on which studies you look at), so 120,000 is very likely to cover most words you encounter while reading. Certainly as a second-language learner, anyway.

So I think there's very limited (if any) real utility in an even larger accent dictionary, and I don't really want to spend my time implementing support for something with such limited utility.

I'm also not sure I'd even accept a PR for it. I really want this project to keep its focus on things that provide obvious practical utility to people who are learning Japanese (like myself), rather than chasing some idea of completeness for its own sake. There is definitely still more progress that can be made in that direction, but I don't think a larger pitch accent dictionary would contribute to it.

Having said all of that, I made this project open source for a reason. If you want to create your own fork that goes in a different direction, you're more than welcome to do so. And I would honestly be really happy if you did: having more options out there for people is a good thing! Also, if you just want to use that specific dictionary, I don't think it would be very hard to write a script to convert it into tsv format, so you could give that a try.
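
As a rough illustration of the kind of conversion script mentioned above, here is a minimal Python sketch. It rests on two assumptions that are not confirmed anywhere in this thread: that each pitch entry in the Yomichan dictionary is a three-element JSON array of the form `[term, "pitch", {"reading": ..., "pitches": [{"position": n}, ...]}]`, and that the target tsv uses the columns word, reading, and comma-separated accent positions. The function name `pitch_entries_to_tsv` is hypothetical; check both formats against the real files before relying on it.

```python
import json


def pitch_entries_to_tsv(entries):
    """Convert Yomichan-style pitch metadata entries into TSV lines.

    Assumed input entry layout (not verified against the real dictionary):
        [term, "pitch", {"reading": str, "pitches": [{"position": int}, ...]}]
    Assumed output columns (not verified against the project's tsv file):
        word <TAB> reading <TAB> comma-separated accent positions
    """
    lines = []
    for entry in entries:
        # Skip anything that is not a pitch entry (e.g. frequency metadata).
        if len(entry) != 3 or entry[1] != "pitch":
            continue
        term, _, data = entry
        reading = data.get("reading", "")
        positions = [
            str(pitch["position"])
            for pitch in data.get("pitches", [])
            if "position" in pitch
        ]
        # Entries without any accent position carry no usable data.
        if not positions:
            continue
        lines.append("\t".join([term, reading, ",".join(positions)]))
    return lines


# Small inline sample standing in for the contents of one metadata file.
sample = json.loads(
    '[["言葉", "pitch", {"reading": "ことば", "pitches": [{"position": 3}]}],'
    ' ["橋", "freq", 1200]]'
)

for line in pitch_entries_to_tsv(sample):
    print(line)
```

In practice the entries would be read with `json.load` from each metadata file in the unpacked dictionary archive, and the resulting lines written out to a single tsv file.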

Rudo2204 commented 3 years ago

@cessen Thank you for the quick response. I totally agree that Yomichan pitch accent support would only be for completeness' sake. I will try to poke around in my fork as a programming exercise then. Thanks again for making this project open source so that I can even attempt this.

cessen commented 3 years ago

@Rudo2204 No worries! And I hope my response wasn't too harsh--I just wanted to explain my reasoning. I don't think it's a bad idea at all, just not something that fits my personal goals for this project.

If you want any pointers, feel free to shoot me an email. The code base is honestly kind of one big hack right now (I've just been trying to get things working, rather than architect something nice), so I'd be happy to help you navigate that a bit as my time/energy allows.

Edit: and actually, one thought I had was that it might be useful to split off the dictionary generation part of the code into a separate crate, so that other people could more easily use it to build their own dictionary-building tools. I also have a vague hope to remove the external dependency on the marisa-build executable eventually, though that might require a good bit of work. So if you're looking for ways to make contributions (not sure if you are, and if not, no worries!), either of those would be great. I'll make issues for both of those.