Closed cessen closed 3 years ago
Hello, thanks for working on Yomichan dictionary conversion support. There is one feature I want to request before you merge this. Yomichan actually supports special pitch accent dictionaries. I have found a good one with data extracted from the NHK accent dictionary + 大辞林3.0, which you can download from this link. Since this dictionary has significantly more entries than the provided accents.tsv file, I would like to be able to use it as a reference when converting dictionaries with the --pitch_accent flag. Could you please take a look at this? Thank you.
@Rudo2204 I appreciate the request, but I don't think I'm going to implement support for that.
My main reason is that the included accent.tsv file already has over 120,000 entries. (And pitch accent dictionaries aren't like other dictionaries: you're not going to get useful alternative definitions from a different dictionary, because it's all pretty much the same data.) For comparison, native adult speakers of most languages have a passive vocabulary of between 20,000 and 50,000 words (depending on which studies you look at), so 120,000 entries is very likely to cover most words you encounter while reading. Certainly as a second-language learner, anyway.
So I think there's very limited (if any) real utility in an even larger accent dictionary, and I don't really want to spend my time implementing support for something with such limited utility.
I'm also not sure I'd even accept a PR for it. I really want this project to keep its focus on things that provide obvious practical utility to people who are learning Japanese (like myself), rather than chasing some idea of completeness for its own sake. There is definitely still more progress that can be made in that direction, but I don't think a larger pitch accent dictionary is part of it.
Having said all of that, I made this project open source for a reason. If you want to create your own fork that goes in a different direction, you're more than welcome to do so. And I would honestly be really happy if you did: having more options out there for people is a good thing! Also, if you just want to use that specific dictionary, I don't think it would be very hard to write a script to convert it into tsv format, so you could give that a try.
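As a rough illustration of the conversion script mentioned above, here is a minimal Python sketch. It rests on two assumptions that are not confirmed anywhere in this thread: that the Yomichan pitch accent dictionary is a zip archive containing term_meta_bank_*.json files whose entries look like [term, "pitch", {"reading": ..., "pitches": [{"position": N}]}], and that a three-column layout (word, reading, comma-separated accent positions) is acceptable as output. The column layout this project's accent file actually uses should be checked first and the write call adjusted to match. The function names are hypothetical.

```python
import json
import zipfile


def pitch_rows(zip_file):
    """Yield (term, reading, positions) tuples from a Yomichan pitch accent zip.

    Assumes the term_meta_bank_*.json layout described in the lead-in.
    """
    with zipfile.ZipFile(zip_file) as archive:
        for name in sorted(archive.namelist()):
            base = name.rsplit("/", 1)[-1]
            if not (base.startswith("term_meta_bank_") and base.endswith(".json")):
                continue
            for entry in json.loads(archive.read(name).decode("utf-8")):
                term, mode, data = entry
                # Term meta banks can also hold other data (e.g. frequencies).
                if mode != "pitch":
                    continue
                positions = [str(p["position"]) for p in data.get("pitches", [])]
                yield term, data.get("reading", ""), ",".join(positions)


def convert(zip_file, out):
    """Write one tab-separated line per pitch entry; return the row count.

    The column order here is an assumption, not this project's documented format.
    """
    count = 0
    for term, reading, positions in pitch_rows(zip_file):
        out.write(f"{term}\t{reading}\t{positions}\n")
        count += 1
    return count
```

It could be called with something like convert("pitch_dict.zip", open("converted.tsv", "w", encoding="utf-8")), where both file names are placeholders.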
@cessen Thank you for the quick response. I totally agree that Yomichan pitch accent support would only be for completeness' sake. I will try poking around in my fork as a programming exercise, then. Thanks again for making this project open source so that I can even attempt this.
@Rudo2204 No worries! And I hope my response wasn't too harsh--I just wanted to explain my reasoning. I don't think it's a bad idea at all, just not something that fits my personal goals for this project.
If you want any pointers, feel free to shoot me an email. The code base is honestly kind of one big hack right now (I've just been trying to get things working, rather than architect something nice), so I'd be happy to help you navigate that a bit as my time/energy allows.
Edit: and actually, one thought I had was that it might be useful to split off the dictionary generation part of the code into a separate crate, so that other people could more easily use it to build their own dictionary-building tools. I also have a vague hope to remove the external dependency on the marisa-build executable eventually, though that might require a good bit of work. So if you're looking for ways to make contributions (not sure if you are, and if not, no worries!), either of those would be great. I'll make issues for both of those.
Add support for Yomichan dictionaries, including:
Some basic filtering of the definition text is attempted, to prevent: