Closed Geniusssmit closed 4 years ago
It currently isn't possible, but support can be added. There is a workaround for supporting pitch accents (#61), and a similar approach could be taken for frequency lists.
However, is there a data source available which has this? A new dictionary would have to be created and imported using such data.
Absolutely! I have very good frequency lists; I just want to import them. I can also create them myself. For example, here is a list created by analyzing all Japanese Netflix subtitles: https://mega.nz/#!gTgTDb7b!a1DGu0gk1d1BqAPNY7XQ2nrRvrBUVV1Ql6hH01aKOAA
or that one
@Geniusssmit Out of curiosity, when was that list created? I think Netflix CJK subtitles are currently served as images and require OCR to get the text out of them. I'd be very interested in a way to still download them as text.
February 13, 2019. You can download all Japanese Netflix subs from here: https://mega.nz/#F!hgBW0QID!IsLg3YRSdfBJkjkJtzbDJQ
They are in SRT format, not OCRed.
@Geniusssmit Thanks! I think I'll find some use for those.
I assume that the archive was created by some third party before Netflix switched to image-based subtitles, as the Kodi plugin introduced in this video doesn't work anymore: https://www.youtube.com/watch?v=i2SudOnkiuc. The fork (?) linked in the video has been removed from GitHub, but the author seems to be working on a project that can be used to OCR newer Netflix subtitles: https://github.com/Zarxrax/png2srt.
That's cool. I'm actually interested in how pitch accent and readings would be implemented.
It's in progress, but the main difference is that there is more metadata used to specify the reading for each term/expression. The feature is nearing completion, so after that's done, adding this should be simple.
See: #385
I have created an initial version setting up what you requested based on the pitch accent structure in my frequency-improvements branch, specifically commit https://github.com/toasted-nutbread/yomichan/commit/20de591d410565d064ccf067807b3db1fd8f2064.
For the example in your opening post, the dictionary data would look like this:
[
["方", "freq", {"reading": "ほう", "frequency": 1}],
["方", "freq", {"reading": "かた", "frequency": 2}],
["明日", "freq", {"reading": "あす", "frequency": 3}],
["明日", "freq", {"reading": "あした", "frequency": 4}]
]
Note that the reading is expected to be in hiragana rather than katakana, except when the source term is partially or fully katakana. For example:
[
["アイゴ属", "freq", {"reading": "アイゴぞく", "frequency": 1}],
["あいご属", "freq", {"reading": "あいごぞく", "frequency": 2}] // not a real word
]
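If you are generating this data from your own list, a minimal sketch of a script could look like the following. The `build_freq_entries` helper and the input rows are hypothetical and only illustrate the entry shape shown above; how the resulting JSON is packaged into an importable dictionary is not covered here.

```python
import json


def build_freq_entries(rows):
    """Turn (term, reading, frequency) tuples into entries of the form
    [term, "freq", {"reading": ..., "frequency": ...}]."""
    entries = []
    for term, reading, frequency in rows:
        entries.append([term, "freq", {"reading": reading, "frequency": frequency}])
    return entries


# Hypothetical input rows; a real list would come from your own frequency data.
rows = [
    ("方", "ほう", 1),
    ("方", "かた", 2),
    ("明日", "あす", 3),
    ("明日", "あした", 4),
]

entries = build_freq_entries(rows)

# ensure_ascii=False keeps the Japanese text readable instead of \u escapes.
print(json.dumps(entries, ensure_ascii=False, indent=2))
```

Keeping one entry per term and reading pair is what allows different readings of the same term to carry different frequency values.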
Cool! Is it possible to use katakana for readings? All the tools made for creating frequency lists use katakana for the reading field.
I think the way that Yomichan dictionaries work is that they use hiragana for readings unless the expression contains katakana. So you would probably have to do a conversion for it to work as expected.
@Geniusssmit This feature is now available on the master branch if you want to use that for testing. The format used is as described in https://github.com/FooSoft/yomichan/issues/382#issuecomment-593178227. Let us know if you encounter any issues creating your dictionary data by creating a new issue or reopening.
Regarding the conversion from katakana: I tried many sites, but unfortunately none of them can handle text as large as a frequency list. How can I do that?
You would probably have to write or use a script to do it. Yomichan internally uses https://github.com/WaniKani/WanaKana, so you could use that.
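If you would rather not depend on a library, here is a rough sketch in plain Python. It does not use WanaKana; it relies on the fact that the katakana letters ァ through ヶ sit exactly 0x60 code points above their hiragana counterparts in Unicode. The `normalize_reading` helper is hypothetical, and the step that restores katakana for terms containing katakana is only a heuristic based on the rule described earlier in this thread, so check the output against your own data.

```python
import re

KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
OFFSET = 0x60            # distance between the katakana and hiragana blocks

# Runs of katakana letters, including the long vowel mark ー (U+30FC).
KATAKANA_RUN = re.compile("[\u30A1-\u30F6\u30FC]+")


def katakana_to_hiragana(text):
    """Shift katakana letters to hiragana; every other character is kept as-is."""
    return "".join(
        chr(ord(ch) - OFFSET) if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


def normalize_reading(term, reading):
    """Convert a katakana reading to hiragana, then restore any katakana
    runs that appear in the term itself (heuristic, hypothetical helper)."""
    result = katakana_to_hiragana(reading)
    cursor = 0
    for match in KATAKANA_RUN.finditer(term):
        run = match.group()
        index = result.find(katakana_to_hiragana(run), cursor)
        if index == -1:
            continue  # run not found in the reading; leave the reading as hiragana
        result = result[:index] + run + result[index + len(run):]
        cursor = index + len(run)
    return result


print(normalize_reading("明日", "アシタ"))
print(normalize_reading("アイゴ属", "アイゴゾク"))
```

Because this processes one reading at a time, it can be applied line by line to a list of any size. Note that the long vowel mark ー is left unchanged here, which may differ from what a dedicated library does.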
Is it possible to import a frequency list similar to:
That way, different readings of the compound words would show different frequency information. If it's impossible for now, that's my "feature request".