Support for frequency lists with readings

Geniusssmit commented 4 years ago

Is it possible to import a frequency list similar to this:

[["方","ホウ",1],
["方","カタ",2],
["明日","アス",3],
["明日","アシタ",4]]

That way, different readings of the same word would show different frequency information. If that isn't possible for now, then this is my feature request.

toasted-nutbread commented 4 years ago

It currently isn't, but support can be added. There is a workaround for this issue for supporting pitch accents (#61), and a similar approach could be taken for frequency lists.

However, is there a data source available which has this? A new dictionary would have to be created and imported using such data.

Geniusssmit commented 4 years ago

Absolutely! I have very good frequency lists and just want to import them. I can also create them myself. For example, here is a list created by analyzing all Japanese Netflix subtitles: https://mega.nz/#!gTgTDb7b!a1DGu0gk1d1BqAPNY7XQ2nrRvrBUVV1Ql6hH01aKOAA

or that one

frequency.txt

siikamiika commented 4 years ago

@Geniusssmit Out of curiosity, when was that list created? I think Netflix CJK subtitles are currently served as images and require OCR to get the text out of them. I'd be very interested in a way to still download them as text.

Geniusssmit commented 4 years ago

> @Geniusssmit Out of curiosity, when was that list created? I think Netflix CJK subtitles are currently served as images and require OCR to get the text out of them. I'd be very interested in a way to still download them as text.

February 13, 2019. You can download all the Japanese Netflix subtitles from here: https://mega.nz/#F!hgBW0QID!IsLg3YRSdfBJkjkJtzbDJQ

Geniusssmit commented 4 years ago

They are in SRT format, not OCRed.

siikamiika commented 4 years ago

@Geniusssmit Thanks! I think I'll find some use for those.

I assume that the archive was created by some third party before Netflix switched to image-based subtitles, as the Kodi plugin introduced in this video doesn't work anymore: https://www.youtube.com/watch?v=i2SudOnkiuc. The fork (?) linked in the video has been removed from GitHub, but the author seems to be working on a project that can be used to OCR newer Netflix subtitles: https://github.com/Zarxrax/png2srt.

Geniusssmit commented 4 years ago

That's cool. I'm actually interested in how pitch accent and readings would be implemented.

toasted-nutbread commented 4 years ago

It's in progress, but the main difference is that there is more metadata used to specify the reading for each term/expression. The feature is nearing completion, so after that's done, adding this should be simple.

toasted-nutbread commented 4 years ago

See: #385

toasted-nutbread commented 4 years ago

I have created an initial version of what you requested, based on the pitch accent structure, in my frequency-improvements branch. The specific commit is https://github.com/toasted-nutbread/yomichan/commit/20de591d410565d064ccf067807b3db1fd8f2064.

For the example in your opening post, the dictionary data would look like this:

[
  ["方", "freq", {"reading": "ほう", "frequency": 1}],
  ["方", "freq", {"reading": "かた", "frequency": 2}],
  ["明日", "freq", {"reading": "あす", "frequency": 3}],
  ["明日", "freq", {"reading": "あした", "frequency": 4}]
]

Note that the reading is expected to be in hiragana rather than katakana, except when the source term is partially or fully katakana. For example:

[
  ["アイゴ属", "freq", {"reading": "アイゴぞく", "frequency": 1}],
  ["あいご属", "freq", {"reading": "あいごぞく", "frequency": 2}] // not a real word
]
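As a rough way to check a data file against the rule above, the following Python sketch flags entries whose reading contains katakana even though the term itself has none. It is only an illustration: the function names `has_katakana` and `find_suspect_readings` are made up for this example and are not part of Yomichan, and the check assumes entries shaped like the samples above.

```python
def has_katakana(text):
    """Return True if text contains a katakana letter (U+30A1..U+30FA).

    The long vowel mark (U+30FC) and the middle dot (U+30FB) are
    deliberately excluded, since they are not katakana letters.
    """
    return any("\u30a1" <= ch <= "\u30fa" for ch in text)


def find_suspect_readings(entries):
    """Yield (term, reading) pairs that appear to break the hiragana rule.

    Assumption: each entry looks like [term, "freq", {"reading": ..., "frequency": ...}].
    Entries with any other shape are skipped rather than reported.
    """
    for term, mode, data in entries:
        if mode != "freq" or not isinstance(data, dict):
            continue
        reading = data.get("reading", "")
        if has_katakana(reading) and not has_katakana(term):
            yield term, reading


if __name__ == "__main__":
    sample = [
        ["方", "freq", {"reading": "ほう", "frequency": 1}],
        ["方", "freq", {"reading": "カタ", "frequency": 2}],
        ["アイゴ属", "freq", {"reading": "アイゴぞく", "frequency": 1}],
    ]
    for term, reading in find_suspect_readings(sample):
        print(term, reading)
```

Run on the sample, this reports only the `方` / `カタ` entry. The `アイゴ属` entry is accepted because the term itself contains katakana.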

Geniusssmit commented 4 years ago

Cool! Is it possible to use katakana for readings? All the tools made for creating frequency lists use katakana for the reading field.

toasted-nutbread commented 4 years ago

> Cool! Is it possible to use katakana for readings?

I think the way that Yomichan dictionaries work is that they use hiragana for readings unless the expression contains katakana. So you would probably have to do a conversion for it to work as expected.

toasted-nutbread commented 4 years ago

@Geniusssmit This feature is now available on the master branch if you want to use that for testing. The format used is as described in https://github.com/FooSoft/yomichan/issues/382#issuecomment-593178227. Let us know if you encounter any issues creating your dictionary data by creating a new issue or reopening.

Geniusssmit commented 4 years ago

> So you would probably have to do a conversion for it to work as expected.

I tried many sites, but unfortunately none of them can handle text as big as a frequency list. How can I do that?

toasted-nutbread commented 4 years ago

You would probably have to write or use a script to do it. Yomichan internally uses https://github.com/WaniKani/WanaKana, so you could use that.
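For illustration, here is a minimal standalone Python sketch of such a script. It does not use WanaKana. Instead it relies on the fact that the Unicode hiragana letters sit exactly 0x60 code points below their katakana counterparts. It assumes the input rows look like the `[term, katakana reading, frequency]` triples from the opening post. The function names are hypothetical, and terms that contain katakana themselves are left unconverted, because a mixed reading such as `アイゴぞく` would need the reading aligned against the term.

```python
import json

# Katakana ァ..ヶ (U+30A1..U+30F6) map onto hiragana ぁ..ゖ (U+3041..U+3096)
# by subtracting a fixed offset of 0x60.
KATAKANA_START, KATAKANA_END, OFFSET = 0x30A1, 0x30F6, 0x60


def katakana_to_hiragana(text):
    """Convert katakana letters to hiragana; other characters pass through."""
    return "".join(
        chr(ord(ch) - OFFSET) if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


def has_katakana(text):
    """Return True if text contains any convertible katakana letter."""
    return any(KATAKANA_START <= ord(ch) <= KATAKANA_END for ch in text)


def convert_entries(rows):
    """Turn [term, katakana_reading, frequency] rows into "freq" entries.

    Assumption: readings of terms without katakana should be all hiragana.
    Readings of terms that contain katakana are kept as they are and may
    need manual review.
    """
    for term, reading, frequency in rows:
        if not has_katakana(term):
            reading = katakana_to_hiragana(reading)
        yield [term, "freq", {"reading": reading, "frequency": frequency}]


if __name__ == "__main__":
    rows = [["方", "ホウ", 1], ["方", "カタ", 2], ["明日", "アス", 3], ["明日", "アシタ", 4]]
    print(json.dumps(list(convert_entries(rows)), ensure_ascii=False, indent=2))
```

Because this runs locally and handles one row at a time, it avoids the input size limits of web-based converters. For the four rows from the opening post it yields the same readings as the example entries shown earlier in this thread (`ほう`, `かた`, `あす`, `あした`).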
