Importing a frequency list with readings

Thermospore commented 4 years ago

https://pj.ninjal.ac.jp/corpus_center/bccwj/en/freq-list.html

The Long Unit Word list is a bit too, well, long. But I think the Short Unit Word list would be a great addition to the Yomichan suggested dictionaries list, as there is not yet a frequency list there with reading data.

It seems support was added for this here: https://github.com/FooSoft/yomichan/pull/450

I looked into importing the data myself, but unfortunately I'm not very familiar with the formatting or import process for Yomichan. Reading through the pull request, it also seems like the readings in the list would need to be converted from katakana to hiragana. But there seem to be edge cases for words that have a combination of katakana and hiragana/kanji

toasted-nutbread commented 4 years ago

Are you looking for information about the data format that Yomichan uses for frequency lists, or something else?

There is a JSON schema describing the format here: https://github.com/FooSoft/yomichan/blob/master/ext/bg/data/dictionary-term-meta-bank-v3-schema.json
There is a node.js script which can validate dictionary .zip files here: https://github.com/FooSoft/yomichan/blob/master/dev/dictionary-validate.js
There is some example dictionary data here: https://github.com/FooSoft/yomichan/tree/master/test/data/dictionaries/valid-dictionary1

Thermospore commented 4 years ago

Apologies, yeah in hindsight I was pretty unspecific. Thanks for the info! I'll see if I can get the list imported and report back.

toasted-nutbread commented 4 years ago

No problem. Yomichan's dictionary JSON format isn't too complicated, the only real caveat (which you already pointed out) is using katakana instead of hiragana, but there are scripts/tools to do those conversions so you don't have to write them yourself; Yomichan uses https://github.com/WaniKani/WanaKana.

Thermospore commented 4 years ago

BCCWJ_short_freq_v0.zip

Got it most of the way there! It successfully imports too. The only remaining thing is to convert the applicable katakana readings to hiragana, which is admittedly beyond my abilities.

Hate to ask, but is there anyone interested in doing the conversion? :^)

toasted-nutbread commented 4 years ago

I can take a look at it sometime soon, please wait warmly.

toasted-nutbread commented 4 years ago

Here is a node script to modify the readings:

const fs = require('fs');
const wanakana = require('./wanakana.min.js');
const input = 'term_meta_bank_1.json';
const output = input.replace(/\.[^\.]*$/, '_adjusted$&');
const data = JSON.parse(fs.readFileSync(input, {encoding: 'utf8'}));
for (const item of data) {
  item[2].reading = wanakana.toHiragana(item[2].reading);
}
fs.writeFileSync(output, JSON.stringify(data, null, 0), {encoding: 'utf8'})

Depends on WanaKana; you can copy the file from here.

I would also recommend adding the additional author/attribution metadata to the index file, as described in this comment: https://github.com/FooSoft/yomichan/issues/834#issuecomment-693698073

Thermospore commented 4 years ago

Big thanks, I got it to work! The only caveat is that, as you mentioned, it seems Yomichan expects certain readings to remain in katakana

For example: キリスト教 -> キリストきょうＡＣ -> エーシー

so words like that don't show up. That's not a big deal, though, as most of those words I wouldn't be looking up the frequency for anyway (not to mention the other sans-reading frequency lists can still catch most of them), so the list is very useable in this state!

Here it is in its current state, with author/attribution data added: BCCWJ_short_freq_v0p5.zip

toasted-nutbread commented 4 years ago

This might get you better coverage; still not perfect, but better for hiragana and latin characters.

const fs = require('fs');
const wanakana = require('./wanakana.min.js');
const input = 'term_meta_bank_1.json';
const output = input.replace(/\.[^\.]*$/, '_adjusted$&');
const data = JSON.parse(fs.readFileSync(input, {encoding: 'utf8'}));
const isPartiallyJapanese = (input) => [...input].reduce((value, char) => value || wanakana.isJapanese(char), false);
for (const item of data) {
  const [expression, , {reading}] = item;
  if (expression === reading) { continue; } // Both in hiragana/katakana
  if (!isPartiallyJapanese(expression.normalize('NFKC'))) { continue; } // Latin/full width characters
  item[2].reading = wanakana.toHiragana(reading);
}
fs.writeFileSync(output, JSON.stringify(data, null, 4), {encoding: 'utf8'})

Thermospore commented 4 years ago

Wow, what a legend. I think the vast majority of entries are covered at this point!

Anyone feel free to use it: BCCWJ_short_freq_v1.zip

Just as a reference for anyone who wants to give it a go, here are two examples of entries that won't show properly:

[ "キングマン・アンド・アイブズ", "freq", { "reading": "きんぐまんあんどあいぶず", "frequency": 152442 } ],
[ "ラジアン毎秒", "freq", { "reading": "らじあんまいびょう", "frequency": 152442 } ],

Lastly, something to keep in mind is that the data source for this frequency list occasionally distinguishes between different parts of speech. For example if you search 切り you will see two frequency entries in Yomichan. Checking the original BCCWJ list will show that one instance is for its use as a suffix and the other for as a noun. Just something to be aware of!

toasted-nutbread commented 4 years ago

I think the real solution will be to eventually implement #461, so that readings are normalized before Yomichan adds them to the internal database. That will likely affect many other things as well, so it's not a simple change, but it should be the most effective one.

toasted-nutbread commented 3 years ago

https://github.com/toasted-nutbread/yomichan-bccwj-frequency-dictionary/releases/tag/1.0.0

Thermospore commented 3 years ago

Nice, I'll check em out! I'm curious, how did you handle the fact that in the long list they split up when a noun is used standalone vs when it is used as a suru verb? (Eg there would be an entry for 勉強 and 勉強する, iirc)

There might be other stuff the long list splits up as well that makes it trickier to reference words as simply as with the short list

toasted-nutbread commented 3 years ago

The long list is not handled any differently; both entries are included in the dictionary.

Thermospore commented 3 years ago

I'll enable all 3 (the old short list, the new short list, and the new long list) for a week or so and look out for any discrepancies!

I suspect the way the long list splits things up / over-specifies might be problematic, as it won't always cross reference properly with the way dictionaries format their entries. Some words will probably appear much less frequent than they actually are, or might not show up at all

For example if you ctrl+f the lists for 席捲 you get the following results:

Short list 20962 席捲 (182 hits)

Long list 22186 席捲する (154 hits) 122640 席捲 (16 hits) 282791 席捲し始める (5 hits)

Even if you are to search 席捲する in yomichan, all the dictionaries will have their entry as 席捲, with the する dropped. The long list would then return a rank of 122640, making the word look considerably rare when it is actually fairly common

Thermospore commented 3 years ago

Yea that seems to be the case AJ = anime & jdrama frequency list/ W = wikipedia IC = innocent corpus Bs = the old short list version from this thread

Thermospore commented 3 years ago

キリスト教 shows up! nice!
I assume it is intentional that only the first instance on the list is included (looking at アラビア数字 and 切り for example)?
words like ＡＩＤＳ, ＨＡＣＣＰ, ＡＣ show up on the old version, but not on the two new ones

toasted-nutbread commented 3 years ago

Even if you are to search 席捲する in yomichan, all the dictionaries will have their entry as 席捲, with the する dropped. The long list would then return a rank of 122640, making the word look considerably rare when it is actually fairly common

That is the expected behaviour when the dictionary doesn't contain an entry for 席捲する, as it is presumably searching for the case when it is used as a noun or something without -suru. I'm not saying the long unit word version is as useful as the short one, as I don't think most of the dictionaries available are in the same format / have all the same compound words, but generation of a dictionary using that data is supported.

I assume it is intentional that only the first instance on the list is included (looking at アラビア数字 and 切り for example)?

Yes, since there is no support for part-of-speech disambiguation currently. Having multiple entries would likely be confusing to anyone who doesn't know why there are multiple values due to how the source information is presented.

words like ＡＩＤＳ, ＨＡＣＣＰ, ＡＣ show up on the old version, but not on the two new ones

This is likely because they have readings that are fully in katakana, whereas the readings in the dictionary are using hiragana. In general, I don't think there's a way to know that this is the case for the base dictionary. For example, it seems that JMDict stores the entries made up exclusively with full width characters in katakana, but partial entries may still use hiragana. Furthermore, other readings may have non-readable characters in them.

{expression: "ＡＩＤ人工授精", reading: "ひはいぐうしゃかんじんこうじゅせい"}
{expression: "ＡＢＣ順", reading: "エービーシーじゅん"}
{expression: "ＡＶ", reading: "エイ・ヴィ"}

Again, this is likely an issue that would need to be resolved at some point on the Yomichan side rather than the dictionary side. The main change is that there is better (but not perfect) coverage for words like キリスト教, with katakana + kanji.

Thermospore commented 3 years ago

Gotcha, thanks for the responses. Interestingly, it looks like they got rid of the ・ in ＡＶ recently http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1958780.1

Thermospore commented 3 years ago

I've actually come to find the long unit list very useful. By comparing the frequency from the short and long unit lists you can infer information, such as if a word tends to be used in isolation or as part of a compound

Taking 席捲 above as an example, the fact that the long unit list returns a significantly lower rank than the other lists indicates this word tends to be used in a compound, not by itself

Initially I assumed the short list was simply an abbreviated version of the long list, but that was obviously missing the point

So yeah I agree it would have been a mistake to try and edit the long unit list to split things up, since that's just what the short unit list is... Thanks!

FooSoft / yomichan

Importing a frequency list with readings #855