Closed · Thermospore closed this 3 years ago
Are you looking for information about the data format that Yomichan uses for frequency lists, or something else?
Apologies, yeah in hindsight I was pretty unspecific. Thanks for the info! I'll see if I can get the list imported and report back.
No problem. Yomichan's dictionary JSON format isn't too complicated; the only real caveat (which you already pointed out) is using katakana instead of hiragana, but there are scripts/tools to do those conversions so you don't have to write them yourself. Yomichan uses https://github.com/WaniKani/WanaKana.
Got it most of the way there! It successfully imports too. The only remaining thing is to convert the applicable katakana readings to hiragana, which is admittedly beyond my abilities.
Hate to ask, but is there anyone interested in doing the conversion? :^)
I can take a look at it sometime soon, please wait warmly.
Here is a node script to modify the readings:
// Converts every frequency entry's reading from katakana to hiragana
const fs = require('fs');
const wanakana = require('./wanakana.min.js');

const input = 'term_meta_bank_1.json';
const output = input.replace(/\.[^\.]*$/, '_adjusted$&'); // term_meta_bank_1_adjusted.json

const data = JSON.parse(fs.readFileSync(input, {encoding: 'utf8'}));
for (const item of data) {
    item[2].reading = wanakana.toHiragana(item[2].reading);
}
fs.writeFileSync(output, JSON.stringify(data, null, 0), {encoding: 'utf8'});
Depends on WanaKana; you can copy the file from here.
I would also recommend adding the additional author/attribution metadata to the index file, as described in this comment: https://github.com/FooSoft/yomichan/issues/834#issuecomment-693698073
Big thanks, I got it to work! The only caveat is that, as you mentioned, it seems Yomichan expects certain readings to remain in katakana.
For example: キリスト教 -> キリストきょう, AC -> エーシー
Words like that don't show up. That's not a big deal, though, as most of those words I wouldn't be looking up the frequency for anyway (not to mention the other sans-reading frequency lists can still catch most of them), so the list is very usable in this state!
Here it is in its current state, with author/attribution data added: BCCWJ_short_freq_v0p5.zip
This might get you better coverage; still not perfect, but better for hiragana and Latin characters.
const fs = require('fs');
const wanakana = require('./wanakana.min.js');

const input = 'term_meta_bank_1.json';
const output = input.replace(/\.[^\.]*$/, '_adjusted$&');

const data = JSON.parse(fs.readFileSync(input, {encoding: 'utf8'}));

// True if at least one character in the string is Japanese (kana or kanji)
const isPartiallyJapanese = (input) => [...input].reduce((value, char) => value || wanakana.isJapanese(char), false);

for (const item of data) {
    const [expression, , {reading}] = item;
    if (expression === reading) { continue; } // Expression is already pure kana; leave the reading as-is
    if (!isPartiallyJapanese(expression.normalize('NFKC'))) { continue; } // Latin/full-width expressions keep their katakana reading
    item[2].reading = wanakana.toHiragana(reading);
}
fs.writeFileSync(output, JSON.stringify(data, null, 4), {encoding: 'utf8'});
Wow, what a legend. I think the vast majority of entries are covered at this point!
Anyone feel free to use it: BCCWJ_short_freq_v1.zip
Just as a reference for anyone who wants to give it a go, here are two examples of entries that won't show properly:
[ "キングマン・アンド・アイブズ", "freq", { "reading": "きんぐまんあんどあいぶず", "frequency": 152442 } ],
[ "ラジアン毎秒", "freq", { "reading": "らじあんまいびょう", "frequency": 152442 } ],
Lastly, something to keep in mind is that the data source for this frequency list occasionally distinguishes between different parts of speech. For example, if you search 切り, you will see two frequency entries in Yomichan. Checking the original BCCWJ list will show that one instance is for its use as a suffix and the other for its use as a noun. Just something to be aware of!
I think the real solution will be to eventually implement #461, so that readings are normalized before Yomichan adds them to the internal database. That will likely affect many other things as well, so it's not a simple change, but it should be the most effective one.
Nice, I'll check 'em out! I'm curious, how did you handle the fact that in the long list they split up when a noun is used standalone vs. when it is used as a suru verb? (E.g. there would be an entry for 勉強 and one for 勉強する, iirc.)
There might be other stuff the long list splits up as well that makes it trickier to reference words as simply as with the short list
The long list is not handled any differently; both entries are included in the dictionary.
I'll enable all 3 (the old short list, the new short list, and the new long list) for a week or so and look out for any discrepancies!
I suspect the way the long list splits things up / over-specifies might be problematic, as it won't always cross reference properly with the way dictionaries format their entries. Some words will probably appear much less frequent than they actually are, or might not show up at all
For example, if you ctrl+F the lists for 席捲, you get the following results:
Short list: 20962 席捲 (182 hits)
Long list: 22186 席捲する (154 hits), 122640 席捲 (16 hits), 282791 席捲し始める (5 hits)
Even if you search 席捲する in Yomichan, all the dictionaries will have their entry as 席捲, with the する dropped. The long list would then return a rank of 122640, making the word look considerably rarer than it actually is, when it is in fact fairly common.
Yeah, that seems to be the case. (AJ = anime & J-drama frequency list, W = Wikipedia, IC = Innocent Corpus, Bs = the old short list version from this thread)
キリスト教 shows up! nice!
I assume it is intentional that only the first instance on the list is included (looking at アラビア数字 and 切り for example)?
Words like AIDS, HACCP, and AC show up on the old version, but not on the two new ones.
Even if you search 席捲する in Yomichan, all the dictionaries will have their entry as 席捲, with the する dropped. The long list would then return a rank of 122640, making the word look considerably rare when it is actually fairly common.
That is the expected behaviour when the dictionary doesn't contain an entry for 席捲する, as it is presumably searching for the case when it is used as a noun or something without -suru. I'm not saying the long unit word version is as useful as the short one, as I don't think most of the dictionaries available are in the same format / have all the same compound words, but generation of a dictionary using that data is supported.
I assume it is intentional that only the first instance on the list is included (looking at アラビア数字 and 切り for example)?
Yes, since there is no support for part-of-speech disambiguation currently. Having multiple entries would likely be confusing to anyone who doesn't know why there are multiple values due to how the source information is presented.
Words like AIDS, HACCP, AC show up on the old version, but not on the two new ones.
This is likely because they have readings that are fully in katakana, whereas the readings in the dictionary use hiragana. In general, I don't think there's a way to know that this is the case from the base dictionary. For example, it seems that JMDict stores entries made up exclusively of full-width characters with katakana readings, but partially full-width entries may still use hiragana. Furthermore, other readings may have non-readable characters in them:
{expression: "AID人工授精", reading: "ひはいぐうしゃかんじんこうじゅせい"}
{expression: "ABC順", reading: "エービーシーじゅん"}
{expression: "AV", reading: "エイ・ヴィ"}
Again, this is likely an issue that would need to be resolved at some point on the Yomichan side rather than the dictionary side. The main change is that there is better (but not perfect) coverage for words like キリスト教, with katakana + kanji.
Gotcha, thanks for the responses. Interestingly, it looks like they got rid of the ・ in AV recently: http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1958780.1
I've actually come to find the long unit list very useful. By comparing the frequency from the short and long unit lists you can infer information, such as whether a word tends to be used in isolation or as part of a compound. Taking 席捲 above as an example, the fact that the long unit list returns a significantly lower rank than the other lists indicates this word tends to be used in a compound, not by itself.
Initially I assumed the short list was simply an abbreviated version of the long list, but that was obviously missing the point
So yeah I agree it would have been a mistake to try and edit the long unit list to split things up, since that's just what the short unit list is... Thanks!
https://pj.ninjal.ac.jp/corpus_center/bccwj/en/freq-list.html
The Long Unit Word list is a bit too, well, long. But I think the Short Unit Word list would be a great addition to the Yomichan suggested dictionaries list, as there is not yet a frequency list there with reading data.
It seems support was added for this here: https://github.com/FooSoft/yomichan/pull/450
I looked into importing the data myself, but unfortunately I'm not very familiar with the formatting or import process for Yomichan. Reading through the pull request, it also seems like the readings in the list would need to be converted from katakana to hiragana. But there seem to be edge cases for words that have a combination of katakana and hiragana/kanji.