JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

[rK] tag - definition and criteria #125

Open JMdictProject opened 6 months ago

JMdictProject commented 6 months ago

[An "issue" is being opened for each of the kanji/reading tags to provide a forum for discussing the definitions and criteria for applying the tags. It is intended to be used to assist in the preparation of documentation of the tags.]

The [rK] tag is associated with kanji forms and is currently defined as "rarely-used kanji form". It is currently used in 3,006 entries.

The purpose of the tag is to inform users that the form is rarely used in comparison with other kanji form(s) or kana-only forms, but is being kept visible because it occurs in major references such as 国語辞典. It would typically be added to forms that occur with frequencies less than 10% of those of the more common forms.

An example is the 保育 (nurturing, rearing) entry, which also has the 哺育 form with an [rK] tag. The respective n-gram counts are 6,119,217 (98.6%) and 28,230 (0.5%). The 哺育 form is in most reference dictionaries, and hence should not be hidden.

JMdictProject commented 6 months ago

As the 保育 entry is being edited, the following may be a better example to quote:

An example is the 付近 (neighborhood) entry, which also has the 附近 form with an [rK] tag. The respective n-gram counts are 7,671,290 (95.3%) and 85,884 (1.1%). The 附近 form is in most reference dictionaries, and hence should not be hidden.

It's been pointed out that the [rK] tag is not typically used for forms in entries tagged as being archaic.

Marcusjmdict commented 6 months ago

One thing I've been meaning to bring up is that I think we're hiding too many kanji forms. For kyujitai, and rare kanji combos in compound noun entries, I'm all for hiding them. On the other hand, I think hiding away the type of rare or obscure kanji that are still included in one or two kokugos with [sK] rather than tagging them as [rK] can lead to confusion if somebody does come across one of those rare forms and ends up at an entry without any sign of it, or any other hint as to why they're seeing that entry. I also think that [rK] forms should preferably not be displayed as prominently as other forms to begin with, so the extra forms being "clutter" shouldn't really be an issue here.

JMdictProject commented 6 months ago

I copied the initial part of Marcus' comment into the [sK] (#126) as well, as it's relevant there.

> I also think that [rK] forms should preferably not be displayed as prominently as other forms to begin with, so the extra forms being "clutter" shouldn't really be an issue here.

That's really up to the app/site to deal with. For WWWJDIC I might look at dropping the font size back a notch for the rK-tagged forms.

stephenmk commented 6 months ago

> I also think that [rK] forms should preferably not be displayed as prominently as other forms to begin with, so the extra forms being "clutter" shouldn't really be an issue here.

I don't agree. The vast majority (certainly >99%) of entries in JMdict contain fewer than 4 kanji forms that are worth displaying to users. I don't think we should take the [rK] tag as a license to load entries with obscure kanji forms that aren't recorded by other dictionaries (see: カルタ).

> I think hiding away the type of rare or obscure kanji that are still included in one or two kokugos with [sK] rather than tagging them as [rK] can lead to confusion if somebody does come across one of those rare forms and ends up at an entry without any sign of it, or any other hint as to why they're seeing that entry.

Inevitably there are going to be kanji forms that will be useful to have as search keys but are not worth displaying in the main entry. I think it's up to the apps to redirect users in a sensible way. When I search Google for "上り旗," it says "Showing results for のぼり旗" and asks if I want to search for exactly "上り旗" instead. I think this is a reasonable way to expect search keys to work.
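The redirection described above could be sketched roughly as follows. This is only an illustration: the entry layout, the `build_index` and `lookup` names, and the wording of the notice are all assumptions, and the 上り旗/のぼり旗 pair is borrowed from the Google example rather than taken from actual JMdict data.

```python
# Hypothetical entry layout: forms shown to users vs. forms kept only as
# search keys. Neither the field names nor the data come from JMdict itself.
ENTRIES = [
    {"visible": ["のぼり旗"], "search_only": ["上り旗"]},
]


def build_index(entries):
    """Map each form to (headword, is_search_only)."""
    index = {}
    for entry in entries:
        headword = entry["visible"][0]
        for form in entry["visible"]:
            index[form] = (headword, False)
        for form in entry["search_only"]:
            index[form] = (headword, True)
    return index


def lookup(query, index):
    """Return (headword, notice). The notice is set only for search-only forms."""
    hit = index.get(query)
    if hit is None:
        return None, None
    headword, is_search_only = hit
    if is_search_only:
        return headword, f"Showing results for {headword} (searched for {query})"
    return headword, None


INDEX = build_index(ENTRIES)
```

With this layout, `lookup("上り旗", INDEX)` returns the のぼり旗 headword together with a notice, while a search for a visible form returns no notice.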

My distribution of JMdict for Yomitan makes the redirection process very clear.

[Image: ushinau]

JMdictProject commented 5 months ago

A page covering these tags is being developed on the project Wiki. The text of this issue has been transferred there. See: https://www.edrdg.org/wiki/index.php/Kanji_and_Reading_Information_Fields

stephenmk commented 2 months ago

We've been using 3% as the threshold for rK forms for years, but it has become 10% in the wiki definition. Was this intentional?


https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=1602170.1

A* 2022-04-03 03:36:05 Stephen Kraus 6% of total usages. meets the threshold for rK?

A 2022-04-03 15:38:30 Robin Scott Not quite. I think everything we've tagged as rK so far has been <3%.

https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=1608710.1

A* 2023-11-30 15:29:39 Marcus Richert And 3.6% is above the rK threshold. Not seeing it in kokugos though so sK works.

JMdictProject commented 2 months ago

When I opened this issue in April as a step towards that documentation page I wrote 10%. It was possibly a typo, as I see that in the original issue where we discussed the tag (#9) we settled on 3% as the criterion.

I'll change the Wiki page to 3%. I'll also reopen this for a while.

parfait8566 commented 1 month ago

What is the reasoning behind putting the limit at 3%? It seems to me that 10% would be far more helpful. If an entry has two forms whose n-gram counts account for 90% and 10% respectively, what would be the problem with tagging the latter as [rK]? It's not like you'd be hiding them or anything.

robinjmdict commented 1 month ago

[rK] was created so that we had a way of indicating that a kanji form is rarely used and is probably not worth learning. Apps and websites using JMdict data may choose to display these forms less prominently (or even hide them entirely).

Any threshold is ultimately arbitrary but 3% is what we settled on. I don't think a form that accounts for 10% of usage can be described as "rare". There are times when I wonder if even 3% is too high, especially when the form has an n-gram count in the hundreds of thousands.
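To make the difference between the two candidate thresholds concrete, here is a minimal sketch that applies a strict "less than" test to percentages quoted earlier in this thread. The `below_threshold` helper is hypothetical, and the test covers only the frequency criterion, not whether a form appears in reference dictionaries.

```python
def below_threshold(share_percent, threshold_percent=3.0):
    """True if a form's share of total usage is strictly below the threshold."""
    return share_percent < threshold_percent


# Shares quoted in this thread: 哺育 (0.5%), 附近 (1.1%), the two entry
# comments (3.6% and 6%), and the 90/10 hypothetical (10%).
CITED_SHARES = [0.5, 1.1, 3.6, 6.0, 10.0]

for share in CITED_SHARES:
    at_3 = below_threshold(share, 3.0)
    at_10 = below_threshold(share, 10.0)
    print(f"{share:>5.1f}%  under 3%: {at_3!s:<5}  under 10%: {at_10}")
```

Under these assumptions only the 3.6% and 6% cases change outcome between the two thresholds; a form at exactly 10% fails a strict "less than 10%" test.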

parfait8566 commented 1 month ago

But [rK] is used exclusively for forms that are in multiple Japanese dictionaries, so tagged forms should be at least to some extent worth noting. Maybe a first-time Japanese learner doesn't need to mind them, but more advanced learners might encounter them more frequently (especially in literature). In my opinion, the useful thing about the tag is that it makes clear that this particular form is relatively less popular than the other ones, not that it's "rare" in absolute terms. Which is also why I think it would make sense to use the tag even in [rare] entries.

Let's take バカ as an example:

| Form | N-grams | Percentage |
|---|---|---|
| 馬鹿 | 7,054,362 | 40.4% |
| 莫迦 | 77,209 | 0.4% |
| バカ | 10,312,235 | 59.1% |

莫迦 is [rK] tagged. I don't think what a Japanese learner should take away from that is "莫迦 is rare in absolute terms, I can ignore it and will never encounter it". You're still going to encounter 莫迦 a lot, especially if you read any amount of novels. The key point is more "this form is relatively less frequent than the others, which are more popular".
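For reference, the percentages in the バカ example can be reproduced from the raw n-gram counts. This is a small sketch, with `form_shares` as a hypothetical helper name, and it assumes the three listed forms make up the whole total.

```python
def form_shares(counts):
    """Return each form's share of the combined n-gram count, in percent."""
    total = sum(counts.values())
    return {form: 100.0 * n / total for form, n in counts.items()}


# N-gram counts as quoted in this comment.
BAKA_COUNTS = {"馬鹿": 7_054_362, "莫迦": 77_209, "バカ": 10_312_235}
BAKA_SHARES = form_shares(BAKA_COUNTS)

for form, share in BAKA_SHARES.items():
    print(f"{form}\t{BAKA_COUNTS[form]:,}\t{share:.1f}%")
```

The share is relative by construction: 莫迦 comes out at about 0.4% even though its count of 77,209 is large in absolute terms.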

I think it's best to revise the [rK] policy and

Marcusjmdict commented 1 month ago

I don't find the arguments put forward convincing - rK shouldn't be read as "you will never encounter it", yet it's absolutely sound advice to learners to ignore 莫迦. I don't think anybody is keen on arbitrarily changing an arbitrarily decided on number to something equally (or more) arbitrary, not least considering the effort it would take to review all entries that already have rK and rk tags. I think the "job cost" alone makes this argument a waste of time - even if it did represent a minute improvement, we have over 3000 entries tagged rK. Until there's a reliable AI that could automate the entire thing for us, the time spent vetting all of them can certainly be better spent elsewhere.

parfait8566 commented 1 month ago

> I don't find the arguments put forward convincing - rK shouldn't be read as "you will never encounter it", yet it's absolutely sound advice to learners to ignore 莫迦

I'm not sure I understand what you're trying to get at here. My point is that [rK] does not mean "rare in absolute terms", it means "relatively rare". 莫迦 is several times more popular than thousands of entries not tagged [rare], [obs], etc. But it's still relatively rare compared to the other forms. It might be sound advice to tell learners to ignore 莫迦, but it would also be sound advice to tell them you're going to encounter this form quite often.

> I think the "job cost" alone makes this argument a waste of time - even if it did represent a minute improvement, we have over 3000 entries tagged rK. Until there's a reliable AI that could automate the entire thing for us, the time spent vetting all of them can certainly be better spent elsewhere.

I think it'd be a significant improvement, but I'm definitely not arguing that all editors should be forced at gunpoint to go back through all the 3k+ entries and edit them one by one right now. I suggest we change the policy moving forward (1. increase the 3% limit; 2. allow its use in [rare] entries), and anybody willing to edit the old entries can do so. I'd certainly be very happy to help with as many entries as I can if the policy is approved.