Open briankrznarich opened 1 week ago
I should point out that the documentation/guidelines on the use of [sK] can be found on the page at: https://www.edrdg.org/wiki/index.php/Kanji_and_Reading_Information_Fields#[sK]_Kanji_Form_Recommended_as_Search-Only
As stated there: "The criteria for assigning the [sK] tag will vary according to the circumstances. In the cases of uncommon variant kanji and irregular okurigana forms an n-gram-based threshold of about 5% of occurrences would typically apply. For 混ぜ書き forms, the threshold may be as high as 20%."
Both the examples Brian has raised fall in the 5%-20% range, which is where it's a case-by-case decision.
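To make the bands concrete, here is a rough sketch of how the quoted guideline could be read. The function name, the band labels, and the hard cut-offs are my own assumptions for illustration; the wiki only says "about 5%" and "as high as 20%", and the middle band is still an editor's judgment call, not something a script decides.

```python
# Hypothetical helper: maps a form's share of n-gram occurrences to a rough band.
# The cut-offs mirror the figures quoted from the wiki; treating them as hard
# limits is an assumption made only for this illustration.
LOWER = 0.05   # "about 5%" for variant kanji / irregular okurigana
UPPER = 0.20   # "as high as 20%" for 混ぜ書き forms


def classify_form(share: float) -> str:
    """Return a rough band for a form given its share (0.0 to 1.0) of occurrences."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0.0 and 1.0")
    if share < LOWER:
        return "likely [sK]"
    if share < UPPER:
        return "case-by-case"
    return "keep visible"


# Shares mentioned in this thread (丸のみ 14.5%, まん延 10%)
for form, share in [("丸のみ", 0.145), ("まん延", 0.10)]:
    print(form, classify_form(share))
```

Both forms land in the middle band, which is why they end up being discussed individually.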
For the 丸のみ form's appearance in the 丸呑み entry, the n-gram count for 丸のみ is significantly increased by the fact that it's the most common form for another まるのみ entry (丸のみ; 丸鑿; 円鑿 【まるのみ】 (n) gouge; scorper; scauper). All the occurrences of 丸のみ in the Eijiro and GG5 examples are for this entry, not 丸呑み. I think the decision to hide the 丸のみ form in the 丸呑み entry is appropriate.
For the はやり病 form within the 流行り病 entry, I think I made a mistake in my edit of 8 October. I meant to add [sK] to the 流行病 form. That form has since been tagged as [io]. Again, there is a distortion of the n-gram counts as 流行病 is also the kanji form for the りゅうこうびょう entry. I will remove the [sK] from はやり病.
まん延 as an alternative form for 蔓延 also lies in the 5-20% range, but as pointed out, there are good reasons for keeping it visible.
"For the 丸のみ form's appearance in the 丸呑み entry, the n-gram count for 丸のみ is significantly increased by the fact that it's the most common form for another まるのみ entry (丸のみ; 丸鑿; 円鑿 【まるのみ】 (n) gouge; scorper; scauper)."
丸鑿 is not [vs] so in this case the n-gram is probably pretty accurate.
I agree that lowering the bar for forms that appear in kokugos would be a good idea. As said before, 丸のみ is in Sankoku, but I also see it mentioned in Meikyo and Jitenon's dictionary.
I should have checked for official updates to the [sK] policy in the wiki, sorry about that. I only went back and reviewed what I could find in the github conversations. I would have still opened this ticket, however.
To lead with a proposal:
Generally, I don't see the rationale for setting an "exceptional" threshold just because a term is 混ぜ書き. If anything, the fact that something like 蔓延 can be "correctly" written as まん延 is extra-notable, precisely because 混ぜ書き forms are usually not the norm (let alone terms that begin with kana, and transition to kanji). Displaying them is directly useful for end-users of this dictionary. Non-advanced users are not going to be able to recognize non-joyo forms that can be subbed out.
@parfait8566 noted that I used "丸のみして" to exclude gouges, so I think ~15% is a fair lower bound. An additional concern is that when we show 丸のみ as a surface form for "gouge" (carving tool) and hide it for "swallow whole", we imply (I'm sure unintentionally) that a random instance of 丸のみ is more likely to indicate a carving tool. That's just not the case. Metaphoric usage of "swallowing things whole" absolutely dwarfs people talking about gouges.
Finally, I'd raise the point (though I'm sure the people here are generally aware) that the ngram database has quite a long historical tail. While IME is making some old kanji forms more popular, in other cases "friendlier" kana and 混ぜ書き forms are becoming more commonplace. Rather than being a one-off, I think the complete reversal of ngram statistics for まん延防止/蔓延防止 reflects this. I think we should be cautious about excluding "modern" forms against an arbitrary threshold in a very broad ngram database. "gouge", for example, probably gets a lot of kanji ngrams from 19th and 20th century literature, when wood-carving was more common. But if modern Japanese people have switched to using 丸ノミ today (easily googled, by the way), then that is still a relevant form today, even if the ngrams seem to favor the kanji form.
Not that we have many good/accurate ways to gauge strictly-modern usage. It's just something to consider.
A good point about 丸鑿 not being [vs]. I see that 中辞典 and ルミナス use 丸のみ in examples in their 丸呑み entries, so it makes sense to keep it visible there. I have removed the tag now.
It's worth bearing in mind that the purpose of the [sK] tag is to reduce the "clutter" when there are multiple forms. As there are several criteria which can be used, including the number of forms, the relative frequencies, the jōyō status, etc. it has to be a case-by-case decision. I think our current fairly broad guidelines are OK. Individual cases are best discussed in the editing process comments, and in borderline cases I'd favour keeping the form visible.
Sounds like we're not really at odds here, for the most part.
I get the feeling that the media I am currently going through has a 混ぜ書き preference that's going to run me into a bunch of terms, and I didn't want to go wildly editing entries with no guidance on whether the edits were welcome. I didn't even realize the jōyō problem with 丸のみ when I started, I was just going off of percentages. When the result was only 3 surface forms, that didn't seem like too much clutter to me, but "clutter" gets into personal preferences.
I'll try to keep this all in mind in how I approach my edits and see how it goes. I appreciate the feedback.
Related if not entirely: have we made a clear decision on when to tag rare okurigana forms that are included in kokugos as [sK]? In one way it makes sense to always include them, so that apps/dictionary sites have the option to display headwords like they appear in Daijisen/Daijirin etc., e.g. "取(り)扱う". But OTOH sometimes they're so rare it makes little sense to prescribe them. (in the case of 取り扱う, we do currently hide 取扱う, though it clocks in at 5% of the usage in the ngrams)
This is a policy question/comment.
I have two example terms I've encountered recently, both in subtitles in popular media, both seen in mixed kana/kanji [sK] forms. I got one rejection for removing [sK] (丸呑み), and figured it would be better to ask than to repeat the same edit.
丸呑み -> 丸呑み;丸飲み;丸のみ[sK]
流行り病 -> 流行り病;流行病[io];はやり病[sK]
丸のみ is 14.5% of all forms, はやり病 is 20% (if you exclude overlaps with 流行病 りゅうこうびょう, anyway). These are way higher than the previously discussed thresholds for tagging forms as [sK], so I'm wondering about specific policy decisions that may have been made.
I'd also like to make a general case for restoring the visibility of these forms. First, the kokugos often flag kanji which are either "optional" or "not often written". In the cases above, the headwords for 流行り・はやり and 呑み・のみ are both flagged as such in Sankoku. So they have some claim to "legitimate" kokugo support in the derived terms. Second, for learners without a mastery of all kanji, knowing that it is "acceptable" to substitute particularly difficult/uncommon kanji can be useful information.
I'm all for cleaning up long lists of forms on entries. But if it's the difference between one or two forms, or two or three forms, I think it would be more useful to see all the forms.
Here's a particularly nice newspaper article filled with 丸のみ, just as an example: https://c.okinawatimes.co.jp/index.html?kijiid=OTPK20230527A0025000100735006
Edit: ran into another (but not currently [sK]): 蔓延・まん延. まん延 is 10%, but we have a new covid-related term, まん延防止/蔓延防止, which is 64% in favor of the 混ぜ書き form. If anything, Japanese seems to be moving in this direction...
流行り病 | 6913 | 70.3%
はやり病 | 2009 | 20.4%
はやりやまい | 907 | 9.2%

丸呑みした | 3212 | 57.9%
丸飲みした | 1423 | 25.6%
丸のみした | 841 | 15.1%
まるのみした | 76 | 1.4%
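For anyone who wants to reproduce the percentages, this is a minimal sketch of the arithmetic: each form's count divided by the total for that set of forms. The counts are copied from the tables above; the helper name `form_shares` is made up for the example and is not part of any JMdict tooling.

```python
def form_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each form's share of the total, as a percentage rounded to 1 decimal."""
    total = sum(counts.values())
    return {form: round(100 * n / total, 1) for form, n in counts.items()}


# Counts copied from the two tables above
hayariyamai = {"流行り病": 6913, "はやり病": 2009, "はやりやまい": 907}
marunomi_shita = {"丸呑みした": 3212, "丸飲みした": 1423, "丸のみした": 841, "まるのみした": 76}

for table in (hayariyamai, marunomi_shita):
    for form, pct in form_shares(table).items():
        print(f"{form} | {table[form]} | {pct}%")
    print()
```

Using the inflected form 丸のみした rather than bare 丸のみ is what keeps the carving-tool sense out of the second set of counts.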