JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

[sK] tag - definition and criteria #126

Closed JMdictProject closed 1 month ago

JMdictProject commented 2 months ago

[An "issue" is being opened for each of the kanji/reading tags to provide a forum for discussing the definitions and criteria for applying the tags. It is intended to be used to assist in the preparation of documentation of the tags.]

The [sK] tag is associated with kanji forms and is currently defined as "search-only kanji form". It is currently used in 3,473 entries.

The purpose of the tag is to indicate to applications using the database that the form should be used as a lookup key for the entry, but should not be included in a regular entry display due to its relative rarity when compared with the other forms.
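As a rough illustration of that behaviour, an application might index every kanji form for lookup but leave [sK] forms out of the entry display. The sketch below is a minimal, hypothetical model: the `KanjiForm` class, the `build_index` and `display_forms` helpers, and the entry id are assumptions made for illustration, not part of JMdict or of any existing library.

```python
from dataclasses import dataclass, field


@dataclass
class KanjiForm:
    text: str
    tags: set = field(default_factory=set)  # e.g. {"sK"} or {"rK"}


def build_index(entries):
    """Map every kanji form, including [sK] forms, to the ids of its entries."""
    index = {}
    for entry_id, forms in entries.items():
        for form in forms:
            index.setdefault(form.text, set()).add(entry_id)
    return index


def display_forms(forms):
    """Return the forms shown in a regular entry display ([sK] forms omitted)."""
    return [f.text for f in forms if "sK" not in f.tags]


# Hypothetical entry id; 断切る is the [sK] form discussed later in this issue.
entries = {
    "example-entry": [KanjiForm("断ち切る"), KanjiForm("断切る", {"sK"})],
}
index = build_index(entries)
```

With this model, a search for 断切る still reaches the entry through `index`, while `display_forms` lists only 断ち切る.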

The types of kanji forms which would receive this tag include:

The criteria for assigning the [sK] tag vary according to the circumstances. For uncommon variant kanji and irregular okurigana forms, an n-gram-based threshold of about 5% of occurrences would typically apply. For 混ぜ書き forms, the threshold may be as high as 20%. For 変換ミス cases, there would be no particular threshold unless there was a specific reason to keep the form visible.
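The thresholds above could be checked along the following lines. This is only a sketch under stated assumptions: the category names and the `suggest_sk` helper are hypothetical, "5% of occurrences" is read as the form's share of the combined n-gram count for all forms of the entry, and a share below the threshold is taken to suggest [sK]. The 20% figure is the upper end mentioned for 混ぜ書き forms, and any result would still be subject to editorial judgement.

```python
# Approximate thresholds quoted in this issue; None means "no particular threshold".
SK_THRESHOLDS = {
    "variant_kanji": 0.05,
    "irregular_okurigana": 0.05,
    "mazegaki": 0.20,     # 混ぜ書き: "may be as high as 20%"
    "henkan_miss": None,  # 変換ミス: hidden unless there is a reason not to
}


def suggest_sk(form_count, total_count, category):
    """Return True if the form's share of occurrences suggests an [sK] tag."""
    threshold = SK_THRESHOLDS[category]
    if threshold is None:
        return True  # no threshold applies; an editor may still keep it visible
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return form_count / total_count < threshold
```

For example, `suggest_sk(2, 100, "variant_kanji")` returns `True`, while `suggest_sk(10, 100, "variant_kanji")` returns `False`.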

JMdictProject commented 2 months ago

I see that in the discussion on #77 we agreed on a 3% threshold for splitting 旧字体 forms into [oK] and [sK].

JMdictProject commented 2 months ago

I'm copying in here a comment Marcus made in issue #125, as it's relevant to the discussion of [sK] tags as well.

One thing I've been meaning to bring up is that I think we're hiding too many kanji forms. For kyujitai, and rare kanji combos in compound noun entries, I'm all for hiding them. On the other hand, I think hiding away the type of rare or obscure kanji that are still included in one or two kokugos with [sK], rather than tagging them as [rK], can lead to confusion if somebody comes across one of those rare forms and ends up at an entry without any sign of it, or any other hint as to why they're seeing that entry.

I think that in general we've been avoiding using [sK] for forms that are in kokugos, instead tagging them as [rK].

JMdictProject commented 2 months ago

Rampaa made the following comment by email:

For what it's worth, I also think the sK tag is being overused.

I don't think the sK tag should be used for forms containing uncommon variant kanji; rK is the better fit for such forms. See, for example, 断切る: even though it can be found in Daijirin and Daijisen, it has been marked as sK.

断ち切る 180874
断切る 237

While I don't mind 断切る being hidden as it's an uncommon okurigana variant, it could be given an [io] tag and kept visible.
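For reference, the two counts quoted above can be turned into a share and a ratio as follows. This is plain arithmetic on those two figures only; it ignores the other kanji forms in the entry, so the share is relative to these two forms combined.

```python
# Counts as quoted in the comment above.
counts = {"断ち切る": 180874, "断切る": 237}

total = sum(counts.values())
share = counts["断切る"] / total              # share of the two forms combined
ratio = counts["断ち切る"] / counts["断切る"]  # how many times more common

print(f"share of 断切る: {share:.2%}")
print(f"断ち切る is about {ratio:.0f} times as common")
```

On these figures, 断切る accounts for roughly 0.13% of the two forms combined, well below any of the percentage thresholds mentioned in this issue.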

stephenmk commented 2 months ago

断(ち)切る is in most kokugos (daijr/s, etc.), so I don't think we would tag 断切る as irregular.

断切る is 1000 times less common than 断ち切る in both the Google corpus and the Kyoto/Melbourne corpus. Our entry has two different kanji forms as well (裁ち- and 截ち-) and sense restrictions. The entry is greatly simplified by hiding the forms with omitted okurigana.

I once suggested that we make a rare okurigana tag for these sorts of forms, but I think we agreed that they can just be hidden.

There was recently a relevant discussion on entry 1844170. The general sentiment seems to be that the cutoff for hiding these forms should be somewhere between 3% and 20%.

JMdictProject commented 1 month ago

A page covering these tags is being developed on the project Wiki. The text of this issue has been transferred there. See: https://www.edrdg.org/wiki/index.php/Kanji_and_Reading_Information_Fields