JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Entries in which all kanji forms are tagged as outdated #103

Open stephenmk opened 9 months ago

stephenmk commented 9 months ago

The current JMdict policy seems to reserve usage of the [oK] ("outdated kanji") tag for old itaiji / 旧字体 forms that don't appear explicitly in the usually referenced resources. For example, the tag is used to contrast 合氣道 (old form) with 合気道 (modern form). There was a detailed discussion in issue 77.

By my count, there are currently 56 entries in which all of the kanji forms are tagged as outdated. This is misleading, as it implies that the kana form of the word is the preferred form, or that there exists some other kanji form which isn't outdated (but isn't displayed for some reason).
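A count like this can be reproduced with a short script. The following is only a rough sketch, not the script actually used for the figure above. It assumes the JMdict XML element names (`entry`, `ent_seq`, `k_ele`, `keb`, `ke_inf`) and runs on a tiny hand-written sample in which the tag text is literally `oK`; in the distributed file the tag is an entity that may expand to a longer description. The sequence numbers starting with 9 are made up.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the JMdict layout. Assumption: ke_inf holds the
# bare tag name "oK"; the real file uses an entity that may expand differently.
# Sequence numbers 9000001 and 9000002 are invented for illustration.
SAMPLE = """
<JMdict>
  <entry>
    <ent_seq>9000001</ent_seq>
    <k_ele><keb>合気道</keb></k_ele>
    <k_ele><keb>合氣道</keb><ke_inf>oK</ke_inf></k_ele>
  </entry>
  <entry>
    <ent_seq>2245450</ent_seq>
    <k_ele><keb>柃</keb><ke_inf>oK</ke_inf></k_ele>
  </entry>
  <entry>
    <ent_seq>9000002</ent_seq>
    <r_ele><reb>りんご</reb></r_ele>
  </entry>
</JMdict>
""".strip()


def all_kanji_outdated(root, tag="oK"):
    """Return (ent_seq, kanji forms) for entries whose every kanji form carries `tag`."""
    hits = []
    for entry in root.iter("entry"):
        k_eles = entry.findall("k_ele")
        if not k_eles:
            continue  # kana-only entries are not relevant here
        if all(tag in [inf.text for inf in k.findall("ke_inf")] for k in k_eles):
            hits.append((entry.findtext("ent_seq"), [k.findtext("keb") for k in k_eles]))
    return hits


result = all_kanji_outdated(ET.fromstring(SAMPLE))
print(result)
```

Only the 柃 entry is reported, because the sample entry for 合気道 also has an untagged form.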

Rather than tagging the kanji forms as outdated, I think the glossary itself should be tagged to indicate if the word is archaic, rare, or historical. If there is a modern kanji form of the word, it should be added. If the kanji form is recorded in the usual refs but is rarely used compared to its kana form (e.g. 擤む), I think the form should be tagged as rare [rK] rather than outdated.

sequence kanji forms
2245450 '柃'

Reviewed Entries

sequence kanji forms
~1146020~ '檸檬'
~1186670~ '下萠'
~1337660~ '縮緬紙'
~1337670~ '縮緬皺'
~1568840~ '潛心力'
~1606120~ '倚る'; '凭る'
~1632670~ '擤む'
~1924105~ '太枘'
~1983340~ '捥り'
~2006450~ '蝤蛑'
~2012830~ '鸊鷉'
~2033870~ '宇牟須牟骨牌'
~2093730~ '松濤館流'
~2094190~ '十刹'
~2094490~ '癩菌'
~2094510~ '救癩'
~2096000~ '脇寺'
~2096890~ '兜率天'
~2097160~ '螫す'
~2097210~ '素袷'
~2100220~ '蕭索'
~2138660~ '挵る'
~2163780~ '鵟'
~2176690~ '磽确'; '墝埆'
~2182120~ '鯥'
~2222620~ '襅'
~2230890~ '蜾蠃'
~2231240~ '銀蜻蜓'
~2231250~ '団扇蜻蜓'
~2232860~ '蠔油'
~2241570~ '蒴'
~2256090~ '鮞'
~2263590~ '都鱮'
~2265440~ '螠'
~2270280~ '菩提薩埵'
~2270290~ '薩埵'
~2273320~ '模様莧'
~2273330~ '莧'
~2273340~ '滑莧'
~2398300~ '癤'
~2420350~ '篊'
~2433280~ '胡簶'; '胡籙'; '箶籙'
~2453030~ '駃騠'
~2453260~ '硨磲'
~2454770~ '捥る'
~2656220~ '踠き'
~2476410~ '蝉魴鮄'
~2491150~ '猿麻桛'
~2507320~ '桫欏'; '杪欏'
~2542110~ '巋然'
~2587700~ '枘'
~2514240~ '鈹'
~2514400~ '鍰'
~2647810~ '砰'
~2828101~ '退る'
Marcusjmdict commented 8 months ago

I (and others) have dealt with all entries up to 蠔油, except うんすんカルタ, where the problem is that the kanji form is outdated, not in current use, and not included in the kokugos, but there's no other common kanji form. I don't think there's an issue here with tagging it as oK.

stephenmk commented 8 months ago

My interpretation is that there are two criteria for the [oK] tag:

  1. The form contains a 旧字体 form of a kanji instead of the 新字体 version (國 instead of 国, 學 instead of 学, etc.)
  2. The form is still somewhat commonly used (e.g. 合氣道, greater than 3% in the n-grams).

On the other hand, [rK] would be for forms that contain kanji that are rarely used with a word but are still worthwhile to display.

Since 宇牟須牟骨牌 doesn't meet either of my criteria for [oK], I would tag it as [rK].
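To make the two criteria concrete, here is a minimal sketch of the check. The `KYUJITAI` set is a three-character illustration rather than a real list, the counts are invented (not actual n-gram data), and the function names are hypothetical.

```python
# Illustrative subset only; a real check would need a full 旧字体 list.
KYUJITAI = set("國學氣")


def ngram_share(form, counts):
    """Fraction of the word's total n-gram count that belongs to `form`."""
    total = sum(counts.values())
    return counts.get(form, 0) / total if total else 0.0


def meets_ok_criteria(form, counts, threshold=0.03):
    """True only if the form contains a 旧字体 AND exceeds the usage threshold."""
    has_kyujitai = any(ch in KYUJITAI for ch in form)
    return has_kyujitai and ngram_share(form, counts) > threshold


# Hypothetical counts, invented for illustration.
aikido = {"合気道": 9000, "合氣道": 1000}
unsun = {"うんすんカルタ": 500, "宇牟須牟骨牌": 0}

print(meets_ok_criteria("合氣道", aikido))
print(meets_ok_criteria("宇牟須牟骨牌", unsun))
```

With these made-up numbers 合氣道 passes both tests, while 宇牟須牟骨牌 fails the first one already because none of its characters is in the 旧字体 set.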

robinjmdict commented 8 months ago

Yes, 宇牟須牟骨牌 should be rK. We stopped using oK for forms that don't contain 旧字体 several years ago. rK is now often used where oK would have been used in the past.

There are still a lot of oK tags that need to be removed. It shouldn't be difficult to write some code that checks every oK-tagged form in JMdict against a list of 旧字体. The bulk updater could do the rest.

Presumably most of the deleted oK tags could be replaced by rK but we'd need to check the n-grams. This part would be harder to automate.
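As a rough sketch of the first, easy step: split the oK-tagged forms into those that contain a listed 旧字体 (keep the tag) and those that don't (remove the tag, then check the n-grams by hand). The character set below is a tiny illustrative stand-in for a real list, and `partition_ok_forms` is a hypothetical helper, not part of any existing tool.

```python
# Illustrative stand-in for a full 旧字体 list.
KYUJITAI = set("國學氣")


def partition_ok_forms(forms, kyujitai=KYUJITAI):
    """Split oK-tagged forms into (keep oK, needs manual review)."""
    keep, review = [], []
    for form in forms:
        if any(ch in kyujitai for ch in form):
            keep.append(form)
        else:
            review.append(form)
    return keep, review


keep, review = partition_ok_forms(["合氣道", "宇牟須牟骨牌", "柃"])
print("keep oK:", keep)
print("review for rK:", review)
```

The review list is where the n-gram check would come in; this sketch deliberately stops short of assigning rK automatically.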

stephenmk commented 8 months ago

> Presumably most of the deleted oK tags could be replaced by rK but we'd need to check the n-grams. This part would be harder to automate.

This is all actually pretty simple to automate. I think the really difficult problem is deciding what is and isn't considered 旧字体.

There's a large list of 旧字体 on this page: https://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html

That page doesn't comprehensively list common 俗字 pairs like 掻・搔, 掴・摑, 嘘・噓, etc. I have a curated list of these kinds of pairs here: https://github.com/stephenmk/jitenbot/blob/main/data/entries/variant_kanji.csv

Even after combining those two lists, I see I'm still missing some pairs like 竃・竈.
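A sketch of how the combined list can be checked for gaps. The two inline lists are small stand-ins for the sources mentioned above (the actual file formats are not assumed here), and the helper names are hypothetical.

```python
# Stand-ins for the two sources, as (simplified, unsimplified) pairs.
LIST_A = [("国", "國"), ("学", "學")]
LIST_B = [("掻", "搔"), ("掴", "摑"), ("嘘", "噓")]


def merge_pairs(*sources):
    """Merge pair lists into one unsimplified -> simplified mapping."""
    merged = {}
    for pairs in sources:
        for simplified, unsimplified in pairs:
            merged.setdefault(unsimplified, simplified)
    return merged


def missing_pairs(expected, merged):
    """Return the expected pairs that the merged mapping does not cover."""
    return [(s, u) for s, u in expected if merged.get(u) != s]


merged = merge_pairs(LIST_A, LIST_B)
missing = missing_pairs([("掻", "搔"), ("竃", "竈")], merged)
print(missing)
```

Running a spot-check list like this against the merged mapping is one way to surface pairs such as 竃・竈 that neither source covers.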

So I guess the question we have to answer is whether or not we'd want to tag a form like "竈" as outdated even though it's the form displayed in all recently published references. It might make more sense to remove the [oK] tag from 竈 and put some sort of new 俗字 tag on 竃 instead.

robinjmdict commented 8 months ago

> This is all actually pretty simple to automate.

Retrieving the counts and calculating the percentages isn't so hard but I was thinking about false positives. We'd have to do a manual check.

> I think the really difficult problem is deciding what is and isn't considered 旧字体.

I think oK should be reserved for 旧字体 that have a corresponding 新字体 in the jōyō or jinmeiyō lists. Extended shinjitai includes quite a few characters that never really caught on and many of the unsimplified forms are still commonly used or even preferred (e.g. 潅・灌, 欝・鬱, 撹・攪, 侭・儘). I think extended shinjitai (or 俗字) and their unsimplified forms should be treated like any other 表外漢字. The rare ones can be tagged as rK.
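The rule in the previous paragraph could be sketched like this. Both tables are tiny illustrative subsets (a real check would need the full jōyō and jinmeiyō lists and a complete old-to-new mapping), and the function name is hypothetical.

```python
# Illustrative subsets only.
JOYO_OR_JINMEIYO = set("国学気")
OLD_TO_NEW = {"國": "国", "學": "学", "氣": "気", "灌": "潅", "攪": "撹"}


def ok_eligible_chars(form):
    """Characters in `form` whose simplified counterpart is a jōyō/jinmeiyō kanji."""
    return [ch for ch in form if OLD_TO_NEW.get(ch) in JOYO_OR_JINMEIYO]


print(ok_eligible_chars("合氣道"))
print(ok_eligible_chars("灌漑"))
```

Here 氣 qualifies because 気 is in the (sample) official set, while 灌 does not because its simplified form 潅 is an extended shinjitai outside it, so a form like 灌漑 would be handled like any other 表外漢字 form.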

stephenmk commented 8 months ago

> I think oK should be reserved for 旧字体 that have a corresponding 新字体 in the jōyō or jinmeiyō lists.

That sounds reasonable to me. And I think the list at 'asahi-net' that I posted above has all such forms neatly collected.

But what if the 旧字体 is itself also in the jōyō or jinmeiyō lists? For example, 竜 is in the former and 龍 is in the latter. Should forms containing 龍 be considered [oK]? I think it would be consistent to do so, but this isn't how these forms have been tagged in the past.

JMdictProject commented 8 months ago

I don't think there's a problem tagging forms with 龍 as [oK] when there's an equivalent using 竜. We do that with 気/氣 (氣 is jinmeiyō).

robinjmdict commented 8 months ago

I agree but I think the description for [oK] would need to be changed to "kyūjitai/old character form". 龍 isn't "outdated".

stephenmk commented 7 months ago

If we're going to have [oK] = old kanji orthography, wouldn't it make sense to have [ok] = old kana orthography?

Right now we're mostly just using the [ok] tag on obscure readings that are only recorded in a handful of the larger kokugos (koj, daij) rather than the smaller kokugos and JEs. Why not tag these readings as "rare" instead? It seems more logical to use the [ok] tag for old kana forms like 一口商ひ, 目合ひ, 二人づつ, etc.

Also, I think the tagging system would be easier to understand if it were more symmetrical. Nobody seems to have problems using the [iK] tag correctly, but the [rK] tag has been a frequent source of confusion.

| | kanji | kana | okurigana |
|---|---|---|---|
| irregular | iK | ik | io |
| rare | rK | rk | ro |
| outdated | oK | ok | oo |

Despite having three more tags to remember ([rk], [ro], and [oo]), I think this setup would actually be simpler.

I do wonder if [oo] is really necessary, but I guess technically 一口商ひ would be [oo] and 二人づつ would be [ok].
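One way to see the symmetry: in this proposed scheme every tag name follows mechanically from two letters, one for the kind of irregularity and one for what it applies to. A small sketch (the dictionary and function names are just for illustration):

```python
# First letter: kind of irregularity. Second letter: what it applies to.
SEVERITY = {"irregular": "i", "rare": "r", "outdated": "o"}
TARGET = {"kanji": "K", "kana": "k", "okurigana": "o"}


def tag_name(severity, target):
    """Compose a tag such as 'rK' from its two components."""
    return SEVERITY[severity] + TARGET[target]


grid = {s: [tag_name(s, t) for t in TARGET] for s in SEVERITY}
for severity, tags in grid.items():
    print(severity, tags)
```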

robinjmdict commented 7 months ago

We currently use [ok] for two types of readings:

  1. archaic readings that are phonetically similar to their modern forms (typically displayed as〔古くは「○○」とも〕notes in the larger kokugos), e.g. へいぎん (平均)
  2. archaic words that happen to share the same kanji and meaning as a non-archaic word, e.g. ない (地震)

I've never been comfortable with this. These are two very different things. I think the tag should be reserved for type-1 readings, and type-2 readings should either be dropped or made into separate [arch] entries. Unfortunately, most [ok] readings are type 2 so it would be a lot of work to implement this change.

I also think it's worth discussing whether obsolete archaic readings should be recorded at all. I don't think they're much use to users.

Fully support adding an [rk] tag. I'm not so sure about additional okurigana tags. We typically hide rare okurigana forms, and how often would we use (or know when to use) [oo]?

JMdictProject commented 6 months ago

> We currently use [oK] for two types of readings

oK -> ok

I have now added an [rk] tag.

stephenmk commented 6 months ago

> I've never been comfortable with this. These are two very different things. I think the tag should be reserved for type-1 readings, and type-2 readings should either be dropped or made into separate [arch] entries. Unfortunately, most [ok] readings are type 2 so it would be a lot of work to implement this change.

I also don't like these type-2 [ok] readings. I think we should split them into separate [arch]-tagged entries as we come across them even if they do meet the 2/3 merging criteria (same kanji form and same meaning).

For example, I just moved three [ok]-tagged readings (りゅうごう;りんき;りんきん) from our entry for りんご【林檎】 into a new entry. The result is much cleaner.
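In code terms, the split looks roughly like this. The dictionary layout is a made-up simplification for illustration, not the actual JMdict data model, and `split_ok_readings` is a hypothetical helper.

```python
def split_ok_readings(entry):
    """Move [ok]-tagged readings out of `entry` into a new [arch] entry.

    Returns (modern_entry, archaic_entry); archaic_entry is None when the
    entry has no [ok] readings.
    """
    kept = [r for r in entry["readings"] if "ok" not in r["tags"]]
    moved = [r for r in entry["readings"] if "ok" in r["tags"]]
    if not moved:
        return entry, None
    modern = {**entry, "readings": kept}
    archaic = {
        "kanji": list(entry["kanji"]),
        # The [ok] tag is dropped; the whole new entry is marked [arch] instead.
        "readings": [
            {"kana": r["kana"], "tags": [t for t in r["tags"] if t != "ok"]}
            for r in moved
        ],
        "misc": ["arch"],
    }
    return modern, archaic


ringo = {
    "kanji": ["林檎"],
    "readings": [
        {"kana": "りんご", "tags": []},
        {"kana": "りゅうごう", "tags": ["ok"]},
        {"kana": "りんき", "tags": ["ok"]},
        {"kana": "りんきん", "tags": ["ok"]},
    ],
    "misc": [],
}

modern, archaic = split_ok_readings(ringo)
print([r["kana"] for r in modern["readings"]])
print([r["kana"] for r in archaic["readings"]], archaic["misc"])
```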

JMdictProject commented 6 months ago

Splitting these "type 2" ok readings into their own entries is probably the best approach. They're not really a 2/3 problem as they are usually archaic.

Marcusjmdict commented 6 months ago

To what extent do we actually need these as archaic entries, though? I feel that most [arch] words in the dictionary provide little value and really could go.

On Sat, Dec 30, 2023 at 10:44 AM JMdictProject @.***> wrote:

> Splitting these "type 2" ok readings into their own entries is probably the best approach. They're not really a 2/3 problem as they are usually archaic.


stephenmk commented 5 months ago

We have about 70 entries in which 搔 is tagged with [oK]. I don't think this is a kyūjitai, because the simplified variant 掻 isn't a jōyō kanji. Kanjipedia (run by the company that does the kanji kentei exam) has 搔 marked as a 印刷標準字体 and the simplified variant marked as 簡易慣用字体 and 異体字.

If nobody minds, I'd like to change those [oK] tags to [rK]. I don't mind clicking through the list manually and approving each edit, but the daily submissions page would get cluttered with all the edits. I guess I could spread them out and only do 10 a day.


Edit: on second thought, I should probably just work on putting together a list of false-[oK] forms like we discussed above. Just like the 搔 forms, I'm sure there are many that can be safely and programmatically changed to [rK] without needing to dive into the details manually.

JMdictProject commented 5 months ago

> We have about 70 entries in which 搔 is tagged with [oK]. I don't think this is a kyūjitai, because the simplified variant 掻 isn't a jōyō kanji.

Whether or not the Education Ministry bureaucrats declare a kanji to be 常用 really has nothing to do with the older form being considered a 旧字体.

I glanced at a sample of the entries with forms containing 搔, and I think many of them can be made [sK] as they have low n-grams and are not in references.

stephenmk commented 5 months ago

> Whether or not the Education Ministry bureaucrats declare a kanji to be 常用 really has nothing to do with the older form being considered a 旧字体.

Most kokugos are very explicit in their definitions of 旧字体 that the word refers specifically to kanji with simplified forms on the post-WW2 当用漢字 and 常用漢字 lists. Daijirin alludes to a broader sense of the word ("漢字の字体で、古くから用いられていた字体") but goes on to emphasize the narrow sense.

Since the 表外漢字字体表 was only published in the year 2000, it seems the Google n-grams from 2007 aren't a great resource for tracking the usage of these 印刷標準 kanji. Unlike the old characters that were reformed via the jōyō list, the 印刷標準 kanji are definitely being used in many newly published works. Users of JMdict are bound to encounter them and wonder what they are.

I'm not too concerned about whether we tag them as [sK] or [rK] (ideally I think we'd have a new tag specifically for these 印刷標準字体 forms), but I just think it's misleading to group them in with the 旧字体 [oK] forms.

> I think many of them can be made [sK] as they have low n-grams and are not in references.

For what it's worth, almost every kokugo published within the past ~15 years uses these 印刷標準字体 like 搔, 焰, 噓, 摑, etc. Daijisen (the online version at least) seems to be the only notable exception.

JMdictProject commented 2 months ago

Just picking up on the classification of terms using kanji from the 表外漢字字体表 list (the 印刷標準字体 are the "standard" glyphs for printing).

AFAICT this only becomes an issue when that list revived an old form of a kanji and we now have two or more forms for the same kanji. For an example of this, consider 唖 and 啞. The very first JIS kanji standard (in 1978) had the 啞 glyph as the standard. When the 当用漢字 were replaced by the 常用漢字, the glyph was changed to 唖 (the code stayed the same). When the supplementary JIS X 0212 standard was released in 1990, 啞 was included there, so we effectively had the one kanji with two codes for the different shapes. Then the 表外漢字字体表 list came along and seemed to recommend that people should go back to using 啞 instead of 唖, which had been the norm for 30 years.

We have 11 entries containing 唖 and 啞, e.g. 聾唖学校/聾啞学校. Most if not all of the 啞 cases are tagged oK or sK. For the sake of consistency we probably should make them [rK] until (and if) their usage becomes common enough to be tag-free.
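A sketch of how such paired forms could be found for review: map the revived glyph back to the common one and group forms that collapse to the same string. The helper name and the sample input are illustrative only.

```python
from collections import defaultdict

# Map the revived 印刷標準字体 glyph onto the glyph that had been the norm.
TO_COMMON = str.maketrans({"啞": "唖"})


def variant_groups(forms):
    """Group forms that differ only by the 唖/啞 glyph choice."""
    groups = defaultdict(list)
    for form in forms:
        groups[form.translate(TO_COMMON)].append(form)
    # Keep only groups where more than one variant actually occurs.
    return {key: members for key, members in groups.items() if len(members) > 1}


groups = variant_groups(["聾唖学校", "聾啞学校", "合気道"])
print(groups)
```

Each group that comes back is a set of forms whose tags should be reviewed together so that the 啞 forms end up tagged consistently.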