JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Policy on ateji #143

Open parfait8566 opened 1 month ago

parfait8566 commented 1 month ago

I propose converting [ateji] to [atejiP] (phonetic ateji) and adding [atejiM] (ateji for meaning). This would be useful for entries where you can't just use [gikun] on the readings because there are multiple kanji forms and not all of them are ateji.

robinjmdict commented 1 month ago

I'm not sure it's a good idea to have two different tags ([gikun] and [atejiM]) for what is essentially the same concept.

My preference would be to convert [gikun] to a kanji tag and to split out all gikun/jukujikun readings into their own entries. It would be a lot of work but I think it's a clearer and more intuitive way of conveying the information. This is the kokugos' approach. For example, Daijirin puts angle brackets around jukujikun kanji.

I think it looks a bit odd having [gikun] on a reading when it's the only reading for that word (e.g. やぎ/山羊).

parfait8566 commented 1 month ago

I'm not sure it's a good idea to have two different tags ([gikun] and [atejiM]) for what is essentially the same concept. My preference would be to convert [gikun] to a kanji tag and to split out all gikun/jukujikun readings into their own entries. It would be a lot of work but I think it's a clearer and more intuitive way of conveying the information.

I like converting [gikun] to a kanji tag, but I don't really see how splitting out entries just for gikun/jukujikun entries makes the information clearer or more intuitive. If anything it seems confusing to break the 2/3 rule just for gikun/jukujikun readings.

This is the kokugos' approach. For example, Daijirin puts angle brackets around jukujikun kanji.

Maybe I'm misunderstanding what you're saying, but since JMdict organizes data differently compared to kokugos I think the comparison is missing the mark. The brackets can be handled by sites and apps. This is how Jitendex for Yomitan shows 一昨日 for example:

[image: screenshot of the Jitendex for Yomitan entry for 一昨日]

This seems very clear and intuitive to me (certainly more than kokugos).

I think that ideally a kanji tag with options to specify individual readings (in case the entry has multiple readings and not all of them are gikun/jukujikun) is the best deal. If such a thing is impossible to code due to some technical limitations, then [atejiM] is the next best thing in my opinion. Splitting out entries in this case doesn't seem very helpful.

Since we're here, would it be technically feasible to introduce options to specify what particular kanji are ateji (in compound entries, etc.)?

robinjmdict commented 1 month ago

Since we're here, would it be technically feasible to introduce options to specify what particular kanji are ateji (in compound entries, etc.)?

This is another reason I think [gikun] should be a kanji tag. The system we're currently using doesn't work for verbs. For example, in 躊躇う/ためらう, only 躊躇 has a jukujikun reading. Daijirin shows this very clearly: 【〈躊躇〉う】.

I think that ideally a kanji tag with options to specify individual readings (in case the entry has multiple readings and not all of them are gikun/jukujikun) is the best deal.

This sounds messy/inelegant. We don't want another form of restriction in the kanji field.

parfait8566 commented 1 month ago

This sounds messy/inelegant. We don't want another form of restriction in the kanji field.

I disagree. I think the opposite is true: splitting out entries generally introduces more confusion.

How I thought a new [gikun] tag might work is using {} brackets.

The kanji field for 躊躇う:

{躊躇}う[gikun]

For 梅雨寒:

{梅雨}寒[gikun]

For 自棄:

{自棄}[gikun=やけ];焼け

This is obviously just the first idea that came to my mind and the actual implementation could be completely different. But I still don't find any of these examples messy or hard to understand. Maybe you could give me examples of entries where this could become very problematic and confusing, but even then I'm convinced they would be a tiny minority. If they are indeed unsolvable, at worst you can just split them out. For the majority of entries, including an implementation of reading restrictions seems easier and more useful for both editors and end users.
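To make the idea concrete, here is a minimal Python sketch of how such a kanji field could be parsed. None of this is existing JMdict syntax or tooling: the `{}` grouping, the `[tag=reading]` restriction (assumed here to be closed with `]`), the `;` separator, and the names `KanjiForm` and `parse_kanji_field` are all hypothetical and only illustrate the proposal above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KanjiForm:
    text: str               # surface form with the {} markers removed
    marked: List[str]       # substrings that were wrapped in {}
    tag: Optional[str]      # e.g. "gikun", or None if the form is untagged
    reading: Optional[str]  # reading the tag is restricted to, if any


# Hypothetical syntax: body text, then an optional [tag] or [tag=reading].
FORM_RE = re.compile(
    r"^(?P<body>[^\[\]]+)(?:\[(?P<tag>[^=\]]+)(?:=(?P<reading>[^\]]+))?\])?$"
)


def parse_kanji_field(field: str) -> List[KanjiForm]:
    """Parse a ';'-separated kanji field written in the proposed notation."""
    forms = []
    for part in field.split(";"):
        match = FORM_RE.match(part.strip())
        if match is None:
            raise ValueError(f"cannot parse kanji form: {part!r}")
        body = match.group("body")
        marked = re.findall(r"\{([^{}]+)\}", body)
        text = body.replace("{", "").replace("}", "")
        forms.append(
            KanjiForm(text, marked, match.group("tag"), match.group("reading"))
        )
    return forms
```

Under these assumptions, `parse_kanji_field("{躊躇}う[gikun]")` yields a single form with text 躊躇う whose marked part is 躊躇, so a site or app could render the brackets around just those kanji.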

JMdictProject commented 1 month ago

Coming late to this topic (I've been travelling a bit.)

I agree with Robin that we should stay with a single [ateji] kanji tag.

Re the [gikun] tag:

parfait8566 commented 1 month ago
  • I realise that it's not a great tag name. As Robin has (tactfully) pointed out, it relates to 熟字訓; not 義訓. It should be changed. [jukujikun] is a bit long; maybe [j-kun]? (Fortunately, it's easy to change these things globally.)

I thought that what he meant was split out the entries based on gikun/jukujikun, not split out the tags themselves. Maybe it was my misunderstanding. A global change would not work because we have both jukujikun and gikun readings. I don't think the difference is meaningful enough to warrant splitting of tags.

  • I'm not at all sure about making it a kanji tag. To me, it's associated with the reading of a kanji compound and not the compound itself.

But it's not really the reading by itself that is gikun or jukujikun, it's the association with a certain kanji form that makes it unique.

Consider 孤児/こじ/orphan and 孤児/みなしご/orphan, where we have a [gikun] tag on the みなしご reading. The entry is merged as it meets the 2/3 criterion. If we move the [gikun] to 孤児, the only way it can be associated with the みなしご reading is to split the entry into two. That seems a pity and rather unnecessary.

This is an interesting example that really shows the flaws with the current tags.
みなし子 [iK] should not have a [gikun] tag; if anything, 子 there is [ateji]. I'm not entirely sure that 孤 by itself is [gikun] either. みなしご seems to be a "proper" (using the word loosely here), albeit rare, reading (https://dictionary.goo.ne.jp/word/kanji/%E5%AD%A4/), but considering that the word originates from 身無し子 I might be wrong.

Consider 自棄/やけ. やけ by itself is not [gikun], 焼け/やけ is "proper".

Whether as a kanji tag or a reading tag you have to either split off entries or lose the [ateji]/[gikun] information (unless we want to use notes).

First of all, it's important to introduce a way to select the specific kanji that [ateji] and [gikun] apply to. There are currently many JMdict entries with misleading [ateji] data. 阿婆擦れ is tagged as [ateji], but only 阿婆 is ateji. There are many more entries where we can't use the [ateji] tag at all because it would give incorrect information. This is another reason [gikun] is better suited as a kanji tag. It's also more intuitive to have both tags behave similarly.

It's also worth discussing if these tags should have reading restrictions. In my opinion it would be easier for both editors and users to have entries merged where appropriate.

This is how I thought a rough implementation might work using {} brackets:

The kanji field for 躊躇う:

{躊躇}う[gikun]

For 梅雨寒:

{梅雨}寒[gikun]

For 自棄:

{自棄}[gikun=やけ];焼け

This would avoid splitting 自棄/じき and 自棄/やけ, which seems preferable to me.
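As a rough illustration of why a reading restriction lets the entry stay merged, the following Python sketch models a tag on a kanji form that applies to only one reading. The `TaggedForm` class, its `only_reading` field, and `tag_applies` are hypothetical names made up for this example; they do not correspond to any existing JMdict field or tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedForm:
    text: str
    tag: Optional[str] = None
    # Hypothetical restriction: None means the tag applies to every reading.
    only_reading: Optional[str] = None


def tag_applies(form: TaggedForm, reading: str) -> bool:
    """Return True if the form's tag should be shown for this reading."""
    if form.tag is None:
        return False
    return form.only_reading is None or form.only_reading == reading


# One merged entry: the [gikun] tag on 自棄 is tied to やけ only.
jiki_yake = TaggedForm("自棄", tag="gikun", only_reading="やけ")
yake = TaggedForm("焼け")
```

With this model, `tag_applies(jiki_yake, "やけ")` is true while `tag_applies(jiki_yake, "じき")` is false, so the [gikun] information survives without a second entry.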