JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Enable usage of [misc] tags (and more) for readings #144

Open parfait8566 opened 1 month ago

parfait8566 commented 1 month ago

This change would allow using [misc] tags (like colloquial, formal, etc.) for specific readings only.

Let's take 昨日, which has きのう and さくじつ as readings. Here is what you might find looking up 昨日/さくじつ:

三省堂国語辞典: 「きのう」の、やや改まった言い方。

新明解国語辞典: 「きのう」の、やや改まった言い方。

明鏡国語辞典: 「きのう」の改まった言い方。

新選国語辞典: 「昨日・本日・明日(みょうにち)」は、改まった会話や文章に用いる漢語表現である。「きのう・きょう・あした\あす」はそれよりもくだけた和語の系列である。

In short, most dictionaries redirect to きのう or explicitly say さくじつ is a more formal reading. Rather than making a new entry for this bit of information, it seems more efficient to just allow the usage of the [misc] tags for individual readings.

This is just one example; there are many entries where a certain reading is unambiguously described by multiple authoritative sources as dated, colloquial, rare and so on.

It might be useful to include dialect tags as well. If you look up 難しい/むつかしい, most dictionaries redirect to むずかしい, sometimes adding that it's 方言的 (dialectal), 古風 (old-fashioned), etc.

There are a lot of entries where such an addition would be very beneficial. Of course, there are also entries where creating a separate entry would still be better and using reading tags might not cut it.
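As a rough illustration of the proposal above, here is a minimal Python sketch of an entry whose readings can each carry their own tags. The `Reading` and `Entry` classes, the `tags` field, and the `"formal"` tag name are all hypothetical and chosen only for illustration; this is not JMdict's actual schema or any existing tool's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Reading:
    """One reading of an entry, with optional per-reading tags (hypothetical)."""
    kana: str
    tags: list[str] = field(default_factory=list)


@dataclass
class Entry:
    """A simplified dictionary entry: kanji forms, readings, and glosses."""
    kanji: list[str]
    readings: list[Reading]
    glosses: list[str]

    def describe(self) -> list[str]:
        """Return one 'kanji/reading [tags]' line per reading."""
        lines = []
        for reading in self.readings:
            label = f" [{', '.join(reading.tags)}]" if reading.tags else ""
            lines.append(f"{self.kanji[0]}/{reading.kana}{label}")
        return lines


# Both readings stay in one entry; only さくじつ is marked as formal.
kinou = Entry(
    kanji=["昨日"],
    readings=[Reading("きのう"), Reading("さくじつ", tags=["formal"])],
    glosses=["yesterday"],
)

for line in kinou.describe():
    print(line)
```

With a structure like this, the register information lives on the reading itself, so no separate entry is needed just to record that さくじつ is the more formal reading.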

robinjmdict commented 1 month ago

I'm very much in support of this – it's something I've suggested in the comments of various entries.

I'd like to see tags for colloquial, dated and dialectal pronunciations.

But like with [ok] (see #103), I think these tags should be reserved for readings that are phonetically similar to the other reading(s) in the entry, e.g. とんがる (尖る) and かいもん (買い物). If the reading has a completely different pronunciation, it's really a separate word and the tag should go on the sense, not the reading. For this reason, I don't think a "formal" tag would be appropriate for readings.

parfait8566 commented 1 month ago

> If the reading has a completely different pronunciation, it's really a separate word and the tag should go on the sense, not the reading.

I believe that by virtue of having an (even slightly) different reading, it's already a different word, at least technically. It's just easier for the end user to see all the readings in one place (in most cases) and it's easier for the editors to work with. I don't think "phonetically similar" is something that should be taken into account. Most entries would probably end up fine, but phonetically similar doesn't necessarily mean they have the same etymology or are actually related either. And should popular and frequent readings that are phonetically dissimilar get split out into their own entries?

From #103:

> We currently use [ok] for two types of readings:
>
>   1. archaic readings that are phonetically similar to their modern forms (typically displayed as 〔古くは「○○」とも〕 notes in the larger kokugos), e.g. へいぎん (平均)
>   2. archaic words that happen to share the same kanji and meaning as a non-archaic word, e.g. ない (地震)
>
> I've never been comfortable with this. These are two very different things.

I don't agree that these two "types" of readings are meaningfully different. Kokugos do sometimes also use notes for phonetically different readings and they do sometimes have entirely separate entries for phonetically similar readings. I don't believe kokugos distinguish between these types of readings and I see no reason JMdict should either. If we were to more faithfully imitate their approach, what matters is not the phonetic similarity but whether they actually have distinct entries for the readings or not (= readings present in notes only). Having said that, I don't feel too strongly about the handling of [ok] readings (unless they're outright completely dropped) since fewer people are gonna come across them.

> For this reason, I don't think a "formal" tag would be appropriate for readings.

I just don't buy that 昨日/さくじつ and 昨日/きのう are so significantly different that they should be split out into separate entries. Again, in my opinion, it makes more sense for both the editors and the end user to see both readings in one place. A formal tag for readings seems very useful and appropriate.

robinjmdict commented 1 month ago

> It's just easier for the end user to see all the readings in one place (in most cases) and it's easier for the editors to work with.

I don't think it is. We used to merge whenever possible and it led to some really messy entries that were hard to read and edit. Lots of sense and reading restrictions. We won't be going back to that approach.

> phonetically similar doesn't necessarily mean they have the same etymology or are actually related either.

It almost always does. If they're not related, they can go in separate entries.

> I don't agree that these two "types" of readings are meaningfully different.

Compare these definitions:

とんが・る 【尖る】 (動ラ五[四])〔「とがる」の転〕 「とがる」を俗にいう語。「先の―・った鉛筆」「―・った口」

ない 【〈地震〉 】 〔「な」は土地,「い」は居の意という〕 大地。「よる」「ふる」を伴って用いられ,地震の意を表す。なえ。

There is definitely a distinction between 1) a synonym that is the result of a sound change and 2) a synonym with a separate origin.

> I don't believe kokugos distinguish between these types of readings and I see no reason JMdict should either.

All the kokugos use readings as headwords – they don't merge readings at all (if you exclude "...とも" notes). I believe JMdict is unique among JE dictionaries in merging etymologically unrelated readings. My suggestion would make JMdict more like the kokugos (without doing away with merged readings altogether). I think it strikes a happy medium.

parfait8566 commented 1 month ago

> I don't think it is. We used to merge whenever possible and it led to some really messy entries that were hard to read and edit. Lots of sense and reading restrictions. We won't be going back to that approach.

Which is why I said in most cases. Splitting out readings based on etymology doesn't seem justified unless you wanted to start including etymology information.

> Compare these definitions:
>
> とんが・る 【尖る】 (動ラ五[四])〔「とがる」の転〕 「とがる」を俗にいう語。「先の―・った鉛筆」「―・った口」
>
> ない 【〈地震〉 】 〔「な」は土地,「い」は居の意という〕 大地。「よる」「ふる」を伴って用いられ,地震の意を表す。なえ。
>
> There is definitely a distinction between 1) a synonym that is the result of a sound change and 2) a synonym with a separate origin.

The more verbose dictionaries that include etymology information (which are a small minority) are naturally gonna make use of it. But they aren't really drawing a distinction in the sense that "same etymology reading = same word" and "different etymology reading = different word".

> I believe JMdict is unique among JE dictionaries in merging etymologically unrelated readings. My suggestion would make JMdict more like the kokugos (without doing away with merged readings altogether). I think it strikes a happy medium.

ひっこ・む[3]【引っ込む】 〔「ひきこむ」の転〕

As you can see, describing the origin and sound changes doesn't mean that it groups up readings into boxes in the way you're saying. In kokugos, just by having a different reading it's already a different word. A few dictionaries might include etymology information referencing other entries, but they're still separate entries. There's no real distinction drawn between readings. I don't see why JMdict should follow this etymological approach, especially when it contains no etymology information itself (excluding rare cases).

I think it's best to keep stuff like 昨日/さくじつ and 昨日/きのう merged together, regardless of etymology.