JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

New tags for preferred or common forms #142

Open parfait8566 opened 1 month ago

parfait8566 commented 1 month ago

The way JMdict currently works makes it hard to correctly interpret the frequency of certain forms. This was also (kinda) discussed in #113.

Let's take 曲がりなりにも for example:

| Form | N-grams | % |
|---|---|---|
| 曲がりなりにも | 50,146 | 59.1% |
| 曲り形にも | 44 | 0.1% |
| まがりなりにも | 34,690 | 40.9% |
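For reference, the percentage column in tables like this one is just each form's share of the summed n-gram counts. A minimal Python sketch, using the counts listed above (the helper name `form_shares` is made up for illustration and is not part of any JMdict tooling):

```python
def form_shares(counts):
    """Return each form's share of the total count, in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {form: round(100 * n / total, 1) for form, n in counts.items()}


# N-gram counts copied from the table for 曲がりなりにも
counts = {"曲がりなりにも": 50146, "曲り形にも": 44, "まがりなりにも": 34690}
shares = form_shares(counts)
```

Here `shares` maps the three forms to 59.1, 0.1 and 40.9 respectively.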

Not quite [uk] as the kanji forms are still almost at 60% of the total, but the hiragana form is still remarkably common. While Sanseido Kokugo Jiten doesn't have them in this particular entry, they often use red parentheses to say that this word is okay to write in kana only. The problem with Sanseido's parentheses is that they can be rather ambiguous: "okay to write in kana only" as in okay to write in hiragana, katakana or both? I'd instead prefer some sort of tag like [cF] (common form) which could be applied to individual forms (including the hiragana forms in the readings section) to flag them as commonly used.

The [uk] tag (usually written in kana only) suffers, to some extent, from the same problems as Sanseido's parentheses. Let's take 辛子 (からし) as an example:

| Form | N-grams | Percentage |
|---|---|---|
| 辛子をつけ | 1,763 | 28.6% |
| からしをつけ | 3,098 | 50.2% |
| カラシをつけ | 1,305 | 21.2% |

Sites and apps could technically infer that since 1. the entry is [uk] tagged and 2. the katakana form isn't first in the readings section, at least one additional commonly used hiragana form must exist. But they still would have no way of knowing if this hiragana form is more or less popular than the katakana form. In this case, からし is considerably more popular than カラシ.
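To make that inference concrete, here is a rough Python sketch of what a consumer can and cannot conclude today. The entry layout (a dict with a `uk` flag and an ordered `readings` list) is a simplified stand-in invented for this example, not JMdict's actual schema, and the function names are hypothetical.

```python
def is_katakana(text):
    """True if every character falls in the Unicode Katakana block (which includes ー)."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in text)


def infer_kana_usage(entry):
    """Summarise what [uk] plus reading order tells a site or app."""
    if not entry["uk"]:
        return {"hiragana_form_implied": False, "most_common_kana": None}
    first_reading = entry["readings"][0]
    return {
        # A non-katakana first reading implies a commonly used hiragana form.
        "hiragana_form_implied": not is_katakana(first_reading),
        # Nothing in the entry ranks the kana forms against each other.
        "most_common_kana": None,
    }


# Hypothetical, simplified view of the 辛子 entry
karashi = {"uk": True, "readings": ["からし", "カラシ"]}
result = infer_kana_usage(karashi)
```

The `most_common_kana` field stays `None` whatever the input, which is the gap being described: the data implies that からし is common but says nothing about how it compares with カラシ.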

Another example (鯨):

| Form | N-grams | Percentage |
|---|---|---|
| 鯨の肉 | 6,072 | 46.8% |
| クジラの肉 | 3,919 | 30.2% |
| くじらの肉 | 2,989 | 23.0% |

Here the hiragana form is less popular than the katakana one.

I propose two new tags: [pF] (preferred form) and [cF] (commonly used form). You could apply [cF] to commonly used kana forms in entries not tagged as [uk]. If a [uk] entry has a decidedly most common form (like からし for 辛子), you could apply [pF] to it and [cF] to the others (カラシ). If there's no such form, you can just apply [cF] to the commonly used forms (クジラ and くじら for 鯨).
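As a rough illustration of how this rule could be applied to n-gram counts, here is a Python sketch. The two cutoffs are assumptions chosen only so that the examples in this post come out as described; the proposal itself does not define numeric thresholds, and `propose_tags` is a hypothetical helper.

```python
COMMON_MIN_SHARE = 0.20  # assumed cutoff for "commonly used"
PREFERRED_RATIO = 1.5    # assumed margin for "decidedly most common"


def propose_tags(counts, kana_forms, is_uk):
    """Map each kana form to "pF", "cF" or None under the assumed cutoffs."""
    total = sum(counts.values())
    ranked = sorted(counts, key=counts.get, reverse=True)
    top = ranked[0]
    runner_up = counts[ranked[1]] if len(ranked) > 1 else 0
    tags = {}
    for form in kana_forms:
        if counts[form] / total < COMMON_MIN_SHARE:
            tags[form] = None
        elif is_uk and form == top and counts[form] >= PREFERRED_RATIO * runner_up:
            tags[form] = "pF"
        else:
            tags[form] = "cF"
    return tags


# Counts taken from the n-gram tables in this post
karashi = propose_tags({"辛子": 1763, "からし": 3098, "カラシ": 1305}, ["からし", "カラシ"], is_uk=True)
kujira = propose_tags({"鯨": 6072, "クジラ": 3919, "くじら": 2989}, ["クジラ", "くじら"], is_uk=True)
magari = propose_tags({"曲がりなりにも": 50146, "曲り形にも": 44, "まがりなりにも": 34690}, ["まがりなりにも"], is_uk=False)
```

With these assumed cutoffs, からし gets [pF] and カラシ gets [cF], both kana forms of 鯨 get [cF], and まがりなりにも gets [cF]. In practice an editor's judgment would stand in for the fixed numbers.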

Marcusjmdict commented 1 month ago

This sounds like a very precise system that would perhaps have been a good way of doing things if we had been doing it from the start (if the ngrams or something had been available), but at this stage, with 200K+ entries, I question whether the incremental improvement that this could result in would be worth the gargantuan amount of work it would require.

parfait8566 commented 1 month ago

Again, I'm not really saying the editors should be forced at gunpoint to re-edit the thousands of entries and work on them nonstop until they're all done. Whoever is willing to go back on old entries and help can do it. If the system were implemented I don't think adding [pF] or [cF] to individual forms would require too much time. And I absolutely think adding these tags (or something similar) would be worth it.

robinjmdict commented 1 month ago

It's not clear to me how a "commonly used form" tag would be useful. I don't think it's particularly noteworthy that まがりなりにも is at ~40% when 曲がりなりにも is even more common. Are users interested in knowing that a word is "often but not usually" written in kana? I'd argue it would be more helpful to know that a word is "very often (≥70%) written in kana" or "overwhelmingly (≥90%) written in kana" but we probably don't need that level of granularity.

Also, what would sites/apps do with this information? How do you convey with labels, symbols, etc. that a hiragana form is a common surface form?

parfait8566 commented 1 month ago

> It's not clear to me how a "commonly used form" tag would be useful. Are users interested in knowing that a word is "often but not usually" written in kana?

That's not really the only or primary use case though.

Here is one of my arguments:

> The problem with Sanseido's parentheses is that they can be rather ambiguous: "okay to write in kana only" as in okay to write in hiragana, katakana or both? I'd instead prefer some sort of tag like [cF] (common form) which could be applied to individual forms (including the hiragana forms in the readings section) to flag them as commonly used.
>
> The [uk] (usually written in kana only), to some extent, suffers from the same problems as Sanseido's parentheses. Let's take 辛子 (からし) as an example:
>
> | Form | N-grams | Percentage |
> |---|---|---|
> | 辛子をつけ | 1,763 | 28.6% |
> | からしをつけ | 3,098 | 50.2% |
> | カラシをつけ | 1,305 | 21.2% |
>
> Sites and apps could technically infer that since 1. the entry is [uk] tagged and 2. the katakana form isn't first in the readings section, at least one additional commonly used hiragana form must exist. But they still would have no way of knowing if this hiragana form is more or less popular than the katakana form. In this case, からし is considerably more popular than カラシ. I propose two new tags: [pF] (preferred form) and [cF] (commonly used form). You could apply [cF] to commonly used kana forms in entries not tagged as [uk]. If a [uk] entry has a decidedly most common form (like からし for 辛子), you could apply [pF] to it and [cF] to the others (カラシ).

English is not my first language so I probably wasn't very clear, sorry. What I meant is that it's hard to correctly interpret the data regarding the frequency of 辛子/からし and its forms. It's [uk], which means it's usually written in kana only. There's also カラシ as [nokanji]. Since カラシ is not first in the reading fields, one can assume that there must be another popular kana form, からし. But there's no way to tell if it's more or less popular than カラシ. In this case, it's by far the most popular way to write 辛子/からし, which is worth noting. Another problem is that, technically, カラシ could be interpreted as being less common than 芥子 (which is [rK]).

> I don't think it's particularly noteworthy that まがりなりにも is at ~40% when 曲がりなりにも is even more common.

As I said before, this is not the only case. But I do still think it's useful even in situations like that. Sanseido uses parentheses even for entries that'd be exactly "often but not usually written in kana only".

> Also, what would sites/apps do with this information? How do you convey with labels, symbols, etc. that a hiragana form is a common surface form?

Well, it depends on how they handle showing forms.

This is Jisho.org's current entry for 辛子/からし:

[image]

If the tags I talked about were introduced, this is how it should look:

[image]

The form being given the most prominence is からし. カラシ is shown before 芥子.
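One way a site could get that ordering is to sort forms by tag. This is only a sketch in Python. The rank table is an assumption (in particular, where untagged forms such as 辛子 should sit is not specified anywhere in this thread), and [pF]/[cF] are the proposed tags, not existing ones.

```python
# Assumed display priority: lower number is shown first.
TAG_RANK = {"pF": 0, "cF": 1, None: 2, "rK": 3}


def display_order(forms):
    """Order (form, tag) pairs by tag priority, keeping the original order within a rank."""
    ranked = sorted(forms, key=lambda pair: TAG_RANK[pair[1]])
    return [form for form, _tag in ranked]


# Forms of the 辛子 entry in a hypothetical source order
forms = [("辛子", None), ("芥子", "rK"), ("からし", "pF"), ("カラシ", "cF")]
ordered = display_order(forms)
```

This puts からし first and カラシ ahead of 芥子. Python's `sorted` is stable, so forms with the same tag keep the order they had in the entry.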

Jitendex for Yomitan uses tables to show forms and their characteristics (a better approach in my opinion). This version still has 芥子 as [oK]:

[image]

I don't think it should be too hard to hypothetically introduce these tags to Jitendex, but @stephenmk knows best obviously.

stephenmk commented 1 month ago

Usage frequencies in jmdict are currently described using somewhat broad strokes. I think this is fine, because the sources we're working with only give us a general idea of usage trends. There's no authoritative source (afaik?) that decrees からし must be written in hiragana or katakana. The google corpus from 2007 isn't flawless.

I'm not convinced that there's anything important at stake here. If jmdict influences someone to write くじら instead of クジラ, does it matter? Inevitably, the way that people choose to write will be influenced by their own exposure to the language or (perhaps more likely) whatever their IME chooses to display first.

parfait8566 commented 1 month ago

> Usage frequencies in jmdict are currently described using somewhat broad strokes. I think this is fine, because the sources we're working with only give us a general idea of usage trends.

The point of discussion isn't really whether we should be precise down to the smallest decimal points or use somewhat broad strokes. It's obvious that the latter is the right approach. The problem is that currently it's very hard to interpret the data correctly. There's no way to tell users that からし is by far the most common way to write 辛子/からし. Jisho.org even assumes that カラシ comes after the [rK] 芥子.

> There's no authoritative source (afaik?) that decrees からし must be written in hiragana or katakana.

There's no authoritative source that decrees that 芥子 should not be used either. It's still useful for end users to know that it's relatively rare. JMdict doesn't really linguistically prescribe anything and new tags wouldn't change that.

> I'm not convinced that there's anything important at stake here. If jmdict influences someone to write くじら instead of クジラ, does it matter? Inevitably, the way that people choose to write will be influenced by their own exposure to the language or (perhaps more likely) whatever their IME chooses to display first.

Again, this is not a matter of how we want people to use the language, and I never mentioned or implied that anywhere. What's of more interest is recording and describing how the language is used. The forms and their relative frequencies are quite obviously covered by that. Being an online project also makes the work easier. It's also useful for end users, since they can pay more attention to the most frequent forms. If recording this stuff was worthless, why have rare tags and priority tags at all? I'm not asking for a tag for every percentage. Just two tags which are very straightforward and easy to understand.

JMdictProject commented 1 month ago

An interesting discussion.

When I first put in the [uk] tag over 20 years ago I just had in mind the need to flag that some entries, despite there being kanji forms available, were usually written in kana. からし was the common form; not 辛子. It's turned out to be useful, and most apps/sites respond to it (I'm surprised jisho.org doesn't, but there is a new version coming. WWWJDIC puts the kana version first: https://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1MDJ%BF%C9%BB%D2). There are about 9,000 entries with the [uk] tag.

I guess it would have been better to have two tags: [uk] (usually in kana, for cases where 60+% are in kana), and [ofk] for 40-60% kana. That might end up with another 2k entries tagged. I'm not really sure it would be worth the effort.
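For what it's worth, that two-tag split amounts to a simple threshold check on the kana share. A small Python sketch follows. [ofk] is the hypothetical tag floated above, `kana_tag` is a made-up helper, and assigning exactly 60% to [uk] is an assumption, since the two ranges as stated overlap at that point.

```python
def kana_tag(kana_count, total_count):
    """Return "uk" for a kana share of 60% or more, "ofk" for 40-60%, else None."""
    share = kana_count / total_count
    if share >= 0.60:
        return "uk"
    if share >= 0.40:
        return "ofk"
    return None


# Kana counts summed from the n-gram figures quoted earlier in the thread
karashi_tag = kana_tag(3098 + 1305, 1763 + 3098 + 1305)  # からし + カラシ
magari_tag = kana_tag(34690, 50146 + 44 + 34690)         # まがりなりにも
```

With those counts, 辛子 lands in [uk] (about 71% kana) and 曲がりなりにも in [ofk] (about 41% kana).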

parfait8566 commented 1 month ago

My suggestion isn't really for an [ofk] tag. This is what I said:

> While Sanseido Kokugo Jiten doesn't have them in this particular entry, they often use red parentheses to say that this word is okay to write in kana only. The problem with Sanseido's parentheses is that they can be rather ambiguous: "okay to write in kana only" as in okay to write in hiragana, katakana or both? I'd instead prefer some sort of tag like [cF] (common form) which could be applied to individual forms (including the hiragana forms in the readings section) to flag them as commonly used.

I think the discussion boils down to: is it noteworthy for users of the dictionary that e.g. the most common way to write 辛子/からし is からし? That 曲がりなりにも can often be written as まがりなりにも? Or that 鯨/くじら is commonly written as both くじら and クジラ? I need to stress that my point isn't "you can only write からし" or "you should use クジラ instead of くじら". JMdict, by virtue of being an online dictionary, lends itself really well to recording this sort of frequency information and I believe it'd be very useful.

> I'm not really sure it would be worth the effort.

I think it'd be worth it if we were to gradually start using the new tags. Definitely not saying we should review all the current entries one by one.

Kimtaro commented 1 month ago

I've been embarrassed by how the current Jisho handles [uk] for a long while, and the new version takes it into account for the main form of the headword. I am also playing around with a table format to show all forms, but it's an unfinished thought at this point and I keep changing it on a weekly basis. Should a new tag be adopted I promise I won't take 15 years to handle it.

Here's a screenshot from the latest development version.

[image: Screenshot 2024-10-14 at 13 04 35]