JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Katakana first for species etc. #113

Closed Marcusjmdict closed 5 months ago

Marcusjmdict commented 9 months ago

I'm not really happy with the situation we have on species entries where the katakana form is the most common but is listed second. On WWWJDIC, there's a hack of some sort that identifies these entries and puts the katakana version first, but on Jisho, for instance, there is not, and the hiragana and kanji are listed first, with the actual preferred form listed in small type under "Other forms". See チンアナゴ for example.

Could we start listing the katakana reading as the first reading in these entries? I also considered suggesting a tag for "preferred form", but that seems needlessly complicated.

hlorenzi commented 9 months ago

I've come across similar issues while working on my frontend.

For チンアナゴ in particular, I detect the presence of the [uk] sense tag and bring the kana-only headings to the front of the list, which seems to solve the issue.
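A minimal sketch of this kind of reordering, in Python. The `Entry` shape and the names `is_kana_only` and `order_headings` are hypothetical, not taken from any actual frontend or from the JMdict schema, and the kanji form 狆穴子 in the sample entry is an assumption for illustration.

```python
from dataclasses import dataclass, field


def is_kana_only(text: str) -> bool:
    """True if every character is hiragana or katakana (U+3040..U+30FF, incl. ー)."""
    return bool(text) and all("\u3040" <= ch <= "\u30ff" for ch in text)


@dataclass
class Entry:
    # Hypothetical, simplified entry: heading strings in dictionary order,
    # plus one set of tags per sense.
    headings: list[str]
    sense_tags: list[set[str]] = field(default_factory=list)


def order_headings(entry: Entry) -> list[str]:
    """Bring kana-only headings to the front when any sense is tagged uk."""
    if not any("uk" in tags for tags in entry.sense_tags):
        return list(entry.headings)
    # sorted() is stable, so the original order is kept within each group.
    return sorted(entry.headings, key=lambda h: not is_kana_only(h))


# Sample entry; the kanji form and the original order are assumptions.
chin_anago = Entry(headings=["狆穴子", "チンアナゴ"], sense_tags=[{"uk"}])
print(order_headings(chin_anago))  # ['チンアナゴ', '狆穴子']
```

Note that this only separates kana-only headings from the rest. It does not choose between a hiragana and a katakana heading when both are kana-only.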

But for something like 斑鳩 いかる, the hiragana heading is marked as being more common than the katakana one, so we get the hiragana version shown first. Which maybe is correct? But I don't have confirmation that this is the correct interpretation of JMdict data.

JMdictProject commented 9 months ago

The reason the katakana "nokanji" forms have usually been placed at the end of the list is to separate them from the readings that are associated with the kanji forms. Ideally, they should be in a separate category. Apps and servers can display the database components however they like, and WWWJDIC moves the kana forms to the front for (uk) entries and leads with the katakana one for species names (I think of it as a heuristic rather than a hack.)

That said, I guess there is no reason the katakana form can't lead in the case of species names and any other entries where it is the most common form. It shouldn't be universal - 泥棒 and どろぼう are more common than ドロボー.

If there is general agreement on the approach, it may be possible to do some reordering using the bulk updater. It can't reorder elements but it might be possible to achieve it by deleting and adding. I'll have to experiment a bit.

Kimtaro commented 9 months ago

I don't have an opinion on how these should be ordered in JMdict, but for what it's worth, the new version of Jisho that I'm actively working on has the same bring-uk-readings-to-the-front heuristic.

Marcusjmdict commented 9 months ago

To clarify the problem for those not aware, it's not the case with entries tagged [uk] that it's always the [nokanji] reading (usu. katakana), if there's one present, that's the preferred form. For such entries there's currently no way to infer from our data which of the readings is the preferred one.

JMdictProject commented 9 months ago

> it's not the case with entries tagged [uk] that it's always the [nokanji] reading (usu. katakana),

True, there are entries with [nokanji] readings where that is not the most common kana form.

I don't have a problem with moving away from the practice of having [nokanji]-tagged readings at the end of the list, and having a more general frequency-based ordering.

robinjmdict commented 9 months ago

I don't object to putting [nokanji] forms first but I'm not sure it's appropriate in cases like this:

Form N-grams %
くじら 765,850 32.4
クジラ 767,918 32.5
鯨 830,457 35.1

The word is [uk] and the katakana form is more common than the hiragana form but the kanji form is more common than either of them.

The katakana form can't really be described as the "preferred form" here.

Also, for words with multiple senses, there may be senses that aren't [uk]. アナグマ dominates when referring to the animal but the shogi sense is always written in kanji.

Form N-grams %
あなぐま 10,789 9.9
アナグマ 45,310 41.6
穴熊 52,709 48.4

When a word has multiple kana forms and/or senses, I don't think a [uk] tag on its own is sufficient justification for giving greater prominence to a kana form.
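The percentages in the tables above are each form's share of the combined n-gram count. A small sketch, with hypothetical function names, that reproduces the アナグマ figures and picks the single most common form:

```python
def shares(counts: dict[str, int]) -> dict[str, float]:
    """Percentage share of each form, rounded to one decimal place."""
    total = sum(counts.values())
    return {form: round(100 * n / total, 1) for form, n in counts.items()}


def most_common_form(counts: dict[str, int]) -> str:
    """The form with the highest raw count."""
    return max(counts, key=counts.get)


# N-gram counts quoted in the table above.
anaguma = {"あなぐま": 10_789, "アナグマ": 45_310, "穴熊": 52_709}

print(shares(anaguma))            # {'あなぐま': 9.9, 'アナグマ': 41.6, '穴熊': 48.4}
print(most_common_form(anaguma))  # 穴熊
```

Counts like these are per surface form, not per sense, so on their own they cannot show that アナグマ dominates for the animal while the shogi sense is written in kanji.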

Marcusjmdict commented 8 months ago

> I don't object to putting [nokanji] forms first but I'm not sure it's appropriate in cases like this: [鯨/くじら/クジラ]

There's a handful of entries like 鯨, I suppose, where it's the combined kana forms that beat out the kanji while neither kana form is on its own more common than the kanji, but they are sort of fringe cases. I don't think we need to create an exception for them because 1) we already have more than enough complicated rules and exceptions, and remembering all of them is enough of a headache as it is, and 2) I don't see how the exception would actually be useful. Assuming dictionary app makers interpret [uk] to broadly mean "the first kana surface form should be most prominently displayed", which I think is the most reasonable interpretation, then making an exception in cases like 鯨 would only lead them to instead display the least common form of the 3.
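To make the fringe case concrete, here is a rough sketch using the n-gram counts quoted earlier in the thread for 鯨/くじら/クジラ. The function names are hypothetical, and the kana test is a simple Unicode-range check, not anything JMdict itself defines.

```python
def is_kana_only(text: str) -> bool:
    """True if every character is hiragana or katakana (U+3040..U+30FF)."""
    return bool(text) and all("\u3040" <= ch <= "\u30ff" for ch in text)


def is_fringe_case(counts: dict[str, int]) -> bool:
    """Kana forms combined outnumber the kanji forms, yet the top single form is kanji."""
    kana_total = sum(n for form, n in counts.items() if is_kana_only(form))
    kanji_total = sum(n for form, n in counts.items() if not is_kana_only(form))
    top_form = max(counts, key=counts.get)
    return kana_total > kanji_total and not is_kana_only(top_form)


def best_kana_form(counts: dict[str, int]) -> str:
    """Most common kana-only form, i.e. the one a [uk]-aware app would want first."""
    kana = {form: n for form, n in counts.items() if is_kana_only(form)}
    return max(kana, key=kana.get)


kujira = {"くじら": 765_850, "クジラ": 767_918, "鯨": 830_457}

print(is_fringe_case(kujira))  # True
print(best_kana_form(kujira))  # クジラ
```

Here the kana forms total 1,533,768 against 830,457 for the kanji, and クジラ is the more common of the two kana forms, so listing it first puts the better kana form in front for apps that lead with kana on [uk] entries.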

> When a word has multiple kana forms and/or senses, I don't think a [uk] tag on its own is sufficient justification for giving greater prominence to a kana form.

I think I agree, unless the secondary senses are obscure or borderline obscure.

JMdictProject commented 5 months ago

We've accepted that it's OK for katakana "[nokanji]" forms to go first when it's clear they are the most common form. I think this issue can be closed.