JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Add preferred-form/-reading and rare-form/-reading tags #114

Closed parfait8566 closed 2 months ago

parfait8566 commented 11 months ago

My idea of these tags would be, to begin, something like pF (preferred-form) and rF (rare-form). They're respectively for a relatively popular form and a relatively rare form. They each can be assigned to multiple forms (regardless of being kanji or reading) and can be bound to specific senses.

Obviously, this would address #113. Sure, you could bulk edit JMdict data or have JMdict front-ends show the readings first if the term is uk, but that's quite imperfect (as noted by hlorenzi). It's annoying to have to consider the position of the term in the reading box on every edit when a tag is much simpler and absolute. And what should happen when there are multiple readings, which might not all be common? If you just arbitrarily show the kana forms first, you risk providing wrong information.

rF should replace all the rK tags and some rk tags (to be more precise, only those that aren't used to denote rare readings). rK has two quirks: it requires the rare form to have unique kanji (compared to the common form) and an incredibly low percentage of use. I don't feel that either of these requirements is necessary or useful. The average user of a dictionary wants to know which terms are popular, which are just normal and which are rare (regardless of whether the kanji present are unique compared to the common terms). Note that I do think this is interesting metadata and should be preserved (by adding this data to a custom-made setting for rF when bulk replacing, in addition to the irregular tags). Still, I believe a new, more general and lenient rare-form tag would be tremendously helpful. The cutoff is quite subjective, but assuming we have only 2 terms, for example, requiring one of them to be <15% to get tagged rare is better than requiring <3%.

The second set of tags would be pR (preferred-reading) and rR (rare-reading). rR would replace some rk tags. Beginners are often confused by multiple readings and end up learning the one that is relatively quite rare (and/or formal), only realizing it later when speaking with Japanese people or listening to native content.

Additionally, all these tags could be mapped to specific senses.

I understand if you aren't interested in all this, but I still think it's at least worth considering for future JMdict versions. Thank you.

parfait8566 commented 11 months ago

From #113:

Form N-grams %
くじら 765,850 32.4
クジラ 767,918 32.5
(kanji form) 830,457 35.1

The word is [uk] and the katakana form is more common than the hiragana form but the kanji form is more common than either of them.

The katakana form can't really be described as the "preferred form" here.

I originally assumed [uk] words would have at least one preferred form, but given cases like this I think there's a need for a cF/common-form (not preferred, but not rare either) tag as well. Neither of the kana forms can be described as the preferred form, so just tagging them with cF/common-form would be clear enough for applications to display them before the kanji form. There might be other use cases for the tag as well. I don't think cR/common-reading should be needed unless I'm missing something.
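To make the arithmetic behind this case concrete, here is a minimal sketch that recomputes the shares from the counts in the table above and adds up the two kana forms. The function names and the script labels are made up for illustration, and the kanji form is given a placeholder label because its spelling is not shown in the table.

```python
# Counts copied from the table above. The script labels are supplied by
# hand, and "(kanji form)" is a placeholder, not the actual spelling.
COUNTS = [
    ("くじら", "hiragana", 765_850),
    ("クジラ", "katakana", 767_918),
    ("(kanji form)", "kanji", 830_457),
]


def shares(rows):
    """Return {form: percentage of the total count}, rounded to one decimal."""
    total = sum(count for _, _, count in rows)
    return {form: round(100 * count / total, 1) for form, _, count in rows}


def kana_share(rows):
    """Return the combined percentage of all hiragana and katakana forms."""
    total = sum(count for _, _, count in rows)
    kana = sum(count for _, script, count in rows if script != "kanji")
    return round(100 * kana / total, 1)


print(shares(COUNTS))
print(kana_share(COUNTS))
```

The per-form shares come out as 32.4, 32.5 and 35.1, matching the table, while the two kana forms together account for 64.9% of occurrences. No single kana form outranks the kanji form, yet kana spellings as a group are the majority, which is the situation a common-form tag would be meant to capture.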

Also, for words with multiple senses, there may be senses that aren't [uk]. アナグマ dominates when referring to the animal but the shogi sense is always written in kanji.

Form N-grams %
あなぐま 10,789 9.9
アナグマ 45,310 41.6
穴熊 52,709 48.4

When a word has multiple kana forms and/or senses, I don't think a [uk] tag on its own is sufficient justification for giving greater prominence to a kana form.

This case would be dealt with by tagging "穴熊" as pF/preferred-form mapped to the shogi sense and "アナグマ" as pF/preferred-form mapped to the animal sense, leaving "あなぐま" alone (or tagging it as rF/rare-form).
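As a rough sketch of how an application might consume sense-bound tags like these, the following orders the forms of one entry for a given sense. Everything here is hypothetical: the class names, the `pF`/`rF` tag names (which are only proposed in this thread), and the sense labels `"animal"` and `"shogi"` (JMdict itself numbers senses). It is not how JMdict or any existing front-end works.

```python
from dataclasses import dataclass, field


@dataclass
class FormTag:
    tag: str                         # proposed tag name, e.g. "pF" or "rF"
    senses: frozenset = frozenset()  # empty set means "applies to every sense"


@dataclass
class Form:
    text: str
    tags: list = field(default_factory=list)


# Preferred forms sort first, rare forms last, untagged forms in between.
RANK = {"pF": 0, "rF": 2}
DEFAULT_RANK = 1


def rank(form, sense):
    """Rank a form for one sense using the first tag that applies to it."""
    for t in form.tags:
        if not t.senses or sense in t.senses:
            return RANK.get(t.tag, DEFAULT_RANK)
    return DEFAULT_RANK


def forms_for_sense(forms, sense):
    """Return the forms ordered for display under the given sense."""
    # sorted() is stable, so forms with equal rank keep their entry order.
    return sorted(forms, key=lambda f: rank(f, sense))


ANAGUMA = [
    Form("穴熊", [FormTag("pF", frozenset({"shogi"}))]),
    Form("アナグマ", [FormTag("pF", frozenset({"animal"}))]),
    Form("あなぐま", [FormTag("rF")]),
]

print([f.text for f in forms_for_sense(ANAGUMA, "animal")])
print([f.text for f in forms_for_sense(ANAGUMA, "shogi")])
```

With this tagging, the animal sense lists アナグマ first and the shogi sense lists 穴熊 first, while あなぐま sorts last in both.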

JMdictProject commented 10 months ago

As I commented on #109 I don't think we need more tags for the kanji and kana forms. The guidelines for using the current tags could do with some clarification.

A particular problem with "preferred" tags would be to decide who does the preferring. References often disagree on matters such as which kanji to use or which okurigana to use. And often the Japanese public doesn't follow either. I'm comfortable with ordering forms on frequency and showing the forms which appear in references.

parfait8566 commented 10 months ago

> A particular problem with "preferred" tags would be to decide who does the preferring. References often disagree on matters such as which kanji to use or which okurigana to use. And often the Japanese public doesn't follow either. I'm comfortable with ordering forms on frequency and showing the forms which appear in references.

Regarding pF/preferred-form and cF/common-form, I think they'd be useful additions for [uk] entries in particular. As mentioned in #113, there's currently no way to detect from JMdict data which kana-only forms are actually preferred or at least common. Always showing the kana forms first doesn't really work either.

As I said, rK has a few quirks or limitations in my opinion. rK basically translates to "this kanji form is 1. rare (<4%) and 2. contains kanji not present in the common forms". I don't believe the second part is very useful for dictionary users: if a form is rare, shouldn't it be noted regardless of the kanji?

Form Percentage
A 85%
B 6%
C 9%

Unless I'm missing something, neither "B" nor "C" would be tagged rK because they're >4%. I'm not sure this is useful for dictionary users either. These forms are still relatively rare, so I believe the <4% requirement is a bit too strict.
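The effect of the cutoff on the A/B/C example can be sketched as follows. The `rare_forms` helper is hypothetical, it looks only at the percentage and ignores rK's unique-kanji condition, and the 15% cutoff is just an illustrative looser value, not an agreed threshold.

```python
def rare_forms(percentages, threshold):
    """Return the forms whose usage share is below `threshold` percent."""
    return [form for form, pct in percentages.items() if pct < threshold]


# Percentages from the A/B/C table above.
EXAMPLE = {"A": 85, "B": 6, "C": 9}

print(rare_forms(EXAMPLE, 4))   # strict cutoff, as described for rK
print(rare_forms(EXAMPLE, 15))  # looser cutoff, purely illustrative
```

With the 4% cutoff nothing is flagged as rare, whereas the looser cutoff flags both "B" and "C".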

My arguments for pR/preferred-reading and rR/rare-reading are similar. As for by what standard you'd need to use them, I think that if 1. multiple monolingual dictionaries redirect from one reading to another and 2. Youglish results reliably prefer one reading over the other you have good reason to use these tags. Other "sources" could be (multiple) anime, manga, audiobooks and so on. Maybe rR/rare-reading could have a "formal" option. Sometimes monolingual dictionaries have both readings, but one is tagged as formal which would suggest it's less used in everyday conversations. Again, all of these tags could be mapped to specific senses which would be useful as well in my opinion.

parfait8566 commented 10 months ago

@stephenmk @Marcusjmdict @robinjmdict

Sorry for the tag, feel free to ignore it or say so if you find it annoying. I'm curious what other editors think of this proposal.

stephenmk commented 10 months ago

Regarding the "quirks" of the [rK] tag: a couple of years ago we would always display rare okurigana forms like 呼捨て without qualification as long as they were recorded in Daijirin or Daijisen. I recently proposed that we should add some more rare tags so that we could appropriately tag these forms. But it seems that everyone is now fine with just hiding them with [sK] tags, and such tags are not needed.

For the most part, I'm happy with the current frequency-related metadata tags available to us. If there's an entry that's causing trouble or seems particularly confusing, I think we should try to resolve the problem with the tools available to us. When a pattern emerges that suggests we are in need of a new tool, then we can begin to consider that option. The only example I see above is for 穴熊, which I think we have handled fine with the "usually kana" tags on the first two senses.

Edit: Sorry, there's also the くじら example. That entry also seems fine to me.

parfait8566 commented 10 months ago

> If there's an entry that's causing trouble or seems particularly confusing, I think we should try to resolve the problem with the tools available to us. When a pattern emerges that suggests we are in need of a new tool, then we can begin to consider that option.

Taking くじら as an example:

Form N-grams %
くじら 765,850 32.4
クジラ 767,918 32.5
(kanji form) 830,457 35.1

JMdict currently has no way to tag hiragana forms as common, so the user will only see "クジラ". [uk] is not enough alone, as it might mean:

In this case, the third option is the correct one, but it's not obvious at all and I think the addition of new tags would help make the situation more clear.

I can't think of other examples right now, but I'm sure there are many cases like this.

Regarding [pR]/preferred-reading and [rR]/rare-reading, I think it'd be helpful to explicitly draw users' attention to the more common readings.

parfait8566 commented 10 months ago

Found one example for the usefulness of new tags (for readings)

昨日 has two readings:

There are many words like this. With the tools available right now, the best option would be to add a usage note, but I think new tags would be better.

JMdictProject commented 10 months ago

Our usual approach to something like 昨日/さくじつ would be to have it in a separate entry with a [form] tag. I don't think having such tags associated with readings is appropriate. (In this case, the JEs and most kokugos don't mention it being more formal than きのう.)

parfait8566 commented 10 months ago

> Our usual approach to something like 昨日/さくじつ would be to have it in a separate entry with a [form] tag. I don't think having such tags associated with readings is appropriate.

I disagree. There are many entries where one reading is "preferred"/considered more popular and the other is seen as formal or stiff. Having this information directly associated with readings would be useful for users as they'd be guided to the more popular reading.

> (In this case, the JEs and most kokugos don't mention it being more formal than きのう.)

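三省堂国語辞典: 「きのう」の、やや改まった言い方。 (A somewhat formal way of saying きのう.)
新明解国語辞典: 「きのう」の、やや改まった言い方。 (A somewhat formal way of saying きのう.)
明鏡国語辞典: 「きのう」の改まった言い方。 (A formal way of saying きのう.)
新選国語辞典: 「昨日・本日・明日(みょうにち)」は、改まった会話や文章に用いる漢語表現である。「きのう・きょう・あした\あす」はそれよりもくだけた和語の系列である。 (昨日, 本日 and 明日 (みょうにち) are Sino-Japanese expressions used in formal conversation and writing; きのう, きょう and あした/あす belong to the more casual native Japanese series.)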

And while the rest of the monolingual dictionaries might not explicitly say さくじつ is more formal, they all redirect to きのう.

parfait8566 commented 10 months ago

Another example which might be useful:

Form N-grams %
曲がりなりにも 50,146 59.1%
曲り形にも 44 0.1%
まがりなりにも 34,690 40.9%

"曲がりなりにも" is not really [uk], but it is found quite commonly in kana-only. This seems like useful information to me and currently the only way we could communicate this is using notes.

parfait8566 commented 2 months ago

Dealing with the frequency of various forms: #142

Dealing with reading tags: #144