Criteria for including kanji forms in entries

JMdictProject commented 2 years ago

I would like to include in the Editorial Policy at https://www.edrdg.org/wiki/index.php/Editorial_policy#Kanji/Special-Character_Forms some guidelines for the inclusion or exclusion of less-common kanji surface forms. There has been quite a bit of debate on this during the editing of entries, and it would be good to get a clear position in place. The policy would apply to forms which differ in the kanji used, or in kana usage in terms of okurigana, alternative readings, katakana/hiragana usage, etc.

What I have in mind is stating that a surface form can be included if it meets at least one of the following criteria:

1. it appears in a dictionary (国語辞典, 和英辞典, etc.) or a major glossary;
2. it has a Google n-gram count of at least 2% of the total n-gram counts of all the surface forms possible for the entry;
3. it has a Google n-gram count of at least 500.

The third criterion - an n-gram count of at least 500 - may seem low, but I like to provide as much support as possible for text-glossing systems, and that seems a reasonable number of occurrences to make it worth including a form. If we only have, say, a 2% criterion it could mean a common term with around 1M counts would see a variant form with "only" 15,000 hits being excluded as it wasn't over 2%. I think we need to take both proportions and counts into consideration.
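
To make the interaction between the proportional and absolute thresholds concrete, here is a rough sketch of the three proposed criteria as a single check. The function name, its parameters, and the sample counts are illustrative assumptions, not part of any existing JMdict tooling; the total is assumed to include the form being tested.

```python
from typing import Mapping, Optional


def qualifies(
    form: str,
    counts: Mapping[str, int],
    in_reference: bool = False,
    min_share: float = 0.02,
    min_count: Optional[int] = 500,
) -> bool:
    """Return True if `form` meets at least one of the proposed criteria."""
    # Criterion 1: attested in a dictionary or major glossary.
    if in_reference:
        return True
    count = counts.get(form, 0)
    total = sum(counts.values())
    # Criterion 2: share of the entry's total n-gram count.
    if total > 0 and count / total >= min_share:
        return True
    # Criterion 3: absolute n-gram count (pass min_count=None to disable).
    return min_count is not None and count >= min_count


# Hypothetical counts echoing the example above.
counts = {"common form": 1_000_000, "variant": 15_000}
print(qualifies("variant", counts))                  # absolute threshold applies
print(qualifies("variant", counts, min_count=None))  # proportion only
```

With only the 2% rule the variant is rejected (15,000 is roughly 1.5% of the total), whereas the 500-count rule admits it.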

Marcusjmdict commented 2 years ago

The two criteria I personally follow when deciding whether to add something:

In a major dictionary/glossary OR
Has a Google n-gram count of at least 5-15% of the total

I would not like to see an absolute number criterion, whether it's for an n-gram count of 500 or 100K.

I understand that having more surface forms is good for text-glossing but in most cases, it's something I think should be left up to the text-glosser. I think we should think twice before cluttering entries with obscure surface forms not in other references. They make the entries look messy and harder to interpret.

stephenmk commented 2 years ago

If I'm understanding things correctly, it seems the contention over オンナノコ stems from difficulties with sentence parsers. If I were to search オンナノコ on jisho.org or scan the word with Yomichan, these systems would have no trouble converting the word to おんなのこ and finding the corresponding entry in JMdict, even though JMdict does not contain オンナノコ. However, if I were to paste a full sentence containing オンナノコ into jisho.org, the word boundaries would not be as well-defined and its parser may have trouble recognizing the word since it is not explicitly contained within the database.
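
As a minimal sketch of the single-word case, katakana can be folded to hiragana by shifting code points before the lookup. This is only an illustration of the idea, not how jisho.org or Yomichan actually implement it; `kata_to_hira`, `lookup`, and the one-entry lexicon are hypothetical.

```python
from typing import Mapping, Optional


def kata_to_hira(text: str) -> str:
    """Map katakana ァ..ヶ (U+30A1..U+30F6) to hiragana by subtracting 0x60."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


def lookup(term: str, lexicon: Mapping[str, str]) -> Optional[str]:
    """Try the term as written, then its hiragana-folded form."""
    if term in lexicon:
        return lexicon[term]
    return lexicon.get(kata_to_hira(term))


# Hypothetical lexicon keyed by reading; it has no オンナノコ key.
lexicon = {"おんなのこ": "女の子"}
print(lookup("オンナノコ", lexicon))
```

This only works once the word has been isolated. It does nothing to help a parser find the word boundaries in a full sentence, which is the harder problem described above.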

I know there are projects (such as sudachi) and dictionaries (UniDic and NEologd seem to be popular right now) dedicated to parsing and tokenizing full sentences. I don't have much experience with these tools, but I know they have some relational capability (e.g., they can recognize that ひしめいていた is a kana form of 犇めく). Is the idea behind this proposal that JMdict should compete in this space as a viable alternative or supplementary dictionary?

On the issue of clutter, I think it should be left up to the presentation layer to decide how many surface forms to display to the user (provided that rare and irregular forms are appropriately tagged in JMdict). If the clutter caused by having too many surface forms is a problem, then it's already a problem under the current policy; there are plenty of common entries which contain a variety of obscure surface forms and/or reading restrictions (タバコ, 箱, 買い物, 良い, etc.). To the extent that this is already a problem, I'm not sure it would be significantly worsened by a slightly relaxed inclusion policy (at least as far as the addition of novel kanji and okurigana usages goes).

If 登り旗 were a completely new word and not a variant of のぼり旗, I think there would be a good case for adding it to JMdict. It gets 857 n-gram counts and there's ample evidence that it's currently in use. I came across it while reading a novel (I had never seen any form of のぼり旗), so surely other JMdict users will run into it too. It seems odd to me that we should exclude it because of its rarity relative to an alternate form.

For kanji forms which merely differ in kana usage (e.g. おんなのこ and オンナノコ, or 買物カゴ and 買物かご), I'm more ambivalent. Provided that word boundaries are established, tools such as Yomichan or jisho.org currently have no problem converting the kana and finding the corresponding entry in JMdict.

I have another question regarding the proposed criteria. If a variant form does not appear in a dictionary or major glossary, how would we verify that an alternative form is truly an alternative form? What sort of evidence would suffice?

For example, our entry for しみじみ now has three kanji forms which all appear in major dictionaries: 沁み沁み, 染み染み, and 沁沁. A variant form, 沁々, gets 750 n-grams and would be included under the proposed criteria. Probably no evidence is required since it seems fairly obvious. How about, by analogy, 染染 and 染々 ? Both get over 900 n-grams. I have no idea if they are truly forms of しみじみ, however. It seems to me that some sort of extra evidence would be required to justify their inclusion, and I think this needs to be made clear in the criteria that n-gram counts alone aren't always enough (although perhaps this is already implied and widely understood).
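
One way to see why 沁々 feels "fairly obvious" while 染々 does not: expanding the iteration mark 々 turns 沁々 into a form already attested in the entry, but 染々 expands to 染染, which is not. The sketch below assumes the simple case where 々 repeats only the single preceding character; the helper name and the `attested` set (taken from the forms listed above) are illustrative.

```python
def expand_iteration_marks(form: str) -> str:
    """Replace each 々 with the character immediately before it."""
    out = []
    for ch in form:
        if ch == "々" and out:
            out.append(out[-1])
        else:
            out.append(ch)
    return "".join(out)


# Kanji forms of しみじみ that appear in major dictionaries, per the text above.
attested = {"沁み沁み", "染み染み", "沁沁"}

for candidate in ("沁々", "染々"):
    expanded = expand_iteration_marks(candidate)
    print(candidate, expanded, expanded in attested)
```

A mechanical check like this can only confirm that a candidate is a notational variant of an attested form. It says nothing about whether 染染 or 染々 are actually read しみじみ, so separate evidence would still be needed for those.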

Marcusjmdict commented 2 years ago

To the extent that this is already a problem, I'm not sure it would be significantly worsened by a slightly relaxed inclusion policy (at least as far as the addition of novel kanji and okurigana usages goes).

I believe we do have a problem with clutter already and I would prefer to move towards less of it, not more, whether it's incremental or not. I do think that Jim's 3rd suggestion could make many entries considerably more cluttered if followed to the letter.

stephenmk commented 2 years ago

I don't have years of experience contributing to JMdict, and so I think it's not unlikely that I'm missing some valuable insights. This is just my humble opinion; thank you for taking the time to humor it.

If clutter is a problem, then I think it should be addressed seriously. We could clear huge swaths of information and keep only the most essential kanji forms and readings, but to me this doesn't seem desirable or necessary. Instead, why not make it clear to application developers which forms are essential? Allow them to decide how to format and present the data to users in a way that is not cluttered.

The idea of "hidden" fields has been discussed somewhat (#46). One of the objections was a concern that developers may not immediately pick up on the idea. JMdict is an amazing resource that I imagine will long outlast all of the programs which currently use it, so I don't think we should allow short-term concerns to restrict its long-term potential, its range of applications, or the creativity of the developers who use it.

However, I understand the need to be pragmatic. Couldn't EDRDG's systems be configured to produce a new XML file (say "JMDICT_EXTRA" or something) which contains the hidden fields while also continuing to produce a (deprecated?) JMDICT XML file which does not? Systems which fetch JMDICT daily could then continue to operate without requiring any changes. Or, since a next-gen JMdict is in the works and would require application developers to make changes anyway, maybe the "hidden" ("low-priority"? "obscure"? "rare"?) fields could be added into that.
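
A rough sketch of how the two-file idea might work: treat the file with hidden forms as the master and derive the regular file by filtering. The element names (`entry`, `k_ele`, `keb`, `ke_inf`, `r_ele`, `re_inf`) follow the existing JMdict XML, but the `hidden` marker, the sample entry, and its sequence number are made up for illustration. No such tag exists in JMdict today.

```python
import copy
import xml.etree.ElementTree as ET

HIDDEN = "hidden"  # hypothetical marker; not an existing JMdict tag


def strip_hidden(root: ET.Element) -> ET.Element:
    """Return a deep copy of `root` without k_ele/r_ele elements marked hidden."""
    slim = copy.deepcopy(root)
    for entry in slim.iter("entry"):
        for ele_tag, inf_tag in (("k_ele", "ke_inf"), ("r_ele", "re_inf")):
            for ele in list(entry.findall(ele_tag)):
                if any(inf.text == HIDDEN for inf in ele.findall(inf_tag)):
                    entry.remove(ele)
    return slim


# Illustrative sample only; real JMdict uses entity references in ke_inf.
SAMPLE = """<JMdict>
<entry>
<ent_seq>1</ent_seq>
<k_ele><keb>のぼり旗</keb></k_ele>
<k_ele><keb>登り旗</keb><ke_inf>hidden</ke_inf></k_ele>
<r_ele><reb>のぼりばた</reb></r_ele>
</entry>
</JMdict>"""

extra = ET.fromstring(SAMPLE)   # would become the "JMDICT_EXTRA" file
regular = strip_hidden(extra)   # would become the regular file
print([k.text for k in extra.iter("keb")])
print([k.text for k in regular.iter("keb")])
```

Deriving the regular file from the extended one keeps a single source of truth, so systems that fetch the regular file would see no change.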

Or perhaps rather than placing the hidden fields into JMdict-proper, the hidden fields could be used to construct a differently formatted token dictionary that could be compatible with e.g. MeCab. (Admittedly I don't know much about MeCab's format and need to learn more about how it works).

Marcusjmdict commented 2 years ago

If clutter is a problem, then I think it should be addressed seriously. We could clear huge swaths of information and keep only the most essential kanji forms and readings, but to me this doesn't seem desirable or necessary.

To be clear, this is not what I'm arguing for - I don't think all "clutter" is absolutely undesirable, but there's a balancing act to be considered. I'm all for including obscure and unusual readings and kanji, but when it comes to rare surface forms that are merely orthographic variants not covered by other dictionaries, that's where I'd like to draw the line.

stephenmk commented 2 years ago

I agree that the current standards for inclusion are appropriate for determining which forms are important enough to be displayed to a user in a dictionary app. Anything outside of that scope could easily be labeled as such. I don't think "hidden" is a very descriptive label, but perhaps one could be chosen (how about "clutter"? 😄) to convey that the data does not meet the regular standards for inclusion. As it stands now, I feel like we're discarding good and perfectly useful surface forms simply because they don't have the blessing of the major dictionaries (compare: ⭕️昇り旗 vs ❌登り旗). I think it's a limitation on the versatility of JMdict.

These variant forms aren't as easy to verify as forms that can quickly be found in a dictionary, so it's arguable that their inclusion could place an extra burden on the editorial process that is disproportionate to their value. I'm the least qualified person here to speak on this, but to me it doesn't seem so bad. As long as it's made clear that the forms are not held to the same editorial standards, then the provision of some real-world usage examples and some n-gram counts seems like sufficient evidence for inclusion. I don't expect that we'd see a flood of these sorts of submissions, especially if Jim's proposed criteria are adopted. If this puts an end to the discussions about whether forms are worthy of inclusion, it may actually decrease the overall editorial burden.

stephenmk commented 2 years ago

Marcus left a note on the entry for 瞑想 today that I think is relevant to this discussion:

not a fan of an rK-tagged form (冥想) coming before a non-rK form (めい想). this is a good example why I would like to see [rK] reworded as just [rare surface form] or whatever so that めい想 could also get the same tag.

めい想 is only in the entry to aid lookups, right? None of the usual references seem to show it. I think it might be a good candidate for a "hidden" / "clutter" / "lookup" field. Alternatively (or additionally), a tag for 交ぜ書き forms might be nice. I'm not sure there's a way to programmatically determine with 100% accuracy whether or not a form is 交ぜ書き, so there could be value in having it made explicit.

I like how [rK] is narrowly defined and yet still manages to capture a lot of data under its net. Categories generally become less useful as they become broader in scope. I'm all in favor of creating new metadata tags, although I don't know how enthusiastic everyone else is about juggling around more of these code abbreviations. (With the user interface being as it currently is, it certainly increases the barrier to entry for new users too.)

JMdictProject commented 2 years ago

Picking up on a point from a few days ago. Marcus wrote:

I understand that having more surface forms is good for text-glossing but in most cases, it's something I think should be left up to the text-glosser.

The problem there is the glossers rely on lexicons. One of the principles I applied early in this project (~30 years ago now) was to have a single dictionary that could meet as many demands as possible, including supporting glossing systems, and that's why I'm keen to keep the coverage broad. I really do not want to split entries, or worse still particular surface forms, off into a separate glossing dictionary. It would add complexity to the overall project with little if any real benefit.

JMdictProject commented 2 years ago

Just a quick comment on:

めい想 is only in the entry to aid lookups, right? None of the usual references seem to show it.

I have no problem with めい想 being given an "rK" tag - it's a rarely-used kanji (containing) surface form. I have now added it to the entry.

JMdictProject commented 2 years ago

Stephen wrote:

As it stands now, I feel like we're discarding good and perfectly useful surface forms simply because they don't have the blessing of the major dictionaries (compare: o昇り旗 vs x登り旗). I think it's a limitation on the versatility of JMdict.

I absolutely agree. I think we have some way to go yet both in the criteria for inclusion of forms, and also in the exploration and discussion of possible "hidden" forms (and even further to go if we move towards implementing them). In the meantime, can I ask that people NOT drop existing forms unless there is near-unanimous agreement? It's actually hard to track them down if there's a later change.

robinjmdict commented 2 years ago

I have no problem with めい想 being given an "rK" tag - it's a rarely-used kanji (containing) surface form.

This isn't how we currently use the tag, hence Marcus's comment about wanting to change it. rK indicates that the form contains kanji that is rarely used in that word. It doesn't mean "this kanji-containing surface form is rarely used".

My position is that rare surface forms like めい想 should be searchable but "hidden", for reasons I outlined in #46. I don't think they should get the same "rare" tag as forms that appear in the major references.

I agree with Marcus that clutter in JMdict is a problem, and I think we should be looking at ways to reduce the number of surface forms that are displayed to users. I'm very much opposed to including irregular/variant surface forms that meet some low n-gram threshold like 500 counts unless we have a way of indicating that they should be hidden away.

stephenmk commented 2 years ago

Robin left this comment on the entry for 買い物 today:

I can see the argument for keeping かいもん on this entry (as it's not really a separate word) but I don't like that it introduces unsightly restriction tags (if 買いもの and 買いもん are included as surface forms). I think this is another illustration of how "hidden" readings/surface forms could be very useful. It would allow us to include -もん forms on other -物/-者 entries without adding any restr tags. I'd argue that even the reading doesn't need to be visible; what matters is that a -もん search takes you to the right entry.

Small digression: I think the unsightliness of the restriction tags depends upon their presentation. When these orthographic variants are organized into a table, for example, it's painless to assess an entry at a glance.

| | 買い物 | 買物 | 買いもの | 買いもん |
|---|---|---|---|---|
| かいもの | ★ | ★ | ● | |
| かいもん | ● | ● | | ● |
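
For anyone curious how such a grid can be derived, here is a minimal sketch that builds it from a mapping of readings to the forms they are restricted to. The data mirrors the table above; the function name and data layout are assumptions, and the ★/● distinction is not reproduced since only validity is modelled here.

```python
from typing import Dict, List, Optional, Set, Tuple


def form_reading_table(
    forms: List[str],
    readings: Dict[str, Optional[Set[str]]],
) -> List[Tuple[str, List[str]]]:
    """Build one row per reading; None means the reading applies to every form."""
    rows = []
    for reading, allowed in readings.items():
        row = ["●" if allowed is None or f in allowed else "" for f in forms]
        rows.append((reading, row))
    return rows


forms = ["買い物", "買物", "買いもの", "買いもん"]
readings = {
    "かいもの": {"買い物", "買物", "買いもの"},
    "かいもん": {"買い物", "買物", "買いもん"},
}

print("", *forms, sep="\t")
for reading, row in form_reading_table(forms, readings):
    print(reading, *row, sep="\t")
```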

I just posted a dictionary file for Yomichan which contains these tables for all JMdict entries with a minimum number of kanji forms and readings. (This is how I spotted that we had 買いもの + かいもん as a pair).

At any rate, it seems we're all more or less in agreement that "hidden" tags/fields would be a nice feature to have. (I'm always in favor of providing more well-defined and consistent metadata to developers). 「買いもん」 might be a good candidate for it, although I don't think its inclusion causes much harm when presented in a sensible way.

JMdictProject commented 2 years ago

When these orthographic variants are organized into a table, for example, it's painless to assess an entry at a glance.

Yes, when I first put the XML structure for JMdict together 20+ years ago I envisaged some sort of tabular layout but it wasn't really possible in the markup available.

This issue has broadened into mentions of two other distinct topics: the possibility of having "hidden" surface forms, and the handling of もの/もん entries. I think these should be discussed in their own issue threads, leaving this one for inclusion criteria. I plan to open two further issues to handle these, hopefully later today.

JMdictProject / JMdictIssues

Criteria for including kanji forms in entries #63