JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Hidden kanji and reading fields #46

Closed JMdictProject closed 2 years ago

JMdictProject commented 2 years ago

In the discussion on the soon-to-be-deleted 三蜜 entry (2850792), Robin wrote: "To aid searches, maybe we should consider adding hidden kanji and readings fields to jmdict. That would put an end to a lot of the arguments over whether to include particularly quirky or obscure forms. We could stick common typos in there as well."

It's an interesting thought. I actually have something like that as part of WWWJDIC. If you look up おはよぅ you get a "See: お早う" link. That's not coming from a dictionary; it's from a collection of potentially useful cross-references (mostly inflected verbs).

I'm not sure we really want such things embedded in JMdict itself, but it's worth discussing.

Marcusjmdict commented 2 years ago

I don't think it's necessarily a bad idea, but I wouldn't want to see it used too frequently for every conceivable typo (おはよぅ for example). There is an argument to be made for separating "irregular" and clearly "incorrect" (as in mistaken) use, and giving dictionary apps a choice whether to display them or not, I suppose.

JMdictProject commented 2 years ago

I've been thinking more about this, and I think it both has merit and can be relatively easily implemented. We'd need new tags, e.g. hK and hk, which would be alternatives to iK/ik and would signal to app builders that the editors consider the forms should only be used as lookup keys and should not be displayed.
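As a rough sketch of what such tags would ask of app builders: the snippet below keeps every form as a lookup key but drops tagged forms from the display list. The `Form` class, the function names and the sample data are hypothetical and purely illustrative; only the hK/hk tag names come from the proposal above.

```python
from dataclasses import dataclass, field

# Tag names as proposed above; hypothetical, not part of any released JMdict.
SEARCH_ONLY_TAGS = {"hK", "hk"}


@dataclass
class Form:
    """One kanji or reading form with its info tags (illustrative structure)."""
    text: str
    tags: set = field(default_factory=set)


def display_forms(forms):
    """Forms an app would show: everything not marked search-only."""
    return [f.text for f in forms if not (f.tags & SEARCH_ONLY_TAGS)]


def lookup_keys(forms):
    """Every form, tagged or not, stays usable as a search key."""
    return {f.text for f in forms}


# Illustrative data only.
forms = [Form("一杯"), Form("一ぱい", {"hK"}), Form("1ぱい", {"hK"})]
print(display_forms(forms))
print(sorted(lookup_keys(forms)))
```

An app that ignores the tags would simply keep showing all three forms, which is the risk raised later in this thread.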

Marcusjmdict commented 2 years ago

I've thought about this too but I think I've come to the opposite conclusion.

  1. We can't know dictionary app/site developers will actually hide these forms and I suspect the admins of many existing ones might not bother updating their codebases to do so. (This is just speculation, of course.) Arguably ik and iK forms should be hidden from view in the first place, unless they are the exact search word used.
  2. I don't think having these tags would accomplish much that we can't really do already with iK, ik and rK (and poss. a new [rk]) - it's all about how we use and define them. Robin has suggested that the "i's" should not be used for 変換ミス's like 3蜜 for 3密 whether they're common or not but I don't think this a) actually reflects how we've been using them up until now and b) I don't think the arguments are very convincing.

robinjmdict commented 2 years ago

We can't know dictionary app/site developers will actually hide these forms

This is one of the reasons I suggested separate fields (as opposed to tags). Makes it much less likely that a developer would include these forms with the non-hidden ones. I recognise this is harder to implement, though. Would probably have to be part of JMdict NG.

I don't think having these tags would accomplish much that we can't really do already with iK, ik and rK (and poss. a new [rk]) - it's all about how we use and define them.

I think we need to be careful not to overload the tags. The idea with hidden forms is that they're never displayed to users. We wouldn't want to hide a common ik form like ふいんき (雰囲気). A user should be able to see that reading and know that it's non-standard. rK forms aren't irregular and most websites/apps would want to display them for the sake of completeness.

Having hidden forms could also be a way to make entries less cluttered. In our pursuit to make JMdict as comprehensive as possible, many entries have become rather messy-looking. If a term consists of multiple words that can be written in either kanji, hiragana or katakana and with optional okurigana, the number of possible forms becomes very large. Some of our entries have 8+ forms, and while they should all be searchable, I don't think they all need to be visible, at least not by default.

A lot of (if not most) JMdict-based websites and apps relegate all but the first form to a list at the bottom of the entry. Ideally, forms like 代える, 換える, 替える should be given equal prominence (i.e. displayed at the top like they are in the JEs and kokugos) but it's not practical to do this when an entry can have 5/6/7/8+ forms.

Robin has suggested that the "i's" should not be used for 変換ミス's like 3蜜 for 3密 whether they're common or not but I don't think this a) actually reflects how we've been using them up until now and b) I don't think the arguments are very convincing.

I think there's a simple test. If the writer can recognise their 変換ミス as a mistake, it's not an irregular form; it's just a typo. Some irregular forms are born out of 変換ミス but that's usually because the kanji are obscure or difficult to distinguish from one another. I don't think that's the case with 蜜 and 密. I'm no kanji master and I can immediately tell that 3蜜 is wrong.

Marcusjmdict commented 2 years ago

ふいんき is a pretty compelling example of a reading that is considered incorrect but is so very common and everyday that it shouldn't be hidden away. I have some reservations still but I withdraw my objection.

I had missed/already forgotten you suggested a separate field for these - that strikes me as a very good idea.

JMdictProject commented 2 years ago

This topic has been discussed a bit in issue #63, where I said I'd open a separate issue, but of course, there is no need to do so. It has also been discussed recently in the 買い物 entry (https://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1589730).

Hidden fields are alive and well in online dictionary systems. If you look up 手こずる in the KOD interface to the GG5 you are taken to its 手古摺る/てこずる entry. Nowhere in that entry is 手こずる mentioned but Kenkyusha obviously has it in the database of indices (we have it as the primary form for that entry.)

As I understand it, we are considering mechanisms for classifying and recording some surface forms such that app/website developers receive a strong message that the forms are considered unsuitable for inclusion in the regular display of an entry, due to them being irregular and/or of low frequency. The forms can, of course, be used for searching for the entry. Another concern is that these surface forms do not "clutter" the interface of the development/maintenance database (JMdictDB.)

There seem to be two ways these "hidden" forms could be identified within the JMdict system.

The first approach would be to create new tags to label these forms. Something like [hK] ("kanji form primarily for searching") and a kana equivalent. The advantage of this approach is that it could be implemented almost immediately with a simple addition to a database table. No software changes would be required in the database system and there would be no change to the structure of the main XML distribution, or others derived from it, such as the EDICT2 format. A disadvantage would be that it does nothing to alleviate perceptions of "clutter" in some entries; in fact, it may add to it if it makes it easier and more acceptable to add marginal forms. It could also be seen as overloading the tag system, which Robin cautioned against earlier in this issue.

The second approach is to modify the database to have one or two more groups of "hidden" surface forms in addition to the regular Kanji/Reading groups. This would mean quite significant software changes which would either have to be included in the forthcoming "NG" development or be carried out at a later stage. It would also mean a modification to the structure of the JMdict XML, although it is possible that the "hidden" forms could just be included with the others along with suitable tags. Depending on how the hidden forms actually appear in the XML there may have to be significant software changes in the software using the JMdict XML distribution.

My recommendation for the way forward with this issue is:

A. Do an initial implementation of the hidden form capability using the present system with hK/hk tags. This would give us a chance to "move forward" on the issue and actually see what level of clutter it adds. We could also start preparing app developers et al. for the arrival of search-only forms (I wouldn't mind experimenting in WWWJDIC on ways of hiding such forms.)

B. At the same time explore the addition of the separate handling of the hidden forms to the JMdict NG development. At this stage we don't know the extent of the additional work required. The timing of the NG development is not clear as it depends totally on Stuart McGraw's good offices, and he is very busy at the moment.

JMdictProject commented 2 years ago

While we are working on this issue, it would be good to clear a number of pending edits that involve the proposed removal of forms that I think would be possible candidates for "hidden" status.

It can be difficult to identify entries where forms have been deleted. As an interim measure, I propose to include in the Comments field a line beginning "HiddenForm" followed by the candidate form(s). For example, the 一杯 entry has a pending edit for the removal of the 一ぱい and 1ぱい forms. I'll approve that edit, and also include a HiddenForm comment to enable it to be detected later if appropriate.
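To illustrate how such comment lines could be picked up later, here is a minimal sketch that scans a comments string for lines beginning "HiddenForm" and collects the forms that follow. The exact layout of the line is an assumption: the optional colon and the separators (commas, semicolons, whitespace) are guesses, and `hidden_form_candidates` is a hypothetical helper, not part of JMdictDB.

```python
import re


def hidden_form_candidates(comments: str) -> list:
    """Collect forms listed on lines that begin with 'HiddenForm'."""
    found = []
    for line in comments.splitlines():
        line = line.strip()
        if not line.startswith("HiddenForm"):
            continue
        # Assumed layout: optional ':' or '=' after the keyword, then the forms.
        rest = line[len("HiddenForm"):].lstrip(":：= \t")
        # Assumed separators: ASCII/Japanese commas, semicolons, whitespace.
        found.extend(p for p in re.split(r"[,、;；\s]+", rest) if p)
    return found


print(hidden_form_candidates("Approved edit.\nHiddenForm: 一ぱい, 1ぱい"))
```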

stephenmk commented 2 years ago

I've mentioned this before and it's a bit of a nitpick, but I'm still not a fan of the "hidden" wording. I think it's too prescriptive or at least overly presumptuous of how the data is to be used. I'd rather say something descriptive like "low-priority orthographic variant" or even "omissible." To say that some forms are "hidden" implies that all the other forms are "not hidden," neither of which is going to make sense in every use case.

I've been leaving "HiddenForm" comments on some entries that may apply. As we gather more candidate forms, it should hopefully become easier to decide more precisely what the criteria for inclusion should be.

JMdictProject commented 2 years ago

I agree that when we get to distributing editions with these forms identified in some way it would be best to use something other than "hidden". Something indicating low-priority might be best.

Marcusjmdict commented 2 years ago

I think we've sort of informally decided we're implementing a hiddenForm type of thing. I guess I'm OK with it, but I really really don't think they should be listed in the kanji/readings field with a hiddenTag in the jmdictdb interface, but instead given a separate input box underneath "readings" or even at the very bottom underneath "comments", where several readings as well as kanji forms can just be thrown in.

JMdictProject commented 2 years ago

There are three distinct aspects to these "low-priority", "use-only-as-search-keys", etc. forms:

Marcusjmdict commented 2 years ago

I don't think there's any point in differentiating between kanji and kana forms here, it just sounds like unnecessary work for something that's not going to be visible to the end-users anyway. As such, I don't think they should be included with the regular forms. I would prefer we didn't implement any reading/kanji tags for this but instead waited until the JMdictDB interface can be amended, whenever that may be.

yamagoya commented 2 years ago

There's a purely technical reason to distinguish between kanji and kana for the hidden items: the JMdictDB database (and probably many databases of XML users) stores the kanji and kana strings in different tables, and the strings need to go into the right table. If the user entering them doesn't distinguish between them, the code has to. There are already some contexts in JMdictDB where this is done, but there are corner cases where it does not work reliably (https://gitlab.com/yamagoya/jmdictdb/-/issues/26). So it seems simpler to let a user make that determination when entering or editing the entry (by, for example, having two separate input boxes for hidden kanji and kana, or with tags in the existing Readings and Kanji boxes) and to carry that information forward through the XML to other developers.
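For a sense of what "the code has to" means, here is a naive sketch of automatic routing. It assumes a simple rule (a string made up only of characters from the Hiragana and Katakana Unicode blocks is a reading, anything else is a kanji form); this is not JMdictDB's actual logic, and `is_kana_only` and `target_table` are hypothetical names.

```python
def is_kana_only(text: str) -> bool:
    """True if every character is in the Hiragana or Katakana Unicode blocks."""
    # U+3040-U+309F is Hiragana; U+30A0-U+30FF is Katakana (includes ー and ・).
    return bool(text) and all("\u3040" <= ch <= "\u30ff" for ch in text)


def target_table(text: str) -> str:
    """Assumed routing rule: kana-only strings are readings, all else kanji forms."""
    return "reading" if is_kana_only(text) else "kanji"


for s in ("おはよぅ", "買いもん", "1ぱい", "コーヒー"):
    print(s, target_table(s))
```

A heuristic like this says nothing about half-width katakana, punctuation or mixed-script strings, which is the kind of corner case that makes an explicit choice by the editor attractive.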

JMdictProject commented 2 years ago

I quite agree. Not differentiating between kanji and kana forms actually adds a lot of complexity to the downstream handling of entries. The relatively simple tagging I suggest means little if any change is needed in this area. It's not just a matter of the JMdictDB interface - it's the XML structure too and everything that uses it.

JMdictProject commented 2 years ago

I have discussed this matter with Stuart in some depth, and we are agreed that these forms should reside in the database with the other reading and kanji fields, and should be distributed in the XML file in the existing elements with appropriate details identifying them. This means little or no change to the system.
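For downstream developers, a hedged sketch of what reading such an entry might look like follows. The element names (k_ele, keb, ke_inf, r_ele, reb, re_inf) are the existing JMdict ones; the marker text "search-only" is a placeholder of my own, since the real distribution expresses tag values through DTD entities and the final tag wording is not settled in this comment. The sample entry is illustrative, not copied from the database.

```python
import xml.etree.ElementTree as ET

# Placeholder marker text (assumption); the real tag value is not decided here.
SEARCH_ONLY = "search-only"

# Illustrative fragment only.
SAMPLE = """
<entry>
  <k_ele><keb>一杯</keb></k_ele>
  <k_ele><keb>一ぱい</keb><ke_inf>search-only</ke_inf></k_ele>
  <r_ele><reb>いっぱい</reb></r_ele>
</entry>
"""


def split_forms(entry_xml: str):
    """Return (visible, search_only) lists covering kanji and reading forms."""
    entry = ET.fromstring(entry_xml)
    visible, hidden = [], []
    groups = (("k_ele", "keb", "ke_inf"), ("r_ele", "reb", "re_inf"))
    for group, text_tag, inf_tag in groups:
        for ele in entry.findall(group):
            infos = {inf.text for inf in ele.findall(inf_tag)}
            target = hidden if SEARCH_ONLY in infos else visible
            target.append(ele.findtext(text_tag))
    return visible, hidden


print(split_forms(SAMPLE))
```

Because the forms sit in the existing elements, a consumer that does not check the info tag gets them mixed in with the regular forms.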

With regard to the UI, it is certainly possible to have these forms handled in their own text boxes, perhaps beside or under the regular ones, but that should wait until the completion of the NG development. For now, the special forms can be handled via the existing interface using our normal tagging process.

I propose to add the following to the kanji and reading info tag sets:

Once that's done the "HiddenForm" info in the 60+ entries that have it can be inserted and/or tagged.

While I'm at it I'll remove the unused [uK] reading-info tag.

robinjmdict commented 2 years ago

As pointed out in an earlier comment, one of the problems with this approach is that unless developers update their codebases, these forms will be included among the existing forms, resulting in even more clutter than there was before.

Also, if search-only forms introduce restr tags that otherwise wouldn't exist (for example, a kanji form that applies to a particular reading), it requires developers to write some logic to hide these restrictions.

It might be best to hold off on introducing search-only forms until JMdict NG, even if that's a long way away.

JMdictProject commented 2 years ago

As we've only identified search-only/hidden forms for about 70 entries (out of almost 200k) I don't think clutter is a problem at this stage. I'd like to get the downstream structure in place and a handful of entries using it so that developers can try things out. I'll be contacting the main ones (jisho.org, Imawa, etc.) to make sure they're aware of the move. I'd like to do some work on them within WWWJDIC so having them live in a sample of entries would help.

Since we're effectively recommending that these forms are not (or needn't be) displayed, they really shouldn't participate in the restriction structure (discussed a bit in the もん/もの issue). In fact, these forms should have no other tags.

I also need to draft a section for the Editorial Policy page explaining it.

robinjmdict commented 2 years ago

Since we're effectively recommending that these forms are not (or needn't be) displayed, they really shouldn't participate in the restriction structure (discussed a bit in the もん/もの issue)

This would be an issue for pop-up dictionary browser extensions like Yomichan, 10ten, etc., which only display the entry information that's relevant to the surface form that's under the cursor. If 買いもん is tagged as a search-only form but not linked to any particular reading, these extensions would be forced to show both かいもの and かいもん as readings for this form, even though only the latter is valid. What's more, かいもの would lead, adding to the confusion.

I forgot about this when I made my comment on the 買い物 entry about restriction tags and hidden forms.

Edit: "Confusion" might be overstating it; pop-up dictionaries would simply display the entire entry instead of just parts of it. But it's still unexpected behaviour.
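The pop-up dictionary behaviour described above can be sketched as follows. The data is modelled on the 買い物 example and is illustrative only, not copied from the actual entry; `readings_for` is a hypothetical helper that follows the usual convention that a reading with no restriction applies to every kanji form.

```python
def readings_for(surface, readings):
    """Readings that apply to a kanji form; an empty restriction list means all forms."""
    return [r for r, restr in readings if not restr or surface in restr]


# Search-only form left out of the restriction structure: both readings match.
unrestricted = [("かいもの", []), ("かいもん", [])]
# Readings linked to specific kanji forms: only the valid one matches.
restricted = [("かいもの", ["買い物"]), ("かいもん", ["買いもん"])]

print(readings_for("買いもん", unrestricted))
print(readings_for("買いもん", restricted))
```

Without the restriction, かいもの is listed first for 買いもん even though only かいもん is valid for that form.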

JMdictProject commented 2 years ago

Yes, it's going to depend a lot on how downstream apps and sites handle these forms. For the glossing function in WWWJDIC I use the complete entry, and when the match is on a derived form I preface it with "XXXX from [entry]". For these ones, I'll probably make it "XXXX linked to [entry]" or something like that.

JMdictProject commented 2 years ago

OK, I have updated the database tables so that the [sK] and [sk] tags are available. I'll close this issue now and open another dealing specifically with them.