JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Handling お~/ご~/御~ variant forms. #79

Open JMdictProject opened 1 year ago

JMdictProject commented 1 year ago

In the お茶を濁す entry I commented:

> お茶を濁す 46825 95.0%
>
> 茶を濁す 2443 5.0%
>
> I wonder if we can drop the 茶を濁す entry altogether, and just have it as an sK form here? Most of the JEs only have お茶を濁す.

We have often discussed the value of including お~ variants as entries. Reference dictionaries don't usually include them, but they are often quite common, for example:

仕事 112677751 81.9%

お仕事 24918265 18.1%

We've usually (sometimes reluctantly) agreed to include the more common ones.

It's occurred to me that the establishment of the [sK] search-only forms could be a way of handling them. It would mean having search forms that don't align completely with the reading of an entry, but it might be useful to consider it as an approach.

robinjmdict commented 1 year ago

I'm uncomfortable with including forms that don't align with the entry's reading, considering that many sites and apps do not yet (and may never) support the sK tag.

I don't think there's any need to add search-only お〜/ご〜 forms to existing entries. Users are unlikely to include the お/ご in lookup queries. Incidentally, I'd rather we didn't have separate お〜/ご〜 entries either (unless they're in the dictionaries). I think they're unnecessary.

JMdictProject commented 8 months ago

This has come up on the 車代 entry. I see GG5 has お車代 as an sK for its 車代 entry.

車代 46359

お車代 16072

御車代 1857

In some ways, having お車代 searchable in the 車代 entry is less messy than having two entries.

stephenmk commented 8 months ago

I guarantee that "くるまだい【御車代】" is going to start proliferating throughout the various JMdict-based websites and apps.

These search forms aren't making it easier for anyone to find this entry, so there's really no point. It should be sufficient to add sense notes explaining that the word is usually 御〜 for particular meanings.

Marcusjmdict commented 8 months ago

I don't have data on what people are actually looking up, but I wouldn't be surprised to see lots of people including the お/ご if that's how they encounter the word. I understand the objections, but do we have data on how many of the apps and dictionary sites that are still updating their source files (probably only a minority to start with) have not yet implemented sK/sk? When it comes to websites, I would assume jisho.org has cornered probably 90% or more of the "market", and they've implemented both, as I understand it.

stephenmk commented 8 months ago

I think it's fair to assume that any app designed for performing Japanese dictionary lookups is going to include rudimentary text parsing functionality.

Jisho has no trouble parsing 車代 from 御車代:

![jisho_kurumadai](https://github.com/JMdictProject/JMdictIssues/assets/8003332/f0bed8d7-e721-4fe9-a269-54d623da647d)

Neither do yomichan derivatives:

![yomichan_kurumadai](https://github.com/JMdictProject/JMdictIssues/assets/8003332/fb02a838-9fc6-42b6-8531-c6b942c13897)

Search-only fields are useful for associating search keys that are not trivial to associate. I don't think it would be fair to assume that apps could find an entry like "猫踏んじゃった" from a search query like "ねこ踏んじゃった". Doing so would require parsing the input, searching for different possible combinations of surface forms, or otherwise using some clever regex / partial-text patterns. I'm not aware of any readily available off-the-shelf technology for facilitating these kinds of conversions.

Using the search fields to direct from 御車代 to 車代 makes about as much sense as adding な-suffixed search fields to な-adjective entries. It's a simple A+B operation that anyone with an elementary understanding of Japanese is already well aware of. Unlike the "猫踏んじゃった" example, there aren't multitudes of combinations to consider in order to find the main entry; apps and users know that you just need to drop the 1-character prefix.
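As a rough illustration of the "drop the 1-character prefix" fallback described above, a lookup routine might retry a failed query without its leading お/ご/御. This is only a minimal sketch under stated assumptions, not how Jisho or Yomichan actually work: the `ENTRIES` index, the `HONORIFIC_PREFIXES` tuple, and the `lookup` function are all hypothetical names made up for the example.

```python
# Hypothetical toy index (headword -> reading); not a real JMdict data structure.
ENTRIES = {
    "車代": "くるまだい",
    "仕事": "しごと",
    "お茶を濁す": "おちゃをにごす",
}

# Single-character honorific prefixes discussed in this thread.
HONORIFIC_PREFIXES = ("お", "ご", "御")


def lookup(query, entries=ENTRIES):
    """Return (headword, reading) for query, or None if nothing matches.

    An exact match always wins, so entries that are recorded with the
    prefix (e.g. お茶を濁す) are found as-is. Only when the exact lookup
    fails is the query retried without its leading お/ご/御.
    """
    if query in entries:
        return query, entries[query]
    if len(query) > 1 and query.startswith(HONORIFIC_PREFIXES):
        stripped = query[1:]  # drop the 1-character prefix
        if stripped in entries:
            return stripped, entries[stripped]
    return None


print(lookup("御車代"))
print(lookup("お茶を濁す"))
```

Note that the fallback returns the base entry's reading (くるまだい here), so on its own it says nothing about whether the dropped 御 should be read お or ご.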

I wouldn't argue against having a dedicated entry for おくるまだい. The editors of daijisen think that 御車代 is important enough to record in its own entry. Having these sorts of entries is, for example, useful for applications which use JMdict data for applying furigana to texts. Notably, the screenshots of jisho and yomichan that I included above both mistakenly apply ご to 御.

The furigana use-case is one example of why I'm adamant that search forms should still align with entry readings. It's one thing to have a search form use づ or ー with a reading that has ず or う, as these characters are basically interchangeable. I think adding characters to search forms that aren't reflected in the readings would be spoiling the consistency of JMdict data, and the reasons for doing so aren't compelling.

robinjmdict commented 8 months ago

I agree with Stephen.

It's often reasonable to have a separate entry for a common お/ご/御 form if it's the usual way of writing the word (or certain senses of the word). We already do this. I have no objections to a dedicated お車代 entry.

But otherwise I don't think we need to be concerned about the searchability of these forms.