JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Search-only forms #75

Closed JMdictProject closed 1 year ago

JMdictProject commented 1 year ago

As mentioned in the (closed) #46 issue, the [sK] and [sk] tags are now active. Their text descriptions are:

To enable a bit of initial downstream testing I have converted the "hiddenform" forms in two entries (3密 and カネロニ) to the tagged versions. I'd like to ask that contributors: (a) stop using the "hiddenform" flagging and instead directly propose [sK] and [sk] tags if any more come up; (b) hold off from converting any more flagged entries until I've tested it a bit.

I'll draft a section for the Editorial Policy and also a general notice particularly aimed at app/online site developers.

JMdictProject commented 1 year ago

I have:

I'll put out a message on the mailing list too.

robinjmdict commented 1 year ago

I think it would be helpful to list (or at least give a few examples of) the types of forms that should be tagged as sk/sK. Off the top of my head: kanji typos (変換ミス), uncommon 混ぜ書き forms, uncommon itaiji and kyūjitai, uncommon irregular okurigana forms, uncommon irregular readings.

The new section includes this sentence:

It is suggested that developers of dictionary apps and sites use these forms for searching purposes, but not show them as part of the full entry.

I would be more explicit and say that these forms should never appear as part of the full entry. They don't participate in the restriction structure and there's nothing to indicate whether they're irregular (e.g. 変換ミス) or just uncommon (e.g. itaiji). Displaying them alongside the non-sk/sK forms would be confusing/unhelpful for users.
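As a rough illustration of the recommendation (search on every form, display only the untagged ones), here is a minimal sketch. The `Form` class and the sample data are hypothetical and are not JMdict's actual schema or any app's real implementation.

```python
from dataclasses import dataclass, field

# Tags that mark a form as search-only (sK for kanji forms, sk for kana forms).
SEARCH_ONLY_TAGS = {"sK", "sk"}


@dataclass
class Form:
    """Hypothetical container for one surface form and its tags."""
    text: str
    tags: set[str] = field(default_factory=set)


def visible_forms(forms):
    """Forms to show in the full entry: everything without a search-only tag."""
    return [f.text for f in forms if not (f.tags & SEARCH_ONLY_TAGS)]


def search_keys(forms):
    """Forms to index for lookup: all of them, including search-only ones."""
    return {f.text for f in forms}


# Illustrative data only; not copied from the real entry.
entry_forms = [Form("三密"), Form("3密"), Form("三蜜", {"sK"})]

print(visible_forms(entry_forms))
print(sorted(search_keys(entry_forms)))
```

With this split, a search for 三蜜 still finds the entry, but the form never appears in the displayed headword list.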

> I'll put out a message on the mailing list too.

Do we also need to contact Weblio? It's probably the largest site that uses JMdict data. As an aside, I've noticed that Weblio doesn't support our newer misc tags (e.g. dated, net-sl, hist). It also doesn't support any field tags or restr tags, which is a more significant issue...

JMdictProject commented 1 year ago

Thanks. I've beefed up the section in question to contain quite a few of your suggestions. I may make it a separate page altogether as developer-oriented recommendations are not exactly editorial policy.

I haven't been in touch with the Weblio people for many years so it may be a fresh start. I think they reload from JMdict about once a month. I wasn't aware they weren't tracking recent tag changes, but perhaps that's because of the update cycle.

robinjmdict commented 1 year ago

The timestamped pseudo-entry shows that Weblio's JMdict data is from 26th July 2022, so presumably they're updating it monthly. It's most likely an automated process and they're not even aware of tag changes. Supporting new/changed tags involves a little more work for Weblio than most sites/apps because they have to be translated into Japanese.

What's more concerning is that restriction tags are completely ignored. Apparently 切断 is read さいだん...

[Screenshot 2022-08-16 at 11 19 38]

They don't even lead with せつだん, which suggests they just list all the readings in the entry in 五十音 order.

stephenmk commented 1 year ago

Nitpick: our handling of mazegaki forms containing hyougai kanji is currently a little uneven.

へ理屈 is hidden on entry 1566360 (屁理屈), but 流ちょう is not hidden on entry 1552430 (流暢). Both of these mazegaki forms have about 10k counts and ~5% relative usage.

robinjmdict commented 1 year ago

Yes, we need to decide on a threshold for sK/sk. I suggest something reasonably high like <10% of the total n-gram counts of all the surface forms for the entry. Any threshold would, of course, only apply to forms that belong to one of the categories listed above.
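The proposed threshold could be checked along these lines. This is a minimal sketch: the function name, the `eligible` set (forms already judged to belong to one of the listed categories), and the counts are all assumptions made for illustration. The counts were chosen only so that へ理屈 comes out at about 5%, in line with the figure mentioned above.

```python
def search_only_candidates(counts, eligible, threshold=0.10):
    """Return eligible forms whose share of the entry's total n-gram count
    is below the threshold. Forms outside `eligible` are never returned."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        form
        for form, n in counts.items()
        if form in eligible and n / total < threshold
    ]


# Illustrative counts only, not real n-gram data.
counts = {"屁理屈": 190_000, "へ理屈": 10_000}
eligible = {"へ理屈"}  # assumed to have been categorised by an editor

print(search_only_candidates(counts, eligible))
```

Lowering `threshold` to 0.01 would leave へ理屈 visible in this example, which is the difference between the two proposals in this thread.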

I also want to ask about forms that only differ in kana usage (e.g. タカ派 and たか派, 地べた and 地ベタ). Jim has argued for the inclusion of these forms on the grounds that it helps text glossers. But given that literally any word in Japanese can be written in katakana or hiragana, is it not reasonable to expect the glossers to handle the conversion themselves? Surely they should be able to parse hiragana and katakana as though they were the same. Pop-up dictionaries like Yomichan and 10ten do this just fine. Even something silly like タか派 is matched to the correct entry.
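For reference, the kind of kana folding described here can be sketched in a few lines. This is an illustration only, not the code used by Yomichan or 10ten; it relies on the fact that the katakana block U+30A1 to U+30F6 sits exactly 0x60 code points above the corresponding hiragana.

```python
def katakana_to_hiragana(text):
    """Fold katakana to hiragana so both scripts compare as equal.
    Characters outside U+30A1..U+30F6 (kanji, ー, etc.) are left unchanged."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x30A1 <= cp <= 0x30F6:
            out.append(chr(cp - 0x60))
        else:
            out.append(ch)
    return "".join(out)


for form in ["タカ派", "たか派", "タか派"]:
    print(form, "->", katakana_to_hiragana(form))
```

All three spellings fold to the same key, so a single dictionary form is enough for lookup once both the query and the index are folded.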

I don't see the benefit of including obscure hiragana/katakana variants (even as search-only forms) to help out text glossers if we're not going to do it for all words. None of the forms below is in JMdict, and they're all far more common than たか派 or 地ベタ. I don't think it's practical to try to record all these forms.

n-grams
ドコ 789083
ナゼ 295828
アタマ 727541
シゴト 641176
クダサイ 89958
ヤメる 11938
ウツクシイ 5812

n-grams
てれび 242879
ぱそこん 103141
すーぱー 70832
かめら 36753
いらすと 18108
たいぷ 17995
どらいぶ 16394

JMdictProject commented 1 year ago

> Nitpick: our handling of mazegaki forms containing hyougai kanji is currently a little uneven.

Well, it's really case-by-case. I kept it off 流ちょう as a result of the dialogue about it. For 屁理屈 I added it because Robin had flagged it - I hadn't actually noticed that 屁 is 表外字.

I think it's appropriate to keep 混ぜ書き forms involving 表外字 visible unless they are quite rare, e.g. <= 1% or so.

robinjmdict commented 1 year ago

I think <= 1% is too low. The most important thing is that 交ぜ書き forms are searchable. I don't think it's necessary to have them visible, except to tell users "this is a common way of writing this word". Hiding them is unlikely to cause confusion. In my previous comment I proposed a 10% threshold. For 交ぜ書き forms specifically, I think that's already quite low.

Jim, can I hear your thoughts about the kana usage point I raised above?

stephenmk commented 1 year ago

According to the Google n-gram counts, 虎視たんたん is ten times more common than 虎視タンタン, ビン詰め is three times as common as びん詰め, and 流チョウ is never used. These could arguably be good reasons to display mazegaki forms to users.

But perhaps the google n-gram counts alone aren't always clear or authoritative enough for us to make assertions about which forms are preferable. If these questions are important enough to users, they might be better off consulting their company style guide or something rather than JMdict.

Putting that aside, I also don't think there's much value in displaying these rare mazegaki forms explicitly. It's already fairly simple to indicate the presence of hyougai kanji to users (as WWWJDIC does). That alone should be enough to signal that those kanji won't be seen in certain contexts. 10% sounds like a reasonable threshold to me.

There's always the option of leaving these decisions up to app developers by making these metadata tags more fine-grained ("rare mazegaki", "rare irregular", "rare itaiji", etc.), but I'm not so sure there's much enthusiasm for that among contributors or app designers.

JMdictProject commented 1 year ago

Robin wrote:

I also want to ask about forms that only differ in kana usage (e.g. タカ派 and たか派, 地べた and 地ベタ). Jim has argued for the inclusion of these forms on the grounds that it helps text glossers. But given that literally any word in Japanese can be written in katakana or hiragana, is it not reasonable to expect the glossers to handle the conversion themselves? Surely they should be able to parse hiragana and katakana as though they were the same.

Parsing text which does not include word boundaries is an ancient problem - the Romans struggled with it. Japanese which is written mostly in kana is a big challenge as there are so many homophones. My particular interest is in supporting the glossing function in WWWJDIC, which is quite old and predates the sophisticated ML-based morphological analysis systems that are around now. In WWWJDIC I use a greedy algorithm to try to find the longest matches from the last known term. For forms with kanji this is generally OK, but with kana it's quite a challenge. In the case of hiragana, I use the "uk" terms in JMdict plus an extras file. For katakana, it's generally OK as they are sparser and a hiragana->katakana switch generally means a word boundary. That's why the glosser is sensitive to hiragana/katakana differences, unlike the general single-term lookup where a search for たか派 will match with タカ派.
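To make the greedy approach concrete, here is a minimal sketch of longest-match segmentation. It is not the WWWJDIC glosser itself; the function name, the `max_len` limit, and the tiny lexicon are assumptions chosen for illustration.

```python
def greedy_gloss(text, lexicon, max_len=10):
    """Scan left to right, taking the longest lexicon entry at each position.
    If nothing matches at a position, skip one character and try again."""
    matches = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                matches.append(candidate)
                i += length
                break
        else:
            i += 1  # no term starts here
    return matches


text = "たか派の政治家"
with_kana_form = {"タカ派", "たか派", "の", "政治家"}
without_kana_form = {"タカ派", "の", "政治家"}

print(greedy_gloss(text, with_kana_form))
print(greedy_gloss(text, without_kana_form))
```

Because this matcher compares strings exactly, the second call misses たか派 entirely, which is the degradation described above when such forms are stripped out of the data.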

It's fair to ask why I'm persisting with this old and clunky method - well it works reasonably well provided it's got the terms to chew on. Stripping stuff out degrades it, which is why I want to keep forms like たか派 available. Hiding them away as sK/sk terms is fine as I can use them during glossing.

Why don't I upgrade to a more modern parser? Well I have experimented with that; the "new-generation glossing system" option on the edrdg.org server uses the MeCab parser and Unidic morpheme lexicon instead of the old approach. It mostly works fine but I've never quite got all the bugs out of it, and the quirks of installing MeCab/Unidic on different platforms make deploying it a bit messy. Maybe if I can shake up some spare time I'll get it into an acceptable state, and I'll be less concerned about the coverage of kana forms.

JMdictProject commented 1 year ago

Kim has changed the jisho.org server so that the sK/sk-tagged forms are now hidden. See: https://jisho.org/search/%E4%B8%89%E8%9C%9C

It's looking good. An interesting issue is that since the lookup key is 三蜜, it displays the details of those kanji and not 三密, etc. It's a challenge to work out how best to handle that situation.

JMdictProject commented 1 year ago

Just a quick heads-up that the Aedict (Android) dictionary app now handles the sK/sk tags. It won't show the forms but uses them for searches.

The change was in fact made in the tailored dictionary file it uses.

I haven't heard from the imawa contact yet.

birtles commented 1 year ago

10ten Japanese Reader (Rikaichamp) and hikibiki have also been updated.

stephenmk commented 1 year ago

Would odd Unicode glyphs such as ㍻, ㍼, ㍽, ㍾, and ㍿ be appropriate as search-only forms? We currently don't include these in JMdict.

see: https://en.wikipedia.org/wiki/CJK_Compatibility

JMdictProject commented 1 year ago

I don't think they'd cause a problem but are they needed? Is anyone really likely to come across ㍽ in a context where its meaning is unclear? Ditto for something like ㍿.

stephenmk commented 1 year ago

I wouldn't put it past myself to be confused by it. I might go searching for an 8-stroke kanji composed of 大 and 正.

It doesn't seem like they're very common, but they are suggested by my IME.

Glyphs such as ㌀ are probably a separate problem since they're not quite readings and not quite kanji.

birtles commented 1 year ago

For what it's worth, 10ten expands these forms at the same point where it converts half-width katakana to full-width, decomposed characters to composed characters, enclosed characters to regular characters etc. and we'd probably prefer to keep doing it this way where possible since it lets us know which actual reading we matched on (so we can highlight it appropriately).
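As an illustration of app-side expansion in general (not 10ten's actual code, which is not shown in this thread), Unicode NFKC normalization already covers several of these cases. The sketch below uses Python's standard `unicodedata` module; the helper name `expand_compat` is made up for the example.

```python
import unicodedata


def expand_compat(text):
    """Apply NFKC normalization, which expands CJK compatibility squares
    such as ㍽ and composes half-width katakana into full-width forms."""
    return unicodedata.normalize("NFKC", text)


for glyph in ["㍻", "㍼", "㍽", "㍾", "㍿", "ｶﾞ"]:
    print(glyph, "->", expand_compat(glyph))
```

Note that NFKC also rewrites other characters (full-width Latin letters, for instance), and on its own it does not record which original characters produced which output, so an app that wants to highlight the matched text would need to track offsets separately.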

stephenmk commented 1 year ago

I think that's a good point. There are so many of these special combined / enclosed characters that it's probably better to leave it to the apps to parse them.

briankrznarich commented 1 year ago

Rough question: Should "canonical" kanji forms be explicitly excluded from [sK], regardless of commonality?

I've read this thread, the other threads including sK, and the editorial policy to try and catch up. Having written all my thoughts, then re-read this thread and the editorial policy a second time, I'm pretty sure the issue I'm raising here should not be an issue at all... (Apologies if this should have been a new thread, it seemed applicable to general sK policy).

I had a small disagreement over the inclusion of the kanji form 思う壺に嵌る. The kanji for 嵌る (はまる) was not included in any surface form, so I added it (it had been omitted from the initial entry long ago). It was accepted, but marked [sK]. One particular issue raised is that 壺 has 3 very common surface forms, so just listing 思う壺, 思うつぼ, and 思うツボ has already cluttered up the entry. For reference, we agree that 嵌-using forms represent perhaps ~8% of real-world use.

思う壺に嵌っ │ 228 │ 5.3% │
思うツボに嵌っ │ 55 │ 1.3% │
思うつぼに嵌っ │ 34 │ 0.8% │
This one uses 壷:
思う壷に嵌っ │ 33 │ 0.8% │

Discussion here: http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&e=2222152

But regardless of clutter, I don't see how the full kanji form 思う壺に嵌る falls into any of the currently proposed [sK] candidate categories:

This is the most-common full-kanji form. The characters themselves (both 壺 and 嵌) are not jouyou, but are not uncommon either. Is this not sufficient to exclude 思う壺に嵌る from consideration as [sK]?

=============

Related policy question on "combinatorial" multi-kanji terms... Should any "valid" kanji with reasonable use get at least one visible surface form?

つぼ has two kanji forms, 壺 and 壷. Mercifully, use of 壷 in this particular expression is particularly rare, so I don't think [sK] on it is any great loss. But if it were more common, this would invite the possibility of つぼ・ツボ・壺・壷 x はまる・嵌る (8 entries). In that case, there is so much clutter that I would completely agree that the less-common 壷 should be at least mostly suppressed. But would it be appropriate to show no surface forms with that kanji?

As a user I usually find it odd when a kanji variant appears in one term, but not in another. "Why are 壺 and 壷 synonymous, but 壷 is not used in this expression...?" For example, right now "思う壷" is a visible form for the "思う壺" entry, but 思う壷にはまる is now [sK]. This raises questions when seeing them side-by-side, as I regularly do when browsing the dictionary by kanji (as kanji studiers do).

Sometimes there is a reason that a kanji variant should not be used in an expression, and that is worth being aware of. But if there isn't a reason, and the variant kanji is used with any sort of reasonable frequency, I think it might be valuable to have the kanji appear in at least one surface form. (Even now, 壷 appears only once in the entry, not in every conceivable combination, but it is [sK].)

In this case, it looks like 壷 is so rare here that it's barely applicable and I don't care one way or the other, but as a policy matter it still might be worth considering.

stephenmk commented 1 year ago

I noticed that Sanseido's kokugo encloses uncommon kanji forms in expressions with parentheses to indicate that they are usually kana. So for example, おもうつぼ is presented as [思う(壺)]. That sort of setup seems ideal to me, although I assume it's probably not feasible to implement in the current format of JMdict.

My understanding of the "canonical" kanji situation is that they've never been a priority in [exp] entries. See for example Robin's comment on entry 2855594, 「そこに山があるから」:

We don't usually include rare kanji forms for words like そこ on exp entries, even if the kokugos do.

So before we had [sK] tags to work with (August 2022), these rare kanji expressions just wouldn't be added at all. It would be pretty inconvenient and messy if we added 其処 to all expressions with そこ, 有る to all entries with ある, etc.

But in the case of 「思う壺に嵌る」, 「嵌る」 is in ~8% of the n-gram counts and isn't exactly rare. I think ideally we would just display something like [思う(壺)に(嵌る)], so any other solution is going to be less than ideal IMO. However, I'll note that there's only one entry for 「はまる」 in JMdict, so there's no ambiguity that this is a form of 「嵌る」. I see that the entry for 「思う壺」 in gg5 has 「嵌る」 written in kana in its example sentence as well.

robinjmdict commented 1 year ago

As Stephen says, we tend to omit (or hide) rare kanji forms on [exp] entries to reduce clutter. But I agree that 嵌る is too common (in this expression) for sK. We don't need it on multiple forms but it should be visible somewhere.

As an aside, it's interesting that despite being an [io] (irregular okurigana) form, 嵌る has significantly higher n-gram counts than the standard form.

嵌る 62679
嵌まる 12104
思う壺に嵌まっ 38
思う壺に嵌っ 228

JMdictProject commented 1 year ago

For expressions I'm not convinced we need to include "canonical" kanji forms if they are quite rare. In the entry in question the main form with 嵌る is a bit borderline on frequency but probably worth including. I have just made it [rK] which I think sends the right signal.

I doubt it's worth trying to reflect this in the policy; it should be a case-by-case consideration.

briankrznarich commented 1 year ago

@stephenmk [思う(壺)] is interesting as a presentation format. I wonder what it would take to get downstream systems to implement this.

For my sanseido/sankoku dictionary, the index says of the parentheses: かっこの中は仮名書きにして(も)い。 So, this looks somewhat weaker than our "usually kana", and more like a "kana would be fine if you prefer" annotation. The x = non-jouyou, ▽ = on the jouyou list, but not with the given 音読み. No apparent annotations for rarity, except the relative ordering of the entries.

@robinjmdict That is interesting. sanseidou also only gives 嵌まる (and 塡まる), and there is a 嵌める・嵌まる pair, so this is especially odd.

I think that the answer to many questions on modern statistical distributions, and on kanji popularity, is "what does the IME do?" If I type はまる or はまって on macOS, Windows 10, or Android, 嵌る and 嵌って both show up before 嵌まる and 嵌まって. If a kanji is in the IME, it has the potential to be re-popularized by Japanese people who like to include kanji. They couldn't hand-write them, probably aren't even imagining "嵌" when they think "思う壺に嵌る", but if it's in the popup list, why not? : ) Might as well look educated.

It does seem the intransitive form is significantly more common than the transitive form; perhaps that plays into it. I'll add to your n-grams:

嵌って 165722
嵌まって 28894
嵌めて 17842

にはまって 1587478
には待って 11032 (likely confounding form in hiragana)
をはめて 100959

にハマって 735923
をハメて 2781

briankrznarich commented 1 year ago

I don't have an inherent issue with this being a case-by-case consideration. I first tried to remove this one [sK] annotation as a one-off request, but Marcus (fairly) thought this was a policy question that needed some discussion here. What we can hopefully avoid is a scenario where every [sK] consideration requires a council meeting to resolve : )

I will say that in my relatively long experience with this dictionary, it is exceedingly rare to have an entry, even an [exp] entry, with no kanji form where at least one is applicable, which is why I phrased my original comment as I did. From outward appearances, this already seems to be the effective policy of jmdict.

If you do an advanced search for the "[exp]" tag and just scroll around, it's nearly impossible to find a kanji-free entry.

This is why I was motivated to add 嵌る, because these are hardly ever missing, and why I have lobbied against [sK] for it.

Since [sK] is new, and it's just starting to be applied to existing entries, I would be sad to see a lot of kanji information "fall out" of the dictionary in the name of reducing clutter. That's all.

====

As an aside to something I missed earlier, I don't think that ある・有る, ない・無い etc. are really fair as a comparison to something like 嵌る. I've seen these come up in prior discussion about [sK] policy. Those are so fundamental to the language that any basic student will quickly learn they are interchangeable (or for そこ 其処, they don't need to be concerned about it), and including their variants unconditionally would destroy the dictionary.

JMdictProject commented 1 year ago

Quiet for 3 months. Time to close.

rampaa commented 10 months ago

Just a heads-up in case anyone cares, JL also started to handle search-only forms as suggested in https://www.edrdg.org/wiki/index.php/Editorial_policy#Search-only_Forms.

JMdictProject commented 10 months ago

Thanks for passing that on.