JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Uncommon 旧漢字: oK, rK or sK? #77

Closed JMdictProject closed 1 year ago

JMdictProject commented 1 year ago

A contributor has just submitted a lot (100+) of 旧漢字 forms, many of which are for city/prefecture/etc. names. He has tagged them all as "oK". It would be good to have a consistent way of handling them. I sampled a few and although they all seemed valid, none were in 国語辞典. The n-gram counts are quite low.

In the past we would have stayed with "oK", which is correct, but I think we'd be more likely to use "sK" now, as they come under the list of candidates for that tag. Is that an agreed approach?

[If we're not using "oK" in cases like these, it does raise the question of when they are to be used. I suspect that with the arrival of "rK" and "sK" the role of "oK" will be largely confined to forms such as 合氣道 and the like.]

robinjmdict commented 1 year ago

Up to now, our policy has been not to include 旧字体 forms unless they're somewhat common, in which case they're tagged as oK.

Any 旧字体 that aren't sufficiently common can be tagged as sK. But we need to decide on a threshold, which is a discussion currently being had on #75. I suggested <10% of total n-gram counts but that's probably too high, at least for 旧字体. How about <3%?

Ideally, I think this is something that should be handled by the websites and apps themselves so that 旧字体 are parsed as though they were 新字体. It's not a good use of anyone's time to be manually adding and approving obscure, old kanji forms on thousands of entries.
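
As a rough illustration of what "parsed as though they were 新字体" could mean on the app side, here is a minimal sketch. The mapping table is a tiny hand-picked sample and the names `KYUJITAI_TO_SHINJITAI` and `normalize` are hypothetical; a real app would need a much larger table.

```python
# Hypothetical sample of 旧字体 -> 新字体 pairs (assumption: a real table
# would hold several hundred entries, not four).
KYUJITAI_TO_SHINJITAI = {
    "佛": "仏",
    "國": "国",
    "氣": "気",
    "辭": "辞",
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(KYUJITAI_TO_SHINJITAI)


def normalize(key: str) -> str:
    """Replace every known 旧字体 character in a lookup key with its 新字体 form."""
    return key.translate(_TABLE)


if __name__ == "__main__":
    for key in ("佛國", "合氣道", "電子辭書"):
        print(key, "->", normalize(key))
```

With a step like this in front of the dictionary search, a query for 合氣道 would reach the 合気道 entry even if the old form were never added to it.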

stephenmk commented 1 year ago

Something like 「電子辭書」 looks like an anachronism to me, although to my surprise it gets over 100 n-gram counts. Wiktionary actually displays it as an alternate form.

> I suggested <10% of total n-gram counts but that's probably too high, at least for 旧字体. How about <3%?

This sounds good to me too.

> Ideally, I think this is something that should be handled by the websites and apps themselves so that 旧字体 are parsed as though they were 新字体. It's not a good use of anyone's time to be manually adding and approving obscure, old kanji forms on thousands of entries.

I agree, but maybe we could add these forms and tag them as [sK] with a bulk update. That would save app developers from having to individually re-implement this functionality themselves. But if we do want to do this, it might be better to wait for more apps to adjust to these [sK] forms.

stephenmk commented 1 year ago

On the other hand, this might end up being really messy. Consider 仏国, where both kanji have old versions.

form counts perc.
仏国 26,457 91.2%
佛國 1,475 5.1%
佛国 859 3.0%
仏國 208 0.7%

To make matters more confusing, we have two entries for 仏国 ("France" and "Buddhist country").
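
To see how the <3% cut-off suggested earlier in the thread would treat the counts in the table above, here is a small sketch. The function name `classify` and the rule "at or above the threshold is oK, below it is sK" are my reading of the proposal, not settled policy.

```python
# n-gram counts copied from the table above.
COUNTS = {"仏国": 26457, "佛國": 1475, "佛国": 859, "仏國": 208}

# Assumption: the proposed cut-off, as a fraction of the total count.
THRESHOLD = 0.03


def classify(counts: dict[str, int], main_form: str,
             threshold: float = THRESHOLD) -> dict[str, str]:
    """Tag each non-main form "oK" or "sK" by its share of the total count."""
    total = sum(counts.values())
    tags = {}
    for form, count in counts.items():
        if form == main_form:
            continue  # the regular form carries no oK/sK tag
        tags[form] = "sK" if count / total < threshold else "oK"
    return tags


if __name__ == "__main__":
    total = sum(COUNTS.values())
    for form, tag in classify(COUNTS, "仏国").items():
        print(f"{form}\t{COUNTS[form] / total:.2%}\t{tag}")
```

Under these assumptions 佛國 comes out as oK and the other two as sK. Note that 佛国 shows as 3.0% in the table only because of rounding; its unrounded share is about 2.96%, so it falls just under a strict <3% threshold. The counts are also per surface form, so they cannot be split between the two 仏国 entries.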

JMdictProject commented 1 year ago

A 3% general oK/sK threshold is fine with me. Many of the proposed 旧字体 forms, while valid and to be found in the wild, are in that sK range.

And a bulk update to add 旧字体 forms is possible. It's a bit hard to do them on the fly at the app level - you'd have to inspect every kanji in every lookup key to see if there were alternatives and search on those combinations too.
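
For a sense of what that per-lookup expansion involves, here is a minimal sketch. The `ALTERNATIVES` table is a two-entry hypothetical sample and `variants` is an illustrative name, not part of any existing tool.

```python
from itertools import product

# Hypothetical sample: 新字体 character -> its known 旧字体 alternatives.
ALTERNATIVES = {
    "仏": ["佛"],
    "国": ["國"],
}


def variants(key: str) -> list[str]:
    """Return every spelling of `key` obtained by swapping in alternative kanji."""
    # For each character, the candidates are the character itself plus
    # any alternatives listed for it.
    choices = [[ch] + ALTERNATIVES.get(ch, []) for ch in key]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    print(variants("仏国"))
```

For 仏国 this yields the four forms from the table above. The number of candidates doubles with each kanji that has one alternative, which is part of why doing this on every lookup is awkward.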

robinjmdict commented 1 year ago

> It's a bit hard to do them on the fly at the app level

10ten does it successfully but yes, expecting all sites/apps to implement this is probably asking a bit much.

Adding them with a bulk update is certainly an option. Whether or not we do this, I think we should contact Cuyler (the contributor of all these 旧字体 forms) via email to tell him that he doesn't need to add any more (or at least not the obscure ones).

JMdictProject commented 1 year ago

I am in contact with Cuyler and have asked him to compile a list of regular/旧字体 pairs for consideration for bulk update rather than put them in through the interface.

JMdictProject commented 1 year ago

Whoops. I've just noticed that the bulk updater can't add/edit the kinf and rinf fields. I'll ask Stuart if that's a simple addition, but I may need to look for a workaround.

yamagoya commented 1 year ago

I successfully added some trial code for kinf/rinf support to the bulk update tool. I still have to update the doc, do better error handling, etc., but I should have something in a day or two (or three) if that works?

JMdictProject commented 1 year ago

Stuart has provided the code and database updates to add kinf/rinf support to the bulk update tool. It's installed and tested.

If anyone has significant numbers of potential sK/sk forms to add, please feel free to send them to me.

JMdictProject commented 1 year ago

This one has been quiet for a while. I'll just remind people that these forms can be added in bulk if needed - just send them to me. I'll close it for now.