Making Japanese-only initialisms more searchable.

JMdictProject commented 1 year ago

In the JK/ジェー・ケー entry (https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=2739180) the issue was raised that the entry could not be found by searching for "JK" in the English meaning fields. This is true, as JK is not an English initialism. There are several like that: 1LDK, LDK, hoge, etc. In my comment I said "Better that apps and sites that use the data take on the issue of searching for these sorts of strings."

In recent days I did a proof-of-concept extension to the WWWJDIC server to handle these sorts of cases. I got the extension working, but I haven't released it for a few reasons:

I ran into problems with the indexing of the 全角 (full-width) alphanumerics. When I devised the indexing system it uses (several decades ago) I made the alphabetics and kana match against their equivalents (かな matches against カナ) but didn't do the same for the 全角 alphabetics (1ldk doesn't match 1LDK). I was able to tweak the indexing, but implementing that change all over the planet is a bit scary.
The operation of the extension is messy. If you search for "hoge" you could be searching for an actual English term as well as the Japanese initialism. That creates a fork in the search process, and it is difficult to display the results of both branches without ending up with a mess.
In fact, only a very few entries have this issue. The vast majority of entries with 全角 alphanumeric headwords also have the string in the Meanings fields (AIDS, BBC, etc.). I could only find a handful of entries, such as "1LDK", where this was not the case.
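
The equivalence matching described in the first point can be sketched as a key-folding step applied to both the indexed headwords and the query. This is only an illustration, not the WWWJDIC indexing code: the function name `fold_key` is hypothetical, and Unicode NFKC normalisation is swapped in as an assumed way of mapping 全角 alphanumerics onto ASCII (it also folds other compatibility characters, which a real index might not want).

```python
import unicodedata


def fold_key(text: str) -> str:
    """Return a folded search key (hypothetical helper, not WWWJDIC code)."""
    # NFKC maps full-width alphanumerics such as １ＬＤＫ to ASCII 1LDK;
    # casefold() then removes the upper/lower case distinction.
    text = unicodedata.normalize("NFKC", text).casefold()
    # Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana
    # counterparts, so shifting them down makes カナ match かな.
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


print(fold_key("１ＬＤＫ") == fold_key("1ldk"))
print(fold_key("カナ") == fold_key("かな"))
```

Folding both sides to one key is what makes the change invasive: every existing index built with the old rules would have to be regenerated, which is the rollout concern raised above.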

On reflection, I think maybe the best solution is to put faux "meanings" such as "1LDK" into the entries and avoid all the complexity of trying to have special software for extreme edge cases.

Marcusjmdict commented 1 year ago

I don't have the time right now to properly explain my position but I just want to voice that I'm not at all a fan of this solution. Having "JK" as a gloss in the ＪＫ entry is not only awkward but I would even call it incorrect, because the implication is that "JK" is actually used in English that way. We have a lot of Japanese users too (through Weblio etc.) so I really don't think this works.

JMdictProject commented 1 year ago

On reflection, faux "meanings" really shouldn't be needed. In the case of WWWJDIC, I realise I have a relatively simple way of making 全角 alphanumeric headwords searchable as though they were regular non-Japanese text so I'll take that path instead.

It still may be an issue for other platforms, of course.

JMdictProject commented 1 year ago

I realise I have a relatively simple way of making 全角 alphanumeric headwords searchable as though they were regular non-Japanese text

That's now working, e.g. https://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1MDE2dk

JMdictProject commented 1 year ago

Since I fixed the issue within WWWJDIC I think I've demonstrated that the dictionary clients can deal with it, and I'll close the issue.

In closing, I'll mention how I did it. WWWJDIC uses the "edict2" format of the dictionary entries, so for the 2DK entry it has "２ＤＫ [にディーケー] /(n) two rooms and a combination dining-kitchen/". At the end of these entries I add some extra fields that the server uses but which are not visible to users. For the entry in question the actual contents are "２ＤＫ [にディーケー] /(n) two rooms and a combination dining-kitchen/EntL2832153 SrcH 2dk/". The "EntL" field enables the server to link to the database for edits, etc., and the "SrcH" field contains non-visible keys which can be searched for. I create the "SrcH" keys for the edict2 version used by WWWJDIC when the first headword is entirely 全角 alphanumeric.
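
For other clients that want to do something similar, here is a minimal sketch of generating such a key. It is not the WWWJDIC code. It assumes the unextended edict2 line ends with the "EntL" field followed by "/", that headwords are separated by ";" and may carry a parenthesised tag, and that the key is the lower-cased half-width form of the headword (as "2dk" is above). The function name `add_search_key` is made up for the example.

```python
import re
import unicodedata

# Full-width digits (U+FF10-FF19) and Latin letters (U+FF21-FF3A, U+FF41-FF5A).
FULLWIDTH_ALNUM = re.compile(r"[\uff10-\uff19\uff21-\uff3a\uff41-\uff5a]+")


def add_search_key(edict2_line: str) -> str:
    """Append a hidden 'SrcH' key when the first headword is all full-width
    alphanumerics; otherwise return the line unchanged (illustrative only)."""
    # The first headword ends at the first ';', whitespace, '[' or '('.
    first_headword = re.split(r"[;\s\[(]", edict2_line, maxsplit=1)[0]
    if not FULLWIDTH_ALNUM.fullmatch(first_headword):
        return edict2_line
    if not edict2_line.endswith("/"):
        return edict2_line  # not the layout this sketch assumes
    # NFKC turns ２ＤＫ into 2DK; lower-casing gives the key 2dk.
    key = unicodedata.normalize("NFKC", first_headword).lower()
    # Insert the key inside the final field, before the closing '/'.
    return f"{edict2_line[:-1]} SrcH {key}/"


line = "２ＤＫ [にディーケー] /(n) two rooms and a combination dining-kitchen/EntL2832153/"
print(add_search_key(line))
```

Because the key lives in a field the display layer ignores, the visible glosses stay free of faux "meanings" while a plain ASCII search for "2dk" can still reach the entry.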

JMdictProject / JMdictIssues

Making Japanese-only initialisms more searchable. #95