JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Alignment of nakaguro and commas between kanji and reading fields #122

Closed: stephenmk closed this issue 3 months ago

stephenmk commented 3 months ago

We currently have 46 active entries in JMdict which contain comma characters in the kanji field. None of the corresponding reading values have commas, which causes me some difficulty when analyzing JMdict data. 46 is such a small number that I can manually clean up my local copy of the data for my own purposes, but I'm wondering why we don't just add the commas to the main database.

For an entry like 聞くは一時の恥、聞かぬは一生の恥, the reading would be much easier to visually parse if it included a comma as "きくはいっときのはじ、きかぬはいっしょうのはじ." We could also add [sK] tags to the kanji forms without commas.

The same goes for entries with nakaguro characters. We have 118 active JMdict entries with those in the kanji field. I think it would be nicer to have the reading of アメリカ・スペイン戦争 be set to "アメリカ・スペインせんそう." The kanji form without the nakaguro could also be hidden.
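As a rough illustration of the kind of check involved, here is a minimal Python sketch that reports which punctuation marks appear in a kanji form but not in its reading. The function name `unmirrored_punctuation`, the `PUNCTUATION` set, and the hard-coded pairs are assumptions made for this example; it does not parse the actual JMdict XML.

```python
# Punctuation marks discussed in this thread (an illustrative set, not an official list).
PUNCTUATION = {"、", "・"}


def unmirrored_punctuation(kanji_form: str, reading: str) -> set[str]:
    """Return the punctuation marks present in the kanji form but absent from the reading."""
    return {ch for ch in PUNCTUATION if ch in kanji_form and ch not in reading}


# Hypothetical (kanji form, reading) pairs built from the examples in this thread.
pairs = [
    ("聞くは一時の恥、聞かぬは一生の恥", "きくはいっときのはじきかぬはいっしょうのはじ"),
    ("アメリカ・スペイン戦争", "アメリカスペインせんそう"),
    ("アメリカ・スペイン戦争", "アメリカ・スペインせんそう"),
]

for kanji_form, reading in pairs:
    # An empty list means the punctuation is already mirrored in the reading.
    print(kanji_form, sorted(unmirrored_punctuation(kanji_form, reading)))
```

Only the third pair would come back empty, since it is the only one whose reading carries the nakaguro.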

hlorenzi commented 3 months ago

It has also tripped up my furigana segmentation algorithm a few times, so algorithmically I think it would indeed be easier if any punctuation was mirrored in the reading field.

There's also this entry which might be the only one with a question mark.
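To illustrate why mirrored punctuation helps segmentation, here is a minimal Python sketch (not the actual furigana algorithm mentioned above) that splits both fields at punctuation and pairs the pieces, so each piece can be aligned on its own. The name `split_pairs` and the `BOUNDARY` character set are assumptions for this example.

```python
import re

# Punctuation treated as segment boundaries (an illustrative assumption).
BOUNDARY = re.compile("[、・？?]")


def split_pairs(kanji_form: str, reading: str):
    """Split both fields at punctuation and pair up the resulting pieces.

    Returns None when the punctuation is not mirrored, i.e. when the two
    fields do not contain the same marks in the same order.
    """
    if BOUNDARY.findall(kanji_form) != BOUNDARY.findall(reading):
        return None
    return list(zip(BOUNDARY.split(kanji_form), BOUNDARY.split(reading)))


print(split_pairs("聞くは一時の恥、聞かぬは一生の恥",
                  "きくはいっときのはじ、きかぬはいっしょうのはじ"))
print(split_pairs("聞くは一時の恥、聞かぬは一生の恥",
                  "きくはいっときのはじきかぬはいっしょうのはじ"))
```

With the comma present in the reading, the expression breaks into two shorter pairs; without it, the sketch gives up and the whole string has to be aligned in one pass.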

JMdictProject commented 3 months ago

I don't think punctuation characters such as commas or question marks should be included in the readings fields. They certainly serve a purpose in the kanji forms for expressions but I can't see them having a useful role in the readings. I think they'd add clutter and not really assist the vast majority of users.

I haven't written any software which aligns the components of a kanji field with those of a reading field, but if I did I'd simply ignore punctuation characters.

Nakaguro are a little different in that for kana-only terms they need to appear in the readings field. Where they are being used in a mixed kana/kanji form such as in アメリカ・スペイン戦争/アメリカスペイン戦争, I think just the one reading (アメリカスペインせんそう) is enough.

stephenmk commented 3 months ago

Robin wrote in issue #88 (Readings of initialisms):

But with the introduction of search-only forms in JMdict, clutter is no longer the concern it once was. Additionally, there is a benefit to using nakaguro in readings of initialisms: it makes them easier to read.

Surely the same rationale applies here.

Do users really prefer to see something like

きくはいっときのはじきかぬはいっしょうのはじ

rather than

きくはいっときのはじ、きかぬはいっしょうのはじ?

I don't understand how the latter is cluttered and doesn't assist users. To me it seems to be the exact opposite.

robinjmdict commented 3 months ago

Most (all?) entries containing commas and nakaguro also include a form without those characters. From a UI/UX perspective, is it a good idea for the displayed reading to contain commas/nakaguro even if the surface form being queried does not? I'm thinking about pop-up dictionary browser extensions or other tools that don't necessarily display the full entry.

Is there an argument that readings should only contain characters that are pronounced? This was my initial thought. Initialisms seem like a special case to me but I admit I might struggle to defend that position.

I'm not outright opposed to the suggestion. Several of the kokugos use hyphens to segment readings of compound nouns. (Although for something like 二・二六事件, the hyphen is placed between にろく and じけん, not the にs.)

stephenmk commented 3 months ago

I'd rather see 'かりるときのじぞうがおなすときのえんまがお' with a comma than without, even if I looked up the expression '借りる時の地蔵顔済す時の閻魔顔' without any commas. I think expressions that are this long are unlikely to be searched exactly verbatim in any case; users can probably only arrive at these entries via wildcard searches or collocation listings. I don't see a lot of value in keeping the non-comma forms visible, and for that reason I hid that one in particular about a year ago.

Practically every recently published kokugo (Koj 7e being maybe the only notable exception) displays readings inline with the long expressions rather than displaying them one after the other. E.g. Daijisen has 済(な)す時(とき)の閻魔顔(えんまがお). That seems to be the ideal solution from a UI/UX perspective. Jitenon displays the reading and kanji forms separately but has a comma in each one.

I also think having the comma makes sense from the perspective of data quality. It's much easier to strip out an unneeded character programmatically than to find the correct place to insert a missing character that is needed.
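The asymmetry can be sketched in a few lines of Python. The helper names `strip_commas` and `comma_candidates` are hypothetical and exist only for this illustration: removing a comma is a single string replacement, whereas a comma-free reading offers one candidate insertion point per character gap, with nothing in the string itself to say which is right.

```python
def strip_commas(reading: str) -> str:
    """Remove commas from a reading; no linguistic knowledge is needed."""
    return reading.replace("、", "")


def comma_candidates(reading: str) -> list[str]:
    """List every way to insert a single comma inside a comma-free reading."""
    return [reading[:i] + "、" + reading[i:] for i in range(1, len(reading))]


with_comma = "かりるときのじぞうがお、なすときのえんまがお"
without_comma = strip_commas(with_comma)

print(without_comma)
# Number of possible single-comma placements a tool would have to choose between.
print(len(comma_candidates(without_comma)))
```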

JMdictProject commented 3 months ago

If we look at 2844869 we have all but the first kanji form as [sK] and the solitary reading of かりるときのじぞうがおなすときのえんまがお. If we made that かりるときのじぞうがお、なすときのえんまがお instead, I guess there's no real reason to have a comma-free version as well?

Turning to 2836571 we have two kanji forms which are identical apart from the comma. Probably the second could/should be [sK], and if we go with commas in the reading, probably just the one reading would do.

As I've said, I don't think adding commas to the reading fields is really needed, but it's essentially harmless and if it makes some people more comfortable, I can live with it.

I see that about half of the 46 comma entries have that form first. Perhaps we should adopt a consistent style for such entries and have only the comma form visible, e.g.

stephenmk commented 3 months ago

"I guess there's no real reason to have a comma-free version as well?"

"Perhaps we should adopt a consistent style for such entries and have only the comma form visible"

That all sounds good to me. I think it would be nice to do the same for entries with nakaguro in the kanji forms as well, though I understand there's a technical reason for avoiding that (namely, cross reference formatting in the JMdict XML file).

I'd be happy to do the legwork of updating the entries with comma additions if nobody is opposed.

JMdictProject commented 3 months ago

OK, so what we can do for these expressions, proverbs, etc. which contain commas is:

I have done a worked example at https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=2708140

stephenmk commented 3 months ago

All the entries with commas have now been updated.