JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Including symbols/punctuation as headwords in JMDict #117

Closed melink14 closed 5 months ago

melink14 commented 10 months ago

Hi all,

When updating the dictionaries in rikaikun this week, I noticed that the 〜 headword had been removed (mostly because it's one of the first entries, so it was near the top).

Checking the history for the item (harder than expected since the headword was removed), I found it was removed intentionally after some discussion on whether symbols belong in the kanji field.

The argument against them is captured in Stephen's final comment:

I don't think they're useful as dictionary entries. On the contrary, they seem confusing and sometimes plainly incorrect ("仝" as "どうじょう" looks very suspicious).

にょろ is one possible name among many (波線符号, チルダ, 波ダッシュ). GG5 has a 電算俗 tag in its entry for にょろ, so it sounds like it's actually a somewhat slangy name. The wiktionary entry for the symbol doesn't even mention にょろ, but it does say that the symbol is sometimes read as 「から」.

More generally, I don't think it makes sense to have names as "readings" for symbols. For example, we don't have 「🔰」 in a kanji field alongside 初心者マーク. And 「おのづくり」 is a name for the right-sided 斤 radical, but we have 斤旁 as the kanji form and not a 「斤」 symbol.

I think the solution of removing the symbol as a headword is a net negative since as Jim said, there is value in having them searchable to people who don't know what they mean. I would actually argue for adding more symbols to the dictionary rather than using missing symbols as a reason that we shouldn't have the ones there now.

From my reading of the discussion, the main concern is not with having symbols as headwords but with using their names as readings. If that's the case, I would suggest moving the names to the meaning and using the reading for how the symbol would be pronounced if encountered in text while reading aloud. In many cases, that would mean no reading, but not always.

I think there's a lot of opportunity to add more detailed usage and meaning information to symbols, which would likely confuse someone encountering them for the first time, so hopefully we can find a good solution other than removing their searchability.

stephenmk commented 10 months ago

I would actually argue for adding more symbols to the dictionary rather than using missing symbols as a reason that we shouldn't have the ones there now.

My point was that we wouldn't add 「🔰」 to the kanji fields of our entries for 初心者マーク, 若葉マーク, and 初心運転者標識, because that would clearly be contrary to our usual style. We removed 「※」 as a kanji form from the entry for こめじるし in 2017 and 「・」 from the entry for なかぐろ in 2019 for the same reason. In the comments for the latter, Marcus mentioned the text glossing functionality of rikaikun as a good reason to prevent the symbol from having its name recorded as a reading.

As I said in my comment, 「〜」 doesn't deserve to be in the entry for にょろ any more than the entries for 波線符号, チルダ, or 波ダッシュ. Allowing people to search for 「〜」 and showing them "にょろ" as its reading is certainly a net negative. "にょろ" is neither the usual name for the symbol nor how it is usually read.

If that's the case, I would suggest moving the names to the meaning and using the reading for how the symbol would be pronounced if encountered in text while reading aloud. In many cases, that would mean no reading but not always.

I don't think anyone is against having a broad coverage of symbols in JMdict. My understanding of the situation is that it's not currently feasible to have the symbols recorded without a reading (although I may be mistaken). If it is in fact technically feasible, I would definitely be in favor of having entries for these symbols as well.

melink14 commented 10 months ago

Thanks Stephen, I think I'm aligned with that (sorry if I misrepresented your points) but didn't realize it was perhaps infeasible to have the symbols recorded without a reading.

Maybe as a workaround we could use a placeholder reading in these cases to denote that there is no reading. (and perhaps being explicit about the 'no reading' case is better since it clearly distinguishes it from a mistake.)

If the reading has to be kana, that would perhaps be difficult as well though...
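To make the placeholder idea concrete, here is a minimal sketch in Python. It is only an illustration: the `SymbolEntry` class, its field names, and the `NO_READING` marker are all hypothetical and do not reflect the JMdict schema or any existing tool. The から reading is used purely because it came up earlier in this thread.

```python
from dataclasses import dataclass, field

# Hypothetical placeholder shown when an entry has no reading.
# This is an assumption for illustration, not a JMdict convention.
NO_READING = "(no reading)"


@dataclass
class SymbolEntry:
    """Hypothetical model of a symbol entry; not the JMdict schema."""
    headword: str                                  # the symbol itself
    glosses: list                                  # names/explanations live in the meaning
    readings: list = field(default_factory=list)   # may legitimately be empty

    def display_reading(self):
        # An explicit marker distinguishes "no reading" from missing data.
        return "、".join(self.readings) if self.readings else NO_READING


# A symbol with no spoken reading: its names go into the glosses.
wave_dash = SymbolEntry("\u301c", ["wave dash (波ダッシュ, チルダ)"])
print(wave_dash.display_reading())

# A symbol for which a reading is recorded (illustrative only).
with_reading = SymbolEntry("\u301c", ["from ... to ..."], ["から"])
print(with_reading.display_reading())
```

Keeping the marker outside the kana range would sidestep the "reading has to be kana" problem only if consuming tools treat it as a sentinel rather than as text to be read.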

stephenmk commented 10 months ago

It's good that you brought this up, so thanks for taking the time.

By my count we're currently using these character ranges for readings in the JMdict file.

| Characters | Unicode Range |
| --- | --- |
| 〜 | 0x301c |
| ぁ - ん | 0x3041-0x3093 |
| ゝ - ゞ | 0x309d-0x309e |
| ァ - ヴ | 0x30a1-0x30f4 |
| ヶ | 0x30f6 |
| ・ - ヾ | 0x30fb-0x30fe |
| ～ | 0xff5e |
| ﾀ | 0xff80 |
| ﾋ | 0xff8b |

I've only been assuming that the reading fields couldn't support symbols, but the odd half-width characters make me wonder. It will be interesting to hear what Jim or Stuart have to say.
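For anyone who wants to check a reading against the ranges listed above, a rough Python sketch follows. The range values are copied from the table; the function names are made up for this example, and this is not the script used to produce the count.

```python
# Code point ranges copied from the table above (inclusive on both ends).
READING_RANGES = [
    (0x301C, 0x301C),
    (0x3041, 0x3093),
    (0x309D, 0x309E),
    (0x30A1, 0x30F4),
    (0x30F6, 0x30F6),
    (0x30FB, 0x30FE),
    (0xFF5E, 0xFF5E),
    (0xFF80, 0xFF80),
    (0xFF8B, 0xFF8B),
]


def in_reading_ranges(ch):
    """Return True if the character falls inside one of the listed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in READING_RANGES)


def unexpected_chars(reading):
    """Return the characters of a reading that fall outside the listed ranges."""
    return [ch for ch in reading if not in_reading_ranges(ch)]


print(unexpected_chars("にょろ"))    # kana only
print(unexpected_chars("\u25b3"))    # △ (U+25B3) is outside every listed range
```

Running something like this over every reading in the file would show whether a new symbol reading such as △ widens the set of characters that downstream tools need to handle.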

yamagoya commented 10 months ago

I can't speak for the broader ecosystem, but the JMdictDB database and software is pretty liberal in accepting any or no readings. It has to be, because it is intended to be able to load Kanjidic and Tatoeba data locally; some of the former and all of the latter have no readings. However, the code expects readings for JMdict entries (and probably JMnedict entries; I don't recall at the moment) and tries to distinguish between reading and kanji in various places, e.g. the web Search page for text when an explicit search type (reading/kanji/gloss) is not selected. This can be adjusted if needed.
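To illustrate why symbols are awkward for that kind of automatic detection, here is a hypothetical heuristic in Python. It is a sketch only and is not JMdictDB's actual logic; the function name and the specific code point ranges (the Hiragana and Katakana blocks, and the basic CJK Unified Ideographs block) are assumptions chosen for the example.

```python
def guess_search_type(text):
    """Hypothetical heuristic for guessing a search type; NOT JMdictDB's code."""

    def is_kana(ch):
        # Hiragana block (U+3041-U+309F) or Katakana block (U+30A0-U+30FF).
        cp = ord(ch)
        return 0x3041 <= cp <= 0x309F or 0x30A0 <= cp <= 0x30FF

    def is_kanji(ch):
        # Basic CJK Unified Ideographs block only.
        return 0x4E00 <= ord(ch) <= 0x9FFF

    if text and all(is_kana(ch) for ch in text):
        return "reading"
    if any(is_kanji(ch) for ch in text):
        return "kanji"
    return "gloss"


print(guess_search_type("にょろ"))   # all kana
print(guess_search_type("斤旁"))     # contains kanji
print(guess_search_type("\u25b3"))   # △ is neither kana nor kanji
```

Under a heuristic like this, a bare symbol such as △ or 〜 is neither kana nor kanji and falls through to the gloss search, so a real implementation would need an explicit rule for symbol headwords or symbol readings.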

JMdictProject commented 10 months ago

Just noting that I've successfully added an entry with 〜 as the "reading": https://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1MDJ%A1%C1 https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=2859770 Seems OK. It's searchable.

The same could be done with △: https://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1MDJ%A2%A4 https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=2273890

melink14 commented 10 months ago

I saw your edit in today's dictionary update and at the same time noticed that the exact same thing had been done a while ago for ー: https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&e=1922323

Instead of using a cross reference it's using a note to denote the name though it seems a bit inaccurate. (I don't surface those now unfortunately but it's somewhere on my list)

For 〜, I wonder if we want から as a reading, since in prose it will often be read that way, but I'm not sure there's a good way to denote that it is not always read that way... (though I guess adding more senses beyond the visual one would come before that)

JMdictProject commented 10 months ago

There are quite a few entries for symbols like that: ゝ, ヾ, etc. I think they're useful. Perhaps ・ can go back in.

I don't think I'd read 〜 as から.

stephenmk commented 10 months ago

In 2017, Marcus wrote this comment on the entry for ゝ:

shouldn't we mark all these as [expl], for clarity? ゝ IS a repetition mark, it doesn't MEAN "a repetition mark."

This is an interesting point. Before I saw Marcus's comment, I amended our entry for ヽ to remove the [expl] tag because I didn't think its usage there was consistent with how we normally use the tag.

Naturally you cannot replace the phrase "ひらがな反復マーク" with the character "ゝ", although we might gloss them identically. For clarity, maybe we should have a new [symbol] part-of-speech tag instead of using the unclassified tag for symbol entries. This might be enough to get the point across.

JMdictProject commented 5 months ago

I think this can be closed now. We have quite a few symbols as entries with a POS of [unc] and explanatory meanings.