JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Uncommon kokugo-only terms #97

Open JMdictProject opened 11 months ago

JMdictProject commented 11 months ago

I would like to clarify the policy regarding potential lexical items which are comparatively rare and only appear in the larger 国語辞典. These often appear in submissions and there is occasionally discussion as to whether they are appropriate or needed as entries. A recent such discussion can be found for the proposed 献芹 entry (https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=2858254) and an entry has just been proposed for 社会実装 which is in 大辞泉.

Published (paper) dictionaries, especially bi/multi-language ones, traditionally limit entries to the more common terms in the interest of containing the size of the dictionary and maintaining a focus on the usefulness of the collection with regard to contemporary language, etc. In a purely electronic database such as JMdict, provided appropriate indicators are included for rare, archaic, historical, etc. terms, there is probably no real reason for excluding obscure terms, as long as they are attested in reputable sources.

I propose that we accept as entries any terms which appear in reputable sources such as the major 国語辞典, and ensure that where appropriate the relevant [rare], [arch], [obs], [hist], etc. tags are included to indicate their role (if any) in modern Japanese.

stephenmk commented 9 months ago

I have come across many words used in popular media (TV, games, comics, etc.) which could only be found in the larger dictionaries and return low n-gram counts. So I don't doubt that these entries can be useful to people.

I do think it would be helpful if submissions for such words (like 献芹) would include the context in which the submitter encountered the word. This would help in two ways:

  1. Provide evidence that we are glossing words correctly for the contexts in which they are used.
  2. Provide evidence that the entry will actually be useful for someone else in the future.

I don't think it's advisable to go searching the dictionaries for extremely rare words to add to JMdict. Ensuring that new glossaries contain accurate glosses and appropriate tags can require a lot of time and effort.

(I'm aware that I'm always submitting a lot of rare kanji forms for entries that might never be useful for anyone, but in my defense I think the overhead required for those additions is comparatively low)

When we receive submissions for words like 鳥卵学者, 日光反射信号, and 労働紹介所, I wonder if the user actually encountered the words somewhere or if they're just perusing GG5 for new words to add.


The new submission for 台胴 illustrates my point about new glossaries requiring time and effort. The gloss "dado" by itself was completely inaccurate.

briankrznarich commented 8 months ago

I know this isn't how the project envisions jmdict being used, but I use it heavily for kanji-related vocabulary mining, i.e. when I encounter a word with an unfamiliar kanji, I look for related words to learn together (helpful for me). For some terms, only obscure-ish words exist, but for reinforcement purposes I grab the best of what is available.

It would actually be useful if there were a tag indicating that a term is "almost completely useless", or if we could stack [rare] with other tags. Currently [arch], [obs], [form], and [hist] don't admit [rare], yet none of these is necessarily [rare]. [hist] of course implies current use, [form] can be exceedingly common, and [arch] covers the equivalents of English words like "thee" and "thou", which are used heavily in period dramas and knights-and-magic type settings. [obs] can include a term that was common 100 years ago (and possibly exists in literature) but has since been replaced.

As a policy we currently seem to favor [form] over [rare] if applicable, so there is no effective distinction between super-rare and super-common formal terms.
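To make the information loss concrete, here is a minimal Python sketch of the practice as I understand it. The function name and the tag set are hypothetical, based only on my description above; this is not actual JMdict or jmdictdb code.

```python
# Hypothetical illustration of the tagging practice described above.
# The tag set and function are assumptions, not actual JMdict code.
TAGS_THAT_DISPLACE_RARE = {"arch", "obs", "form", "hist"}


def collapse_tags(tags):
    """Drop "rare" whenever a tag that does not admit it is also present."""
    result = set(tags)
    if "rare" in result and result & TAGS_THAT_DISPLACE_RARE:
        result.discard("rare")
    return result


# A super-rare formal term and a super-common formal term end up tagged identically.
ultra_rare_formal = collapse_tags({"form", "rare"})
common_formal = collapse_tags({"form"})
```

Under these assumptions both results are just {"form"}, so a reader of the entry cannot tell the two cases apart.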

Similar to @stephenmk's comment, it can be frustrating to encounter a term with a suspicious gloss (based on the kanji meanings or similar terms), look at the jmdict history, and see that the source is either (1) a word-for-word translation from nikk or daijs, or (2) just "gg5". Then the word is nowhere on the internet, has zero n-grams, and appears in maybe a handful of searchable texts. There is no context, and it is impossible (or difficult, as with 台胴) to verify the gloss ourselves.

It would be great to require these ultra-rare submissions to have even a single in-the-wild encounter to accompany the submission. At least there would be some context to look back at. We generally provide glosses, not definitions. Translating a daijs entry does not yield a reliable gloss on its own (and it seems like gg5 must have done this too in some cases).

These days I do n-gram searches for everything before adopting. I try to tag things as well as I can so that someone like me in the future can save themselves some effort, but there are limits to what can be done with the current tags. I am not at all against jmdict being a compendium of all knowledge of Japanese vocab, however obscure. But [rare], [arch], [obs], [hist], and [form] do not seem adequate to single out these ultra-rare kokugo-only terms. (台胴 isn't that rare by comparison to some others.)

JMdictProject commented 8 months ago

Welcome to the world of comprehensive lexicography. It's always a challenge finding a balance between usefulness and wide coverage.

"Provide evidence that the entry will actually be useful for someone else in the future." (SK)

I'm busy organizing a time machine for collecting such evidence. Seriously though, I'd rather include a valid rare term with appropriate tagging than omit it altogether.

"But [rare], [arch], [obs], [hist], and [form] do not seem adequate to single out these ultra-rare kokugo-only terms." (BK)

I don't really agree. As I wrote at the start of the issue: "ensure that where appropriate the relevant [rare], [arch], [obs], [hist], etc. tags are included to indicate their role (if any) in modern Japanese." If the appropriate tags are missing, they need to be added.

stephenmk commented 8 months ago

"I'm busy organizing a time machine for collecting such evidence."

I don't believe we need a time machine to collect such evidence. If a Japanese learner organically comes across a word that they don't understand, if they cannot find it in JMdict, and if they then take the time to propose an entry for it, I think we can say with confidence that it is bound to be useful for other users eventually. The same cannot be said for words chosen at random from an old encyclopedia, which is precisely where a lot of those anonymous submissions were coming from ("操縦輪," "玉輪," "車輪窓," "輪癬," "輪講会"... someone just searched for "輪" on weblio).

I am not saying that these sorts of extremely rare / obsolete / archaic words should never be included. I am saying that people who want to submit such words should be expected to put in a bare minimum amount of effort to support their proposals. This could mean describing the context in which they encountered the word, or it could be some brief research demonstrating how the word is used in modern contexts. I don't think "gg5" alone suffices.

"Seriously though, I'd rather include a valid rare term with appropriate tagging than omit it altogether."

Whether or not the term is "valid" is the hard part. I think it would be better to omit a term rather than to add an insufficiently researched entry that is liable to cause confusion or spread misinformation. Does "輪講会" still have the same meaning now as it did 100 years ago when Saitō translated it as "a reading society?" Does it have a more specific nuance? I don't know, and it would be nice to have some reassurance.

When these extremely rare terms are chosen at random from old resources and submitted at face value, we're bound to end up with insidiously incorrect submissions like the one for "台胴." I think it's dangerous to wave in entries like this.

At any rate, I don't recall seeing any of these types of submissions recently, so for the time being it seems this is all moot.

briankrznarich commented 8 months ago

On @stephenmk's last comment, I'll just say it mirrors my experience and thoughts exactly. I couldn't put it better.

As to: "I don't really agree. As I wrote at the start of the issue: 'ensure that where appropriate the relevant [rare], [arch], [obs], [hist], etc. tags are included to indicate their role (if any) in modern Japanese.'"

Can I get confirmation on what you mean here?

I don't have the entry, but I could swear I had a few [form][rare] tags simplified to [form] with a comment that we don't do [rare][form], so I stopped.

Should most [arch] tags be double-tagged as [rare]? I'm certain that's not currently the case. It might be nice if we could tag [arch] forms as "still in modern use for historic flavoring" somehow, rather than the opposite.

I tagged 疎食 as [rare], and after getting an explicit comment acknowledging its rarity, it was flipped to [arch]. Here's another I marked as [rare]: 蔬食. First I got a well-considered (with refs) "Yes, rare." Then I got "Surely archaic." with [rare]->[arch].

So in my experience [arch] says very little about the frequency of modern usage. (Until 卿 I was under the impression that [arch] implied [rare], i.e. "not in modern use", but I now see that this is not the case.)

[form] also says nothing about frequency.

That's an issue of its own, but it is my rationale for why adding even more ultra-rare vocab gradually makes [form], [arch], etc. less and less likely to indicate that the term has any actual use in modern Japan (literary or otherwise).

(In many ways, it would seem better if [rare] trumped all other tags, even [arch], which would leave [arch] for only terms a modern Japanese person might recognize. But that would seem to be a huge shift from the current tagging)

stephenmk commented 8 months ago

I agree that 卿 is a clearly distinct kind of "archaic" from the 疎食 kind of "archaic." It would be helpful to have some kind of distinction in the tagging.

Since the original submission for 疎食 only included a link to the nikkoku entry on kotobank, I can only assume that this wasn't a word that was encountered "in the wild." This is the kind of submission that I argue should be summarily rejected. It's doubtful that these words will ever be useful to anyone; half of the time they get glossed incorrectly ("vegetarian meal" vs "humble fare"), and they just end up wasting everyone's time and attention.


The recent edits on the entry for 妖気 led me to some old discussions in jmdictdb on this topic. It looks like everything I've said on this issue has been said before.

JMdictProject commented 7 months ago

When I opened this issue I hoped we could get agreement that low-frequency terms which only appear in one or two kokugos could be given appropriate tags and admitted without debate. It would "draw a line in the sand" and save a lot of arguing. It seems we can't get that agreement.

In a case-by-case evaluation, I guess I'll be looking for an n-gram count of, say, 30+ and appearance in at least one kokugo other than Nikkoku.
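As a rough illustration of that rule of thumb, here is a minimal Python sketch. The function name, the source labels such as "nikk" and "daijs", and the constant names are hypothetical. Only the 30+ n-gram count and the requirement of at least one kokugo other than Nikkoku come from the comment above, and the threshold itself is a "say" figure rather than a fixed policy.

```python
# Hypothetical sketch of the case-by-case check; names and labels are assumptions.
MIN_NGRAM_COUNT = 30  # the "say, 30+" figure, not a settled threshold
NIKKOKU = "nikk"      # illustrative label for Nikkoku


def passes_case_by_case_check(ngram_count, kokugo_sources):
    """Return True if a term meets both conditions.

    ngram_count: n-gram frequency count for the term.
    kokugo_sources: labels of the kokugo dictionaries that list the term.
    """
    # Nikkoku alone does not count; at least one other kokugo must list the term.
    other_kokugo = {source for source in kokugo_sources if source != NIKKOKU}
    return ngram_count >= MIN_NGRAM_COUNT and len(other_kokugo) >= 1
```

For example, a term with an n-gram count of 45 that appears in both "nikk" and "daijs" would pass, while a term that appears only in "nikk" would not, however high its count.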