JMDict proper name policy suggestion

jimmubreen commented 4 years ago

This is a brief email thread from 2018 that encompassed some interesting and useful suggestions. I think it is best copied in here so it doesn't get lost.

Marcus Richert 24 Mar 2018

I would like to suggest that we change policy to allow for a lot more proper names in JMDict than we currently do. A lot of the entries in JMnedict are extremely helpful/useful, but the overall quality of the database is a little questionable. For whatever reason, most implementations of JMDict unfortunately don't include the JMnedict file, or only partially implement it, meaning its entries go mostly unseen by JMDict's end users. This seems like a big waste to me.

JMnedict's ~740,155 entries are categorized something like this (based on a count I did sometime in February):

male 19,919
surname 146,453
fem 109,196
product 422
given 61,643
unclass 131,397
organization 3,176
work 745
company 919
person 53,346
place 229,102
station 8,251

I think the entries we might want to move over to JMDict are a further subset of the ones tagged [product] (422), [organization] (3,176), [work] (745), [company] (919) and, not least, [person] (53,346). I'd also like to see us do more "top-level" entries from [place] at some convenient cut-off, like maybe saying ALL Japanese and foreign cities (in Japanese, everything that's a -市 or a -郡) are fine in JMDict, while less notable towns and villages (e.g. -町 -村) stay only in JMnedict. I think JMnedict can still be a place for station names, surnames, given names, low-level place names and unclassified entries, which already make up around 90% of the database's entries.
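As a rough check on the "around 90%" figure, the counts quoted above can be tallied. This is only a sketch under two assumptions: it treats every [place] entry as staying in JMnedict, and it sums tag counts rather than distinct entries (the tag counts add up to more than the ~740,155 entries, presumably because some entries carry more than one tag). The names `TAG_COUNTS`, `MOVE_CANDIDATES` and `split_counts` are made up for illustration.

```python
# Tag counts as quoted in the post (a count from February 2018).
TAG_COUNTS = {
    "male": 19_919,
    "surname": 146_453,
    "fem": 109_196,
    "product": 422,
    "given": 61_643,
    "unclass": 131_397,
    "organization": 3_176,
    "work": 745,
    "company": 919,
    "person": 53_346,
    "place": 229_102,
    "station": 8_251,
}

# Tags whose entries are proposed as candidates for JMdict.
# Assumption: [place] is left out here, since only a "top-level" part of it would move.
MOVE_CANDIDATES = {"product", "organization", "work", "company", "person"}


def split_counts(counts, candidates):
    """Return (candidate total, remaining total, overall total) of tag counts."""
    total = sum(counts.values())
    moved = sum(n for tag, n in counts.items() if tag in candidates)
    return moved, total - moved, total


moved, staying, total = split_counts(TAG_COUNTS, MOVE_CANDIDATES)
print(f"candidate tags: {moved:,} of {total:,} ({moved / total:.1%})")
print(f"remaining tags: {staying:,} of {total:,} ({staying / total:.1%})")
```

On these numbers the remaining tags come to roughly 92% of all tag assignments, which is in the same range as the "around 90%" estimate.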

I'm not suggesting we should do a massive automated move of whatever subset of entries we decide on, though. I'm thinking more along the lines of allowing editors to check entries in JMnedict and then move them over to jmdict, self-approving the move, and also allowing for new entry submissions for proper names to be accepted into JMDict. It'd just be a question of deciding on what level of notability, etc., we'd want for these proper names entries. With so many dictionaries/encyclopedias/other resources to lean on though, I feel we could mostly just implement the exact same checks we do for all other JMDict entries (i.e. is it an entry in any other dictionary/encyclopedia? If not, is it at least frequently appearing in recent news? If not, does it at least have a very high ngrams count?).
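The cascade of checks described above could be sketched roughly as follows. This is an illustration only, not an agreed procedure: the `Candidate` fields, the function name `notability_basis` and both thresholds are hypothetical, and the thread does not say what counts as "frequently appearing" or a "very high" ngrams count.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cut-offs; the thread does not specify any numbers.
NEWS_THRESHOLD = 50
NGRAM_THRESHOLD = 10_000


@dataclass
class Candidate:
    name: str
    in_reference_work: bool  # has an entry in another dictionary/encyclopedia
    recent_news_hits: int    # appearances in recent news (assumed measure)
    ngram_count: int         # ngrams corpus count (assumed measure)


def notability_basis(c: Candidate) -> Optional[str]:
    """Apply the checks in order and return the first one that passes, else None."""
    if c.in_reference_work:
        return "reference work"
    if c.recent_news_hits >= NEWS_THRESHOLD:
        return "recent news"
    if c.ngram_count >= NGRAM_THRESHOLD:
        return "ngram count"
    return None


# Made-up example values, purely to show the order of the checks.
example = Candidate("バッハ", in_reference_work=True, recent_news_hits=0, ngram_count=0)
print(notability_basis(example))
```

The point of returning the basis rather than a plain yes/no is that an editor self-approving a move could record why the entry qualified.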

It's worth noting that few other Japanese dictionaries, monolingual or bilingual, keep such a strict division between proper nouns and other words as JMDict/JMnedict do. Daijisen and Daijirin feature thousands and thousands of place names, Japanese surnames, names of persons, works of art, novels, etc., even with the existence of the 90,000-entry proper noun dictionary Daijisen Plus. GG5 has over 300 entries for (mostly foreign) song names, 400+ for (mostly foreign) film names, 900+ for names of (mostly foreign) novels, at least 2,000 entries for (mostly foreign) cities and towns, and probably a couple of thousand person entries (again, almost exclusively foreigners). Wadoku (EDICT's German cousin) contains thousands of proper noun entries for people, works of art, places and surnames.

Jim Breen 27 March 2018

Just a couple more thoughts on this issue.

One is that the JMdict/JMNedict split has never bothered me much because I've actually been insulated from it. In my working environment I run the ancient "xjdic" app in one of the windows on my screen and in its global mode it shows results from 10 multiple dictionaries at once. So Marcus' comment about "go mostly unseen by JMDict's end users" is perhaps a wakeup call.

Another is that JMdict/JMnedict are slightly different beasts. The entries in the work/person/company/organization sets in JMnedict are rather like regular dictionary entries. A significant proportion of the surname/given/etc. groups are quite simplistic surface-form+reading+romanized-reading. There are 10 entries with 田中 as the surface form, but we can't really merge them as tying the romanized translations to the readings would be a mess, and anyway some of those readings occur with different kanji.

Consider the entries with the surface form 樹下. As seen in wwwjdic it is http://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?2MDJ%BC%F9%B2%BC

What's happening there is that wwwjdic uses a crunched version of enamdict with combined entries where the surface form is the same. If the relatively common 樹下/きのした were to move to JMdict, would the other three come too? It's useful now to see the lot when looking up 樹下, and it would be a pity to lose it.

All food for thought.

Marcus 15 May 2018

Jim mentioned this topic on the "Dictionary Thoughts" page a couple of weeks ago but since there hasn't been much further comment I wanted to bring this to everybody's attention once again, esp. as there isn't really any wavering in our enforcement of the proper names/other nouns jmdict/jmnedict division. Enforcing the division makes sense of course if we haven't really made up our minds about how we want to progress, but if we are set on eventually moving over such entries anyway, it's not very meaningful to keep stuffing them into jmnedict.

What I'm trying to say is this: it's fine to hold off on migrating entries from jmnedict until later. I assume there'd have to be some updates to the database to bring the various proper name tags over into jmdict, and we'd want to telegraph that well ahead so the dictionary makers can implement it in time. Even so, I think we could make up our minds now about whether this is how we want to proceed and, if it is, begin to allow more proper noun entries in jmdict.

nicolasmaia commented 4 years ago

This is somewhat unrelated, but would this github project be an appropriate place to discuss JMnedict issues too?

jimmubreen commented 4 years ago

Yes, certainly. We already have issue #3 open covering a JMnedict matter.

JMdictProject commented 3 years ago

There's been further discussion of this on the バッハ page. See: https://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1099260
I hope to switch the discussion to here.

Marcusjmdict commented 3 years ago

I am still, 3 years later, in favor of migrating a whole bunch of entries from jmnedict to jmdict. In relation to the more recent discussion on the バッハ entry, I think [surname] is preferable to [person]. I think it's a mistake to assume people will always be smart/knowledgeable enough to make the distinction - for an extreme example beyond the scope of things we can do something about, look at the old mailing list for the post about the idol group Nogizaka46 member who signed a photograph with "JMdict" instead of the intended "firefly" (i.e. in English). I believe the choice also has implications for non-human use: not just machine learning etc., but also how the dictionary files are implemented and entries categorized in various applications.

robinjmdict commented 3 years ago

I'm generally opposed to moving large numbers of entries from JMnedict to JMdict, especially those of people, companies, works and products. I think it would be very difficult to decide on criteria for inclusion that people could agree on. The convenient thing about having a separate names dictionary is that we don't have to consider the notability of any particular name.

I prefer [person] to [surname] as the entries refer to specific people. If the proposal is to remove all references to specific people, I'm opposed, because I think simple transliterations of names belong in the names dictionary. I don't think we want entries like "スミス: [surname] Smith" in JMdict.

jimmubreen commented 3 years ago

I think the balance at present is roughly right. As stated in the policy at https://www.edrdg.org/wiki/index.php/Editorial_policy#Proper_Names we give some priority to Japanese-related names. I agree with Robin that it would be hard to develop and agree on criteria for expansion of these. Partly it's an issue with sites and apps which don't also provide access to the names dictionary.

Marcusjmdict commented 3 years ago

Could I suggest we at least consider including all the Japanese cities, special wards, wards, towns, districts (郡), and villages in jmdict? 1,724 in total (according to this) - not an impossibly large number. I'd suggest we do them without the 市, 区, 町, 郡, 村 suffixes, which is in line with what we're doing now with the cities that we do have, and also with how the kokugos handle these - we have 京都 but not 京都市 (we do have both versions for the prefectures, i.e. 高知 and 高知県, but I don't think that's an issue).

Edit 10/10: I think we should also allow for major neighborhoods in a handful of the largest/most important cities (at least Tokyo, Osaka, Yokohama, and Nagoya).

Marcusjmdict commented 2 years ago

Also I think using the [place] tag in jmdict would be convenient. (whether or not in conjunction with us including the place names in jmdict that I suggested on Oct 5)

JMdictProject / JMdictIssues

JMDict proper name policy suggestion #15
