Transliteration gloss type for JMnedict

stephenmk commented 2 years ago

Similarly to how we currently have [expl], [fig], etc., gloss types in JMdict, I am wondering if it would be both desirable and technically feasible to have a transliteration / romanization gloss type for JMnedict.

A common complaint among users of the Yomichan browser extension is that JMnedict entries often clutter up search results with low-quality information. Some users eventually disable JMnedict entirely due to this annoyance. The source of the problem is that there are so many generic name entries whose glossaries contain only romanized forms of particular readings.

I put together a new version of JMnedict for Yomichan which consolidates much of this generic information. So far I've only received positive feedback. My program for producing this dictionary discards any glosses that it determines to be romanizations and then merges the related headwords and readings into single entries.
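As a rough illustration of how a romanization check of this kind could work (this is not the actual program, and every name here is hypothetical), the sketch below converts a kana reading to a JMnedict-style romanization with long vowels spelled out, then compares it with the gloss. The kana table is deliberately incomplete: obsolete kana, を, the long-vowel mark, and the apostrophe after ん are not handled, so such readings simply fail to match.

```python
# Hypothetical sketch: decide whether a gloss is just a romanization of its reading.
ROWS = {
    "": "あいうえお", "k": "かきくけこ", "s": "さしすせそ", "t": "たちつてと",
    "n": "なにぬねの", "h": "はひふへほ", "m": "まみむめも", "r": "らりるれろ",
    "g": "がぎぐげご", "z": "ざじずぜぞ", "d": "だぢづでど", "b": "ばびぶべぼ",
    "p": "ぱぴぷぺぽ",
}
# Regular consonant + vowel combinations.
KANA = {k: c + v for c, row in ROWS.items() for k, v in zip(row, "aiueo")}
KANA.update({"や": "ya", "ゆ": "yu", "よ": "yo", "わ": "wa", "ん": "n"})
# Hepburn-style irregular syllables override the regular ones.
KANA.update({"し": "shi", "ち": "chi", "つ": "tsu", "ふ": "fu",
             "じ": "ji", "ぢ": "ji", "づ": "zu"})
SMALL_Y = {"ゃ": "ya", "ゅ": "yu", "ょ": "yo"}


def to_hiragana(text):
    """Shift katakana (U+30A1..U+30F6) down to the matching hiragana."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)


def kana_to_romaji(reading):
    out = []
    double_next = False
    for ch in to_hiragana(reading):
        if ch == "っ":  # sokuon: double the next consonant (simplified)
            double_next = True
            continue
        if ch in SMALL_Y and out:
            stem = out.pop()[:-1]  # drop the trailing "i" of the previous syllable
            glide = SMALL_Y[ch]
            # "sh", "ch", "j" absorb the "y" (sha, chu, jo); others keep it (kya).
            roma = stem + (glide[1] if stem.endswith(("sh", "ch", "j")) else glide)
        else:
            roma = KANA.get(ch, ch)  # unknown characters pass through unchanged
        if double_next and roma:
            roma = roma[0] + roma
            double_next = False
        out.append(roma)
    return "".join(out)


def is_romanization(gloss, reading):
    """True when the gloss carries nothing beyond the romanized reading."""
    return gloss.strip().lower() == kana_to_romaji(reading)
```

A gloss such as "Himawari Park" does not match the romanized reading, so it would be kept rather than discarded.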

Search results for 田中

![tanaka](https://user-images.githubusercontent.com/8003332/182247038-c8d98ee8-0b8e-40e8-8917-37183cc202e6.png)

Search results for 樹下 compared with WWWJDIC

![kinoshita](https://user-images.githubusercontent.com/8003332/182247057-b5cd1f0f-3808-49f6-abec-28e3ca0bf8fd.png)

Search results for 岩波国語辞典 (since there's more info than just a romanization, these glosses are not discarded)

![iwanami](https://user-images.githubusercontent.com/8003332/182249313-4362d813-e6df-4f34-b3e3-d0fea3d6aa7a.png)

Would there be any interest in having this metadata stored directly in JMnedict in the form of glossary types or miscellaneous sense tags? If so, I can produce a list of sequence numbers, sense numbers, and glosses which I have determined to be transliterations.

By my estimate, over 80% of the glosses in JMnedict are transliterations. A lot of this information, particularly surnames and given names, does not seem to me to be particularly useful. For example, 田中 returns 10 surname readings, but there is no indication that it is almost always read as たなか.

However, I don't think anyone can deny that the other ~20% of the corpus is quite useful. There's been some discussion in the past about how JMnedict is sometimes overlooked by developers. Having this metadata might be one step we can take towards making the dictionary more appealing.

JMdictProject commented 2 years ago

With 80% of the entries basically being transliterations of the kana forms, I'm not sure tagging them as such would help a lot.

I've been aware from its inception that the raw JMnedict/enamdict data was a problem for downstream applications as some of the forms, e.g. 田中, have a heap of entries with different readings and transliterations associated with them. Simply bundling them was of limited help as some are relatively rare. To handle this in WWWJDIC I use a heavily-massaged version of enamdict in which:

- all the entries that share a common kanji form are merged into a single entry;
- a file of about 80k high-frequency kanji/kana combinations is used to enable the move of the more common forms to the front of the combined entries.

Thus for example, if you look up 田中 in WWWJDIC you get:

田中【たなか】 (p,s,f) Tanaka 【でんちゅう】 (s,g) Denchuu 【たなた】 (s) Tanata 【たんか】 (s) Tanka 【だなか】 (s) Danaka 【なたか】 (s) Nataka 【ぬなか】 (s) Nunaka 【のなか】 (s) Nonaka 【ひろか】 (s) Hiroka 【やなか】 (s) Yanaka

たなか and でんちゅう were moved to the front and the rest follow in 五十音 order.

This is all done as a daily post-processing exercise and doesn't require anything different in JMnedict itself.
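A minimal sketch of this kind of post-processing, assuming the entries are available as `(kanji, reading, gloss)` tuples and the high-frequency file has been loaded into a set of `(kanji, reading)` pairs. The function name and data shapes are hypothetical, this is not the WWWJDIC code, and plain code point ordering of kana only approximates 五十音 order.

```python
from collections import defaultdict


def merge_by_kanji(entries, frequent_pairs):
    """Group entries by kanji form and move frequent readings to the front.

    entries: iterable of (kanji, reading, gloss) tuples (assumed shape).
    frequent_pairs: set of (kanji, reading) pairs from a frequency list.
    """
    grouped = defaultdict(list)
    for kanji, reading, gloss in entries:
        grouped[kanji].append((reading, gloss))

    merged = {}
    for kanji, readings in grouped.items():
        # False sorts before True, so frequent pairs come first;
        # within each group the readings are ordered by kana code point.
        merged[kanji] = sorted(
            readings,
            key=lambda r: ((kanji, r[0]) not in frequent_pairs, r[0]),
        )
    return merged
```

For the 田中 example, passing `{("田中", "たなか"), ("田中", "でんちゅう")}` as `frequent_pairs` puts those two readings first and leaves the rest in kana order.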

Perhaps something like this is the way to go, rather than add extra tags to JMnedict? Having a [transl] or whatever tag would not be a lot of use in many cases. For WWWJDIC I'd probably filter it out.

stephenmk commented 2 years ago

If it would be possible to distribute frequency information with JMnedict, I think that could be very useful. It sounds like one flaw in the setup is that, for example, it's not clear whether the 田中・でんちゅう pair is more frequently used as a surname or as a given name. Still, I think having this information would likely be better than nothing. Perhaps priority tags similar to the [nfxx] JMdict tags could be added?

My idea was that the transliteration tags could give developers and end-users more options with respect to how the data is organized. Certainly there wouldn't be any reason to display these tags directly to end-users. Rather than that, these tags could be used to split JMnedict into two broad categories: "generic" name entries (the 80%) and "specific" name entries (the 20%), with the latter probably being of more interest. So for example, I could use the JMnedict data to produce separate Yomichan dictionary files for each category, and users could decide if they want to install one or both.
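To make the proposed split concrete, here is a small sketch that assumes each gloss already carries a boolean transliteration flag. The dictionary layout and field names are hypothetical and do not reflect the JMnedict schema.

```python
def split_entries(entries):
    """Partition entries into "generic" and "specific" lists.

    An entry is treated as generic only when every one of its glosses is
    flagged as a transliteration; anything else is kept as specific.
    """
    generic, specific = [], []
    for entry in entries:
        glosses = entry["glosses"]
        if glosses and all(g["transliteration"] for g in glosses):
            generic.append(entry)
        else:
            specific.append(entry)
    return generic, specific
```

Each list could then be exported as its own dictionary file, leaving the choice of which to install up to the user.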

These might be concerns that are specific to Yomichan, but I also wouldn't want to bundle / combine the "specific" name entries in the same way that I've combined the entries with transliteration glosses. Doing so would result in some loss of functionality for those entries. I also have no qualms about assuming that users can read kana, so to me the transliterations are just taking up valuable space. Having a transliteration tag would give developers the option to hide or omit the information.

I won't push the issue, especially if this doesn't sound interesting to anyone else, but I thought it might be worth bringing up for discussion.

JMdictProject commented 2 years ago

I suspect you could do an on-the-fly division and achieve much the same outcome. If you put the entries tagged fem/masc/given/place/surname/unclass in one group and all the others in another, it may well work.

As for identifying the more common readings of the kanji name forms, the data I use is over 20 years old, and while it's useful for tweaking the WWWJDIC output, I don't think I'd want to see it in the database as some sort of supported priority code. It would be a nice project to try and do that sort of extraction. Some sites worth looking at include: https://myoji-yurai.net/sp/prefectureRanking.htm https://en.wikipedia.org/wiki/List_of_common_Japanese_surnames (look at the references)

stephenmk commented 2 years ago

> I suspect you could do an on-the-fly division and achieve much the same outcome. If you put the entries tagged fem/masc/given/place/surname/unclass in one group and all the others in another, it may well work.

This doesn't work, unfortunately. The person and place categories especially are very mixed. Many of them are either foreign names, Japanese names with supplemental information in parentheses, or partially translated place names (like "Himawari Park").

Breakdown by individual type

| type | romanized gloss count | total gloss count | perc. |
| :-- | --: | --: | --: |
| char | 2 | 120 | 1.7% |
| company | 45 | 1,212 | 3.7% |
| creat | 0 | 2 | 0.0% |
| dei | 0 | 9 | 0.0% |
| doc | 0 | 1 | 0.0% |
| ev | 0 | 30 | 0.0% |
| fem | 106,865 | 109,862 | 97.3% |
| fict | 0 | 13 | 0.0% |
| given | 61,028 | 61,846 | 98.7% |
| group | 0 | 51 | 0.0% |
| leg | 0 | 1 | 0.0% |
| masc | 19,954 | 20,601 | 96.9% |
| myth | 0 | 6 | 0.0% |
| obj | 0 | 18 | 0.0% |
| organization | 24 | 5,960 | 0.4% |
| person | 23,726 | 53,747 | 44.1% |
| place | 187,144 | 229,788 | 81.4% |
| product | 6 | 585 | 1.0% |
| relig | 0 | 1 | 0.0% |
| serv | 1 | 76 | 1.3% |
| station | 10 | 8,261 | 0.1% |
| surname | 141,944 | 147,492 | 96.2% |
| unclass | 86,676 | 134,012 | 64.7% |
| work | 5 | 1,179 | 0.4% |
| **TOTAL** | **627,430** | **774,873** | **81.0%** |

Breakdown by type combinations

| types | romanized gloss count | total gloss count | perc. |
| :-- | --: | --: | --: |
| place | 170,342 | 212,778 | 80.1% |
| unclass | 86,676 | 134,012 | 64.7% |
| surname | 120,267 | 125,532 | 95.8% |
| fem | 102,872 | 105,624 | 97.4% |
| given | 57,971 | 58,745 | 98.7% |
| person | 23,379 | 53,218 | 43.9% |
| masc | 19,445 | 19,897 | 97.7% |
| place; surname | 15,292 | 15,403 | 99.3% |
| station | 10 | 8,261 | 0.1% |
| organization | 24 | 5,959 | 0.4% |
| given; surname | 2,556 | 2,581 | 99.0% |
| fem; surname | 2,565 | 2,579 | 99.5% |
| company | 43 | 1,185 | 3.6% |
| work | 5 | 1,164 | 0.4% |
| fem; place; surname | 729 | 730 | 99.9% |
| product | 6 | 578 | 1.0% |
| fem; person | 251 | 386 | 65.0% |
| fem; place | 264 | 316 | 83.5% |
| masc; surname | 215 | 310 | 69.4% |
| given; place; surname | 290 | 291 | 99.7% |
| given; place | 209 | 226 | 92.5% |
| fem; masc | 163 | 197 | 82.7% |
| masc; person | 95 | 136 | 69.9% |
| char | 2 | 107 | 1.9% |
| serv | 1 | 76 | 1.3% |
| group | 0 | 51 | 0.0% |
| ev | 0 | 30 | 0.0% |
| fem; masc; surname | 18 | 23 | 78.3% |
| masc; place; surname | 7 | 20 | 35.0% |
| company; surname | 1 | 18 | 5.6% |
| obj | 0 | 14 | 0.0% |
| masc; place | 7 | 11 | 63.6% |
| char; work | 0 | 10 | 0.0% |
| dei | 0 | 7 | 0.0% |
| company; product | 0 | 4 | 0.0% |
| fict | 0 | 4 | 0.0% |
| fict; obj | 0 | 4 | 0.0% |
| myth | 0 | 4 | 0.0% |
| fem; masc; place | 0 | 3 | 0.0% |
| fem; masc; place; surname | 3 | 3 | 100.0% |
| person; work | 0 | 3 | 0.0% |
| char; fict | 0 | 2 | 0.0% |
| company; person | 0 | 2 | 0.0% |
| company; place | 0 | 2 | 0.0% |
| fict; place | 0 | 2 | 0.0% |
| char; product | 0 | 1 | 0.0% |
| company; place; surname | 1 | 1 | 100.0% |
| creat; fict | 0 | 1 | 0.0% |
| creat; leg | 0 | 1 | 0.0% |
| dei; myth | 0 | 1 | 0.0% |
| dei; relig | 0 | 1 | 0.0% |
| doc | 0 | 1 | 0.0% |
| fem; work | 0 | 1 | 0.0% |
| given; masc | 1 | 1 | 100.0% |
| given; person | 1 | 1 | 100.0% |
| given; work | 0 | 1 | 0.0% |
| myth; place | 0 | 1 | 0.0% |
| organization; product | 0 | 1 | 0.0% |
| person; place | 0 | 1 | 0.0% |
| product; surname | 0 | 1 | 0.0% |
| **TOTAL** | **603,711** | **750,523** | **80.4%** |

Here's the sqlite3 database that I used to produce the above tables: jmnedict.sqlite3.zip
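For readers who want to reproduce a per-type tally like the first table, a sketch using Python's built-in `sqlite3` module follows. The schema of the attached database is not described in this thread, so the `gloss` table and its `name_type` and `is_romanized` columns are assumptions made purely for illustration, with one row per (gloss, type) pair.

```python
import sqlite3

# Hypothetical schema: one row per (gloss, name type) pair,
# with is_romanized set to 1 or 0.
QUERY = """
    SELECT name_type,
           SUM(is_romanized)                               AS romanized,
           COUNT(*)                                        AS total,
           ROUND(100.0 * SUM(is_romanized) / COUNT(*), 1)  AS perc
    FROM gloss
    GROUP BY name_type
    ORDER BY name_type
"""


def breakdown(rows):
    """Load (name_type, is_romanized) rows into memory and tally them by type."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE gloss (name_type TEXT, is_romanized INTEGER)")
    con.executemany("INSERT INTO gloss VALUES (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result
```

Under this layout a gloss with several name types contributes one row per type, so the per-type totals add up to more than the number of distinct glosses.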

JMdictProject commented 2 years ago

All that indicates that it would be a rather large task to get accurate transliteration tag information. I can't say I have a lot of enthusiasm for it. BTW, are the [unclass] statistics correct? I scanned a few hundred of those entries and only found a few which are not simple transliterations, and most of those, e.g. クイーンメリー号, are easy to reclassify.

stephenmk commented 2 years ago

Looks like the [unclass] category has tens of thousands of foreign names in it, which I didn't consider to be transliterations for this purpose since the glosses contain useful info that's not always obvious from the kana. There are also a modest number of glosses for entries with obsolete kana, like 「みつゑ "Mitsue (Mitsuwe)"」, that I didn't bother to parse, so those are also not counted as transliterations in the statistics. If it would be worthwhile to reformat those entries (e.g. by splitting them into two glosses or by removing the portion in parentheses), I can put a list together.

stephenmk commented 2 years ago

By my count there are 630 such entries in JMnedict. If it's not a whole lot of effort, would it be possible to use the bulk updater to reformat these?

https://gist.github.com/stephenmk/fe928d1addbe11c384721df6f97fa06f

It's not unheard-of for the "wi" "wo" etc romanizations to be used in English-language media ([1], [2], [3]), so I don't think it hurts to have both forms as glosses.

JMdictProject commented 2 years ago

Can be done. I'm not sure it's necessary, but it's not hard. I did one as an example - see entry 5083440 (ミツヱ)

If you are doing extractions from the database, it would be a great help if you could generate the update file at the same time. The instructions for updating that ミツヱ entry consisted of three lines:

    ...
    seq 5083440
    repl gloss "Mitsue (Mitsuwe)" "Mitsue"
    add gloss "Mitsuwe"
    ...

If you are able to generate a file in that format, it would save me a heap of time.

[I didn't mean to make it go bold like that.]
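A minimal sketch of generating instructions in that shape, based only on the three example lines quoted above. Any other rules of the bulk-update format are unknown here, and the function name is hypothetical.

```python
import re

# Matches glosses of the form "Main (Alternative)", e.g. "Mitsue (Mitsuwe)".
PAREN_GLOSS = re.compile(r"^(?P<main>[^()]+?)\s*\((?P<alt>[^()]+)\)$")


def update_lines(seq, gloss):
    """Return the update instructions for one entry, or [] if nothing to split."""
    m = PAREN_GLOSS.match(gloss)
    if m is None:
        return []
    return [
        f"seq {seq}",
        # Replace the combined gloss with the main romanization...
        f'repl gloss "{gloss}" "{m["main"]}"',
        # ...and add the parenthesized form as a separate gloss.
        f'add gloss "{m["alt"]}"',
    ]
```

Calling `update_lines(5083440, "Mitsue (Mitsuwe)")` yields the three lines from the example, and glosses without a parenthesized part produce no output.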

stephenmk commented 2 years ago

> I did one as an example - see entry 5083440 (ミツヱ)

Looks like I accidentally excluded names that are written exclusively in katakana from my original search. I've added those entries to the list (186 total, excluding 5083440 ミツヱ). There are now 816 entries to update.

> If you are doing extractions from the database, it would be a great help if you could generate the update file at the same time.

Not a problem at all. This should hopefully work: https://gist.github.com/stephenmk/f81b4989df6ca5f944014665f8d6959c

JMdictProject commented 2 years ago

Thanks. Done. I'll close this for now.

JMdictProject / JMdictIssues

Transliteration gloss type for JMnedict #73