JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Macrons in romanized names #32

Closed JMdictProject closed 3 years ago

JMdictProject commented 3 years ago

[I began writing this as an email to the mailing list quite a long time ago. I sent one related email in 2019 - see https://www.edrdg.org/jmdict_edict_list/2019/msg00349.html and there was a thread in 2017 on romanization in general - see https://www.edrdg.org/jmdict_edict_list/2017/msg00060.html ]

The question has come up about the romanization of Japanese names, especially in enamdict/JMnedict, and the use of macrons for long vowels. At present most of the long vowels have been romanized using "wapuro romaji", i.e. something like じょう has been romanized as "jou". There are a couple of reasons for that:

All this meant that じょう had to be romanized as "jo" or "joo" or "jou" or "jo-". I went for "jou".

Things have changed now and most systems are using Unicode, which allows all sorts of characters to be in the same text, so there's no reason we can't have 東京 and Tōkyō in the same file.

That raises the question of what to do with the romanization in enamdict/JMnedict, which is still mostly in the "jou" style.

What I suggest we do with this matter is:
(a) use macrons throughout for long vowels. Thus Tōkyō, Hokkaidō, etc.
(b) where we think it may be useful to have a macron-less version available for reverse searches, include that later in the name explanation.

As an example of this, I have just tweaked the 京都 entry to become (in short form):
京都 [きょうと] /(1) (place) Kyōto (city, prefecture); Kyoto (2) (surname) (fem) Kyōto

This issue is about JMnedict, but the same could apply to names in JMdict.

nicolasmaia commented 3 years ago

I think your suggestion is sensible, and it's high time macrons be used in these entries.

polm commented 3 years ago

I vote against macrons. They were a reasonable choice in 1890, when Greek and Latin were part of normal Western education, but I don't think they help understanding now, and using them is why even places with hyper-conservative editorial conventions like the New Yorker write "Tokyo" without macrons - they just get filed off by people who don't understand them. Even the Japanese government doesn't use macrons.

It's true that UTF-8 support is much better these days, so it'll be rare to be unable to show them, but there are still technical issues. You mention issues with search, but they can also cause confusion in URLs due to percent encoding, and in browsers font support can make them look bad even if they're technically displayed correctly. There are still edge cases where something is off, which is a non-issue with the current style.

Even if you prefer macrons in general, how much benefit do you think they provide? Is it worth the conversion effort, taking into account the potential errors and technical issues? You can't just use a string replace because of words like 邪悪 or 子牛. UniDic can help but is coverage good enough? I honestly don't know.
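To make the string-replace problem concrete, here is a minimal sketch. The rule table and the function name `naive_macronize` are assumptions made up for illustration; nothing like this is claimed to exist in the JMnedict tooling. It shows a blind replacement getting an ordinary long vowel right and then merging vowels that belong to different morphemes.

```python
# Naive wapuro-to-macron conversion: a blind string replace.
# NAIVE_RULES and naive_macronize are hypothetical, for illustration only.
NAIVE_RULES = [("ou", "ō"), ("uu", "ū"), ("aa", "ā")]


def naive_macronize(romaji: str) -> str:
    """Replace every doubled-vowel spelling, ignoring morpheme boundaries."""
    for plain, macron in NAIVE_RULES:
        romaji = romaji.replace(plain, macron)
    return romaji


print(naive_macronize("toukyou"))  # tōkyō  (a genuine long vowel, converted correctly)
print(naive_macronize("koushi"))   # kōshi  (wrong: 子牛 is ko + ushi)
print(naive_macronize("jaaku"))    # jāku   (wrong: 邪悪 is ja + aku)
```

The romanized string alone does not say where one morpheme ends and the next begins, so any reliable conversion needs extra information such as the kanji or a morphological analysis.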

robinjmdict commented 3 years ago

I much prefer macrons to doubled letters or "ou" but is there a reliable way of doing the conversion? As polm points out, it's not possible to tell whether two adjacent kana represent a long vowel or not from the reading alone.

> I vote against macrons. They were a reasonable choice in 1890, when Greek and Latin were part of normal Western education, but I don't think they help understanding now

For day-to-day use (e.g. newspapers), I agree. But as this is a Japanese-English dictionary, I think we should mark long vowels. They're a useful guide to pronunciation.

> where we think it may be useful to have a macron-less version available for reverse searches, include that later in the name explanation.

I'm not sure about this. Do people use the names dictionary for reverse searches? If so, wouldn't we want macron-less versions for all names? Is it too much to expect websites, apps, etc. that use JMnedict data to handle this so that a search for "Kyoto" would return "Kyōto"?

JMdictProject commented 3 years ago

I'd like to keep the discussion going on this topic, and maybe we'll get to a suitable approach.

What I really want to do is move an entry like "冬四郎 [とうしろう] (male) Toushirou" to having the "English" part of it in a form that is actually usable, and the ワープロローマ字 currently used is not really appropriate. This entry should have either Toshiro or Tōshirō or both.

Since going to strictly macron-ed or macron-less versions is not going to please everyone, how about a general approach of including both forms? For 冬四郎 we could have:

I could live with any of those, although my preference is probably for the third one.

nicolasmaia commented 3 years ago

That feels a bit redundant. What would be the advantage of using a version without macrons, especially in conjunction with the macron-ed version?

JMdictProject commented 3 years ago

Polm asked: "Even if you prefer macrons in general, how much benefit do you think they provide?"

As far as converting the existing romanization to include macrons goes, it's tricky but quite doable for the bulk of the entries. Of course, you can't fall for the trap of turning 砥歌川 (Toutagawa) into Tōtagawa, which would be wrong.

I've been keen in the past to facilitate reverse searches using the romanized versions. I agree with Robin that it should be up to the interface system. Maybe that's the way to go - have 東大 as "Tōdai" and leave it up to the interface to find it using Todai, if it wants to.

nicolasmaia commented 3 years ago

> I've been keen in the past to facilitate reverse searches using the romanized versions. I agree with Robin that it should be up to the interface system. Maybe that's the way to go - have 東大 as "Tōdai" and leave it up to the interface to find it using Todai, if it wants to.

Yes, I think that makes more sense.

polm commented 3 years ago

> to me the benefit is they indicate the pronunciation. I think it is important to indicate that 戸大 and 東台 are not both pronounced "Todai" with a short "o". If people want to omit the length signal associated with the macron, that's up to them, but I wouldn't like to omit this information from a dictionary.

To be clear, I am not in favor of omitting macrons; I would argue that wapuro style is superior to macrons. It doesn't have the technical issues of macrons and is closer to Japanese orthography (no conversion issues!). Additionally, even people with no knowledge of Japanese won't omit a "u" the same way they would a macron, so it's resistant to transmission errors.

I think some people might argue that macrons are more obviously long vowels than "ou" etc., but I believe that is either not the case or it is only true for a vanishingly small number of people. You could argue that differentiating long and non-continuous vowels is valuable for learners, but I think establishing that "ou" etc. are mostly long - just as you would if teaching Japanese with kana - is fine.

On the other hand I completely acknowledge that macron-omitted form is necessary when customary ("Tokyo"), and maybe desirable for search purposes.

JMdictProject commented 3 years ago

I'm just rolling out a modification in WWWJDIC that supports macron-less search keys matching names with macrons. So if you search for "anjiro" you get a match on アンジロー/Anjirō. It's working with ō and ū at present but can be extended.
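For anyone building an interface on top of the data, one way to get this kind of matching is to compare macron-stripped, case-folded keys. This is only a sketch under assumptions: the helper names `strip_macrons`, `search_key` and `find`, and the tiny `NAMES` table, are hypothetical, and it is not a description of how WWWJDIC actually implements the feature.

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def strip_macrons(text: str) -> str:
    """Remove macrons by decomposing, dropping U+0304, and recomposing."""
    decomposed = unicodedata.normalize("NFD", text)
    without = "".join(ch for ch in decomposed if ch != COMBINING_MACRON)
    return unicodedata.normalize("NFC", without)


def search_key(text: str) -> str:
    """Key used for comparison: no macrons, case-insensitive."""
    return strip_macrons(text).casefold()


# Hypothetical sample data; romanizations taken from examples in this thread.
NAMES = {"アンジロー": "Anjirō", "京都": "Kyōto", "東大": "Tōdai"}


def find(query: str) -> list[tuple[str, str]]:
    """Return (headword, romanization) pairs whose key equals the query's key."""
    key = search_key(query)
    return [(jp, rom) for jp, rom in NAMES.items() if search_key(rom) == key]


print(find("anjiro"))  # [('アンジロー', 'Anjirō')]
print(find("Todai"))   # [('東大', 'Tōdai')]
```

Because the key is derived from the stored form at search time, the dictionary file itself only needs to carry the macron-ed romanization. Dropping only U+0304 covers ā, ī, ū, ē and ō while leaving other diacritics untouched.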

Now that this is working, I'll look into turning a lot of the "ou" and "uu" romanizations into ō and ū.

JMdictProject commented 3 years ago

Those WWWJDIC changes are in place and seem to be working OK. I'm progressively converting "ou" romanizations to ō, starting with the low-hanging fruit like ~峠 and ~道. There are about 100k entries with "ou" readings and I've done about 10%.
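A rough sketch of what a "low-hanging fruit" pass might look like: convert only the final element, and only when the kanji suffix and the romanized ending agree. The rule table, the function name `convert_by_suffix` and the sample entries are assumptions for illustration, not the actual conversion procedure used here.

```python
# Hypothetical suffix rules: kanji suffix -> (wapuro ending, macron ending).
SUFFIX_RULES = {
    "峠": ("touge", "tōge"),
    "道": ("dou", "dō"),
}


def convert_by_suffix(headword: str, romaji: str) -> str:
    """Convert a trailing long vowel only when headword and reading both match a rule."""
    for kanji, (old, new) in SUFFIX_RULES.items():
        if headword.endswith(kanji) and romaji.lower().endswith(old):
            # Keep an initial capital if the ending starts the whole word.
            if romaji[-len(old)].isupper():
                new = new.capitalize()
            return romaji[: -len(old)] + new
    return romaji  # no safe rule applies; leave the entry for later review


print(convert_by_suffix("北海道", "Hokkaidou"))  # Hokkaidō
print(convert_by_suffix("小道", "Komichi"))      # Komichi (道 read as "michi", untouched)
```

Note that this deliberately ignores any "ou" earlier in the word, so an entry such as "Toukaidou" would come out as "Toukaidō" and still need a second pass.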

I'll close this now.