JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Readings of initialisms #88

Closed robinjmdict closed 1 year ago

robinjmdict commented 1 year ago

As I mentioned in the comments on entry 2446480, I’m proposing a change to the way we handle readings of initialisms.

Our current usual practice is to have a single reading, without nakaguro separating each letter. Previous proposals to include nakaguro versions have generally been rejected on the grounds that they serve little use and add clutter, as initialisms in Japanese are written in Roman letters the vast majority of the time.

But with the introduction of search-only forms in JMdict, clutter is no longer the concern it once was. Additionally, there is a benefit to using nakaguro in readings of initialisms: it makes them easier to read. For these reasons, I think it would be sensible to use nakaguro in all non-[uk] initialisms and make the existing nakaguro-less forms search-only. For [uk]-tagged entries, I'd suggest the reverse: making the nakaguro versions search-only.

This could be automated with a script and the bulk updater. I assume the process would be similar to the one used in 2013 to add nakaguro to 外来語.

JMdictProject commented 1 year ago

I think that approach is sound and can go ahead. The "readings" of most initialisms seem to occur quite rarely in text. For example, for CPM neither シーピーエム nor シー・ピー・エム reaches the n-gram threshold (we have the former and the JEs have the latter).

In fact, (uk)-tagged initialisms are rather rare. The few there are, such as CQC/シーキューシー, look a bit dubious.

I suspect it is unlikely the generation and insertion of nakaguro forms can be automated very easily. The big one that was done in 2013 for gairaigo used the software behind the segmenter at http://nlp.cis.unimelb.edu.au/jwb/clst/gairaigo.html, which uses a big gairaigo database. This doesn't really work with initials. I'll do some edits by hand and see if that indicates whether an automated approach might work. It might be best to just tidy them up by hand.

robinjmdict commented 1 year ago

This is how I think it could be automated:

First, you extract all the entries that have a kanji form consisting only of capital letters (and numbers). Then, after creating a mapping of letters/numbers to katakana readings, you use the letters/numbers in the kanji field to generate a nakaguro reading for each entry. Finally, sk tags would be added to any existing readings without nakaguro.

While typing this, I realised that this set of entries would include acronyms (e.g. NASA, AIDS). But depending on how many initialism/acronym entries there are, we may be able to go through the list manually and remove all the acronyms (or find a clever way to automate this as well).
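
The extraction and generation steps described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: `LETTER_KANA`, `is_candidate` and `plan_readings` are hypothetical names, the per-letter kana choices are illustrative, and digits are left out because their readings need separate handling. It also applies the non-[uk]/[uk] rule from the original proposal.

```python
import re

# Illustrative letter-to-katakana table (an assumption, not the project's data).
LETTER_KANA = {
    "A": "エー", "B": "ビー", "C": "シー", "D": "ディー", "E": "イー",
    "F": "エフ", "G": "ジー", "H": "エイチ", "I": "アイ", "J": "ジェー",
    "K": "ケー", "L": "エル", "M": "エム", "N": "エヌ", "O": "オー",
    "P": "ピー", "Q": "キュー", "R": "アール", "S": "エス", "T": "ティー",
    "U": "ユー", "V": "ブイ", "W": "ダブリュー", "X": "エックス",
    "Y": "ワイ", "Z": "ゼット",
}

NAKAGURO = "・"


def is_candidate(form: str) -> bool:
    """True if the form consists only of capital ASCII letters."""
    return re.fullmatch(r"[A-Z]+", form) is not None


def plan_readings(form: str, is_uk: bool = False) -> list[tuple[str, bool]]:
    """Return (reading, is_search_only) pairs for one entry.

    Non-[uk] entries show the nakaguro reading and hide the plain one;
    [uk] entries do the reverse.
    """
    parts = [LETTER_KANA[ch] for ch in form]
    dotted = NAKAGURO.join(parts)
    plain = "".join(parts)
    if dotted == plain:  # single-letter form: nothing to separate
        return [(plain, False)]
    visible, hidden = (plain, dotted) if is_uk else (dotted, plain)
    return [(visible, False), (hidden, True)]


for form in ["CPM", "NASA", "2DK"]:
    if is_candidate(form):
        print(form, plan_readings(form))
```

Note that NASA passes the filter and gets a letter-by-letter reading, which is exactly the acronym problem mentioned above, so the candidate list would still need a manual pass.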

JMdictProject commented 1 year ago

A quick update on this one.

I'm still working on identifying the 450. Once that's done I should be able to format it for the updater.

JMdictProject commented 1 year ago

Another issue to do with initialisms is the variant katakanaizations of various letters. For example, H can be either エイチ or エッチ. It seems some published dictionaries prefer エッチ but both the n-grams and WWW hits favour エイチ. Our current ADHD entry is "エーディーエッチディー" (as in Koj), but エーディーエイチディー is in GG5 and 中辞典, and gets more WWW hits.

I propose to standardize on エイチ, and include the エッチ versions as [sk].

robinjmdict commented 1 year ago

I agree with the proposal. In real-world use, エイチ clearly dominates.

I think J (ジェイ/ジェー) and K (ケイ/ケー) are the only other letters where this is an issue.

ジェイ appears to be the preferred form in almost all cases whereas ケイ/ケー is a lot more mixed.

ジェイポップ 3,775
ジェーポップ 116

ディージェイ 2,079
ディージェー 317

ケイワン 1,908
ケーワン 994

エヌエイチケイ 639
エヌエイチケー 882

The dictionaries consistently use ジェー/ケー, even when the "official" pronunciation is different (e.g. エヌエイチケイ, ケイワン).

I suggest standardising on ジェイ and ケイ, with ジェー and ケー versions as sk.

JMdictProject commented 1 year ago

> I suggest standardising on ジェイ and ケイ, with ジェー and ケー versions as sk.

Not so sure about that. I think ジェー and ケー are the more "official" forms and ジェイ and ケイ are more what the great unwashed happen to use (or it's what pops out of their IMEs). I see the GG5 entry for 2DK has "にディーケー" (as does JMdict.)

Since we are really indicating the pronunciation of initialisms I think we should stick to ー for the representation of longer sounds.

robinjmdict commented 1 year ago

The "great unwashed" does include NHK and the JR Group, at least when it comes to their names.

That being said, NHK's own pronunciation guide lists ジェー and ケー as the standard forms, with ジェイ and ケイ as permitted variants.


Wikipedia appears to favour ジェイ and ケイ (see DJ, 4K, K2).

I do agree that ジェー/ケー are probably the more appropriate choice for a dictionary. They're a better indication of pronunciation (in most cases), and ー is used to represent long vowels in all the other letters.

JMdictProject commented 1 year ago

Interesting inconsistency on NHK's part. Yes, let's go with ジェー/ケー. I'm still polishing the ~400 forms, and I've a bit of programming to do to create the update commands. Should have something by the end of the week.
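
A rough sketch of how the letter variants discussed in this thread might be expanded when generating the update commands. It assumes the preferred forms are エイチ, ジェー and ケー, and that every other combination is kept as a search-only form. The thread states this explicitly only for エッチ, so treating ジェイ/ケイ the same way is an assumption. `OPTIONS` and `readings_with_variants` are hypothetical names, and the table covers only the letters needed for the examples.

```python
from itertools import product

# For each letter, the preferred kana comes first, then any alternates.
# Partial, illustrative table (an assumption, not the project's data).
OPTIONS = {
    "A": ["エー"],
    "D": ["ディー"],
    "N": ["エヌ"],
    "H": ["エイチ", "エッチ"],
    "J": ["ジェー", "ジェイ"],
    "K": ["ケー", "ケイ"],
}


def readings_with_variants(form: str, sep: str = "・") -> tuple[str, list[str]]:
    """Return (preferred reading, list of variant readings) for an initialism.

    itertools.product yields the all-preferred combination first, because
    the preferred kana is listed first for every letter.
    """
    combos = [sep.join(parts) for parts in product(*(OPTIONS[ch] for ch in form))]
    return combos[0], combos[1:]


preferred, variants = readings_with_variants("ADHD")
print(preferred)  # the エイチ form
print(variants)   # the エッチ form, a candidate for [sk]
```

For an entry with more than one variable letter, such as NHK, the number of variants grows multiplicatively, so it may be worth limiting the search-only forms to those that are actually attested.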

JMdictProject commented 1 year ago

OK, the ~400 initials entries have been updated as proposed. I'll leave this issue open for a while. There may be some progress with #89 eventually.