Latest enamdict files no longer have accented glosses

JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Latest enamdict files no longer have accented glosses #132

Closed. melink14 closed this issue 3 months ago

melink14 commented 3 months ago

A while back, many of the English glosses in enamdict were updated to use modern accented characters to represent long or accented vowels.

For example, 勘太朗 became Kantarō. This is still true if we look at the entry in the DB: https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&e=2243482

However, when downloading the enamdict file from http://ftp.edrdg.org/pub/Nihongo/enamdict.gz, all such characters are now rendered as plain ASCII characters (so Kantarō is rendered as Kantaro). As far as I can tell, this happened to all such characters.

This has been the case for at least 2 weeks but less than 3 weeks, based on the diff in my weekly dictionary update. (About 80,000 glosses are affected.)

I assume this isn't expected. Did something change recently in how that file is generated?
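For anyone who wants to check their own copies, the kind of change described above (a gloss that differs from its earlier version only by lost diacritics) can be detected roughly as follows. This is a minimal sketch, not the script used for the weekly diff; the helper names `strip_marks` and `lost_diacritics` are hypothetical, and it assumes the old and new glosses have already been extracted as strings.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Return text with combining marks removed (e.g. 'ō' becomes 'o')."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def lost_diacritics(old_gloss: str, new_gloss: str) -> bool:
    """True if new_gloss is exactly old_gloss with its diacritics stripped."""
    # Hypothetical helper: flags only pure diacritic loss, not other edits.
    return old_gloss != new_gloss and strip_marks(old_gloss) == new_gloss


print(lost_diacritics("Kantarō", "Kantaro"))  # diacritic was dropped
print(lost_diacritics("Kantarō", "Kantarō"))  # unchanged gloss
```

Letters that have no Unicode decomposition (for example đ) are not changed by `strip_marks`, so a check like this would miss them.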

JMdictProject commented 3 months ago

The change was made to cater for some Vietnamese diacritics by converting them to plain alphabetic characters. It wasn't supposed to touch things like ō.

I'll need to investigate it - the same XSLT script is used for generating the edict editions and the ō are OK there.

It might be a day or two before it can be resolved.

melink14 commented 3 months ago

Thanks! No problems even if it takes a few days!

JMdictProject commented 3 months ago

I've switched back to the previous scripts, so from the next daily generation all those ō/ū names should return. The 10 or so Vietnamese names will be a bit mangled until we get a proper fix.

The XSLT scripts that generate the edict and enamdict versions are different for the two dictionaries.

JMdictProject commented 3 months ago

OK, I think it's all fixed, and from tomorrow's distribution, all the diacritics available via JIS coding will be in enamdict. Some, such as the more way-out Vietnamese ones, will just have the plain alphabetics. I think this can be closed now.
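The behaviour described in the comment above (keep a diacritic when the target JIS-based encoding can represent it, otherwise fall back to the plain letter) could be sketched in Python as below. This is only an illustration and not the actual XSLT logic; the function name `fold_unencodable` is hypothetical, and the use of Python's `euc_jp` codec as the stand-in for "JIS coding" is an assumption.

```python
import unicodedata


def fold_unencodable(text: str, encoding: str = "euc_jp") -> str:
    """Keep characters the encoding can represent; strip marks from the rest."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)  # representable as-is, e.g. 'ō' in EUC-JP
        except UnicodeEncodeError:
            # Not representable: decompose and drop the combining marks.
            decomposed = unicodedata.normalize("NFD", ch)
            base = "".join(c for c in decomposed if not unicodedata.combining(c))
            # If nothing is left after stripping, keep the original character.
            out.append(base or ch)
    return "".join(out)


print(fold_unencodable("Kantarō"))  # macron vowel is kept
print(fold_unencodable("Nguyễn"))   # stacked Vietnamese diacritics are folded
```

Checking each character against the output encoding, rather than folding every accented letter, is what keeps the ō/ū names intact while still handling the characters that cannot be written to the legacy file.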

JMdictProject commented 3 months ago

I should have mentioned that the great XSLT scripts that turn the JMdict/JMnedict files into the legacy edict/enamdict forms were developed by Jean-Luc Leger. Jean-Luc provided the updates that sorted out this issue.