JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Oddities with Meikyō する-verb transitivity update #67

Closed stephenmk closed 1 year ago

stephenmk commented 2 years ago

Yesterday I found that 跋扈 (1573260) had not been updated with a [vi] tag by last year's automated "Meikyo vt and vi additions" process as we would have expected. I found that another word with the same kanji, 扈従 (2104700), had also not been updated.

Marcus wrote:

wonder why the automated process missed this one, just one sense in meikyo.

Robin replied:

meikyo: ばっ‐こ【▼跋▼扈】 I'm guessing that Jim's script didn't strip the triangles (which mark non-joyo kanji) from meikyo's headwords, meaning it didn't match with any entry in jmdict. There are probably quite a few entries that were skipped over because of this. Maybe Jim can take a look at this when he has time.

I went looking for some counter-examples and found these three entries. They were all correctly updated by the script. Doesn't seem like the ▼ marks by themselves are necessarily the culprit.

1166970 ひとめ‐ぼれ【一目▼惚れ】
1563300 はい‐よう【▼佩用】
1563670 ふ‐かん【▼俯▼瞰】
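
For reference, here is a minimal sketch of the kind of headword normalization Robin describes. It is not Jim's script. The function name `parse_headword` is hypothetical, and it assumes that headwords always look like `reading【kanji】`, that ▼ and ▽ are the only marks placed before kanji, and that the reading is split with U+2010 hyphens as in the examples above.

```python
import re

# Marks that precede kanji inside the brackets.
# Assumption: ▼ and ▽ are the only two that occur.
KANJI_MARKS = "▼▽"

# Assumed headword shape: reading【kanji form】
HEADWORD_RE = re.compile(r"^(?P<reading>[^【]+)【(?P<kanji>[^】]+)】$")


def parse_headword(headword):
    """Return (reading, kanji) with hyphens and kanji marks removed,
    or None if the headword does not have the assumed shape."""
    m = HEADWORD_RE.match(headword.strip())
    if m is None:
        return None
    # Drop the U+2010 (and ASCII) hyphens that split the reading.
    reading = re.sub(r"[\u2010\-]", "", m.group("reading"))
    # Drop the marks so the form can be compared with a JMdict kanji form.
    kanji = m.group("kanji").translate({ord(c): None for c in KANJI_MARKS})
    return reading, kanji


if __name__ == "__main__":
    for hw in [
        "ばっ\u2010こ【▼跋▼扈】",
        "ひとめ\u2010ぼれ【一目▼惚れ】",
        "はい\u2010よう【▼佩用】",
        "ふ\u2010かん【▼俯▼瞰】",
    ]:
        print(parse_headword(hw))
```

If the source text really contains 跋扈 as ordinary characters, this yields `('ばっこ', '跋扈')`, which would match the JMdict entry. A miss would then have to come from somewhere else, such as the characters not being what they appear to be.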

This issue may be worth investigating. My guess is that this was caused by an error in converting 扈 from its EPWING encoding (Shift JIS, I think?) into Unicode.

JMdictProject commented 2 years ago

I'll look into this when I get a chance, but it will be several weeks off.

stephenmk commented 1 year ago

I believe I've now updated the vast majority (if not all) of entries that were missed. (Or at least the ones that were missed due to this particular error).

It seems the affected entries were skipped because the corresponding forms in the meikyo EPWING contain kanji that are encoded with ad-hoc bitmap images rather than regular EUC-JP or UTF fonts.

JMdictProject commented 1 year ago

Codepoints not fonts. Yes, EPWING was rather limited. I won't pursue it any further and the issue can be closed.

stephenmk commented 1 year ago

I think some entries (like 結婚) probably didn't get picked up because, despite only containing one sense in English, they contain multiple senses with glosses from different languages. In any case, I think the edits that I submitted today should cover all the entries that were missed.

EDIT: Actually, I missed some. I'll open a new issue.
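
To illustrate the multi-language point above: a rough sketch of counting only the English senses of an entry in the JMdict XML. The helper name `english_senses` is hypothetical and the sample entry is made up for illustration (it is not the real 結婚 entry). It assumes that a `gloss` without an `xml:lang` attribute is English, which is the default in the JMdict DTD.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def english_senses(entry):
    """Return the <sense> elements that have at least one English gloss.
    Assumption: a gloss with no xml:lang attribute is English."""
    senses = []
    for sense in entry.findall("sense"):
        glosses = sense.findall("gloss")
        if any(g.get(XML_LANG, "eng") == "eng" for g in glosses):
            senses.append(sense)
    return senses


# Made-up entry: one English sense plus senses for other languages.
SAMPLE = """<entry>
<ent_seq>0000000</ent_seq>
<sense><gloss>marriage</gloss></sense>
<sense><gloss xml:lang="dut">huwelijk</gloss></sense>
<sense><gloss xml:lang="ger">Heirat</gloss></sense>
</entry>"""

entry = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    print(len(entry.findall("sense")))   # all senses, any language
    print(len(english_senses(entry)))    # English senses only
```

A check written as "the entry has exactly one sense" would skip this sample entry, while "exactly one English sense" would pick it up.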