JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Invalid format of <xref>: using centre-dot while doctype says it must not #90

Closed scriptin closed 1 year ago

scriptin commented 1 year ago

Hello!

According to JMdict's doctype:

<!ELEMENT xref (#PCDATA)*>
        <!-- This element is used to indicate a cross-reference to another
        entry with a similar or related meaning or sense. The content of
        this element is typically a keb or reb element in another entry. In some
        cases a keb will be followed by a reb and/or a sense number to provide
        a precise target for the cross-reference. Where this happens, a JIS
        "centre-dot" (0x2126) is placed between the components of the
        cross-reference. The target keb or reb must not contain a centre-dot.

Note the last sentence: "The target keb or reb must not contain a centre-dot."

In the latest release, I've just found several instances of this rule being broken:

$ grep '<xref>.*・.*・[^0-9]' JMdict.xml
<xref>OH・オー・エイチ</xref>
<xref>OB・オー・ビー・1</xref>
<xref>SP・エス・ピー・3</xref>
<xref>CGI・シー・ジー・アイ・2</xref>
<xref>HWR・エイチ・ダブリュー・アール</xref>
<xref>LCD・エル・シー・ディー</xref>
<xref>CM・シー・エム・1</xref>
<xref>SF・エス・エフ</xref>
<xref>ユー・エス・ビー・1</xref>
<xref>DK・ディー・ケー・1</xref>
<xref>LDK・エル・ディー・ケー</xref>
<xref>SF・エス・エフ</xref>
<xref>SFA・エス・エフ・エー</xref>
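
For anyone who would rather not rely on the regex, here is a minimal Python sketch of roughly the same check. It assumes an xref holds at most a keb, a reb, and a trailing sense number, so more than two non-numeric components means one of the targets itself contains a centre-dot. The function names are hypothetical and not part of any JMdict tooling.

```python
import re

CENTRE_DOT = "\u30fb"  # KATAKANA MIDDLE DOT, the separator used inside <xref>
XREF_RE = re.compile(r"<xref>(.*?)</xref>")


def is_ambiguous(xref_text: str) -> bool:
    """Return True if the xref has more non-numeric parts than keb + reb."""
    parts = xref_text.split(CENTRE_DOT)
    # Assumption: a trailing all-digit component is the sense number.
    if len(parts) > 1 and parts[-1].isdigit():
        parts = parts[:-1]
    return len(parts) > 2


def find_ambiguous_xrefs(lines):
    """Yield the text of every ambiguous xref found in an iterable of lines."""
    for line in lines:
        for match in XREF_RE.finditer(line):
            if is_ambiguous(match.group(1)):
                yield match.group(1)
```

Passing an open file handle for `JMdict.xml` (opened with `encoding="utf-8"`) to `find_ambiguous_xrefs` should list the same kind of entries as the grep above.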

Related to #88 and #89

stephenmk commented 1 year ago

I made note of this a couple of times, but it seems some still managed to slip through.

Edit: Amendments have now been submitted for all the entries mentioned above. I also have my own program for detecting irregularities in these <xref> elements, and it found only one other problem in addition to those already noted.

JMdictProject commented 1 year ago

Thanks for pointing this out. I'd totally overlooked it, and I'd missed Stephen's comments. The cases need to be tracked down and amended (in progress at the moment). We also need to make sure that in future the xref is constructed to avoid including the nakaguro form.
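
One way such a guard could look, as a minimal Python sketch. `build_xref` is a hypothetical helper, not the code the JMdict build actually uses; it simply refuses to emit an xref whose keb or reb contains a nakaguro, on the assumption that the caller can then fall back to a different target form.

```python
CENTRE_DOT = "\u30fb"  # KATAKANA MIDDLE DOT (nakaguro)


def build_xref(keb=None, reb=None, sense=None):
    """Join the given components into xref text, rejecting unsafe targets."""
    targets = [t for t in (keb, reb) if t]
    if not targets:
        raise ValueError("an xref needs at least a keb or a reb")
    for target in targets:
        # The DTD says the target keb or reb must not contain a centre-dot.
        if CENTRE_DOT in target:
            raise ValueError(f"target {target!r} contains a centre-dot")
    parts = targets + ([str(sense)] if sense is not None else [])
    return CENTRE_DOT.join(parts)
```

With this, `build_xref(keb="OB", sense=1)` succeeds, while passing the reb オー・ビー raises a `ValueError` rather than producing an ambiguous cross-reference.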

The NG version of the JMdict XML will remove this problem. (I'll leave the issue open for a bit to keep it visible.)

JMdictProject commented 1 year ago

Can close it now.