CoreOffice / XMLCoder

Easy XML parsing using Codable protocols in Swift
https://coreoffice.github.io/XMLCoder/
MIT License

ENTITY tags #199

Open MartinP7r opened 4 years ago

MartinP7r commented 4 years ago

The XML file I'm working with contains a lot of <!ENTITY...> style abbreviations inside the DOCTYPE declaration that don't seem to get picked up. Is there any configuration I need in order to make this work?

The declarations look like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE JMdict [
<!ELEMENT JMdict (entry*)>

// (...)

<!ENTITY MA "martial arts term">
<!ENTITY X "rude or X-rated term (not displayed in educational software)">
<!ENTITY abbr "abbreviation">
<!ENTITY adj-i "adjective (keiyoushi)">
<!ENTITY adj-ix "adjective (keiyoushi) - yoi/ii class">
<!ENTITY adj-na "adjectival nouns or quasi-adjectives (keiyodoshi)">
<!ENTITY adj-no "nouns which may take the genitive case particle `no'">
<!ENTITY adj-pn "pre-noun adjectival (rentaishi)">
<!ENTITY adj-t "`taru' adjective">
<!ENTITY adj-f "noun or verb acting prenominally">
<!ENTITY adv "adverb (fukushi)">
<!ENTITY adv-to "adverb taking the `to' particle">
<!ENTITY arch "archaism">
<!ENTITY ateji "ateji (phonetic) reading">
<!ENTITY aux "auxiliary">
<!ENTITY aux-v "auxiliary verb">
<!ENTITY aux-adj "auxiliary adjective">
<!ENTITY Buddh "Buddhist term">
<!ENTITY chem "chemistry term">
<!ENTITY chn "children's language">
<!ENTITY col "colloquialism">
<!ENTITY comp "computer terminology">
<!ENTITY conj "conjunction">
<!ENTITY cop "copula">
<!ENTITY ctr "counter">
<!ENTITY derog "derogatory">
<!ENTITY eK "exclusively kanji">
<!ENTITY ek "exclusively kana">
<!ENTITY exp "expressions (phrases, clauses, etc.)">
<!ENTITY fam "familiar language">
<!ENTITY fem "female term or language">
<!ENTITY food "food term">
<!ENTITY geom "geometry term">
<!ENTITY gikun "gikun (meaning as reading) or jukujikun (special kanji reading)">
<!ENTITY hon "honorific or respectful (sonkeigo) language">
<!ENTITY hum "humble (kenjougo) language">
<!ENTITY iK "word containing irregular kanji usage">
<!ENTITY id "idiomatic expression">
<!ENTITY ik "word containing irregular kana usage">
<!ENTITY int "interjection (kandoushi)">
<!ENTITY io "irregular okurigana usage">
<!ENTITY iv "irregular verb">
<!ENTITY ling "linguistics terminology">
<!ENTITY m-sl "manga slang">
<!ENTITY male "male term or language">
<!ENTITY male-sl "male slang">
<!ENTITY math "mathematics">
<!ENTITY mil "military">
<!ENTITY n "noun (common) (futsuumeishi)">
<!ENTITY n-adv "adverbial noun (fukushitekimeishi)">
<!ENTITY n-suf "noun, used as a suffix">
<!ENTITY n-pref "noun, used as a prefix">
<!ENTITY n-t "noun (temporal) (jisoumeishi)">
<!ENTITY num "numeric">
<!ENTITY oK "word containing out-dated kanji">
<!ENTITY obs "obsolete term">
<!ENTITY obsc "obscure term">
<!ENTITY ok "out-dated or obsolete kana usage">
<!ENTITY oik "old or irregular kana form">
<!ENTITY on-mim "onomatopoeic or mimetic word">
<!ENTITY pn "pronoun">
<!ENTITY poet "poetical term">
<!ENTITY pol "polite (teineigo) language">
<!ENTITY pref "prefix">
<!ENTITY proverb "proverb">
<!ENTITY prt "particle">
<!ENTITY physics "physics terminology">
<!ENTITY quote "quotation">
<!ENTITY rare "rare">
<!ENTITY sens "sensitive">
<!ENTITY sl "slang">
<!ENTITY suf "suffix">
<!ENTITY uK "word usually written using kanji alone">
<!ENTITY uk "word usually written using kana alone">
<!ENTITY unc "unclassified">
<!ENTITY yoji "yojijukugo">
<!ENTITY v1 "Ichidan verb">
<!ENTITY v1-s "Ichidan verb - kureru special class">
<!ENTITY v2a-s "Nidan verb with 'u' ending (archaic)">
<!ENTITY v4h "Yodan verb with `hu/fu' ending (archaic)">
<!ENTITY v4r "Yodan verb with `ru' ending (archaic)">
<!ENTITY v5aru "Godan verb - -aru special class">
<!ENTITY v5b "Godan verb with `bu' ending">
<!ENTITY v5g "Godan verb with `gu' ending">
<!ENTITY v5k "Godan verb with `ku' ending">
<!ENTITY v5k-s "Godan verb - Iku/Yuku special class">
<!ENTITY v5m "Godan verb with `mu' ending">
<!ENTITY v5n "Godan verb with `nu' ending">
<!ENTITY v5r "Godan verb with `ru' ending">
<!ENTITY v5r-i "Godan verb with `ru' ending (irregular verb)">
<!ENTITY v5s "Godan verb with `su' ending">
<!ENTITY v5t "Godan verb with `tsu' ending">
<!ENTITY v5u "Godan verb with `u' ending">
<!ENTITY v5u-s "Godan verb with `u' ending (special class)">
<!ENTITY v5uru "Godan verb - Uru old class verb (old form of Eru)">
<!ENTITY vz "Ichidan verb - zuru verb (alternative form of -jiru verbs)">
<!ENTITY vi "intransitive verb">
<!ENTITY vk "Kuru verb - special class">
<!ENTITY vn "irregular nu verb">
<!ENTITY vr "irregular ru verb, plain form ends with -ri">
<!ENTITY vs "noun or participle which takes the aux. verb suru">
<!ENTITY vs-c "su verb - precursor to the modern suru">
<!ENTITY vs-s "suru verb - special class">
<!ENTITY vs-i "suru verb - included">
<!ENTITY kyb "Kyoto-ben">
<!ENTITY osb "Osaka-ben">
<!ENTITY ksb "Kansai-ben">
<!ENTITY ktb "Kantou-ben">
<!ENTITY tsb "Tosa-ben">
<!ENTITY thb "Touhoku-ben">
<!ENTITY tsug "Tsugaru-ben">
<!ENTITY kyu "Kyuushuu-ben">
<!ENTITY rkb "Ryuukyuu-ben">
<!ENTITY nab "Nagano-ben">
<!ENTITY hob "Hokkaido-ben">
<!ENTITY vt "transitive verb">
<!ENTITY vulg "vulgar expression or word">
<!ENTITY adj-kari "`kari' adjective (archaic)">
<!ENTITY adj-ku "`ku' adjective (archaic)">
<!ENTITY adj-shiku "`shiku' adjective (archaic)">
<!ENTITY adj-nari "archaic/formal form of na-adjective">
<!ENTITY n-pr "proper noun">
<!ENTITY v-unspec "verb unspecified">
<!ENTITY v4k "Yodan verb with `ku' ending (archaic)">
<!ENTITY v4g "Yodan verb with `gu' ending (archaic)">
<!ENTITY v4s "Yodan verb with `su' ending (archaic)">
<!ENTITY v4t "Yodan verb with `tsu' ending (archaic)">
<!ENTITY v4n "Yodan verb with `nu' ending (archaic)">
<!ENTITY v4b "Yodan verb with `bu' ending (archaic)">
<!ENTITY v4m "Yodan verb with `mu' ending (archaic)">
<!ENTITY v2k-k "Nidan verb (upper class) with `ku' ending (archaic)">
<!ENTITY v2g-k "Nidan verb (upper class) with `gu' ending (archaic)">
<!ENTITY v2t-k "Nidan verb (upper class) with `tsu' ending (archaic)">
<!ENTITY v2d-k "Nidan verb (upper class) with `dzu' ending (archaic)">
<!ENTITY v2h-k "Nidan verb (upper class) with `hu/fu' ending (archaic)">
<!ENTITY v2b-k "Nidan verb (upper class) with `bu' ending (archaic)">
<!ENTITY v2m-k "Nidan verb (upper class) with `mu' ending (archaic)">
<!ENTITY v2y-k "Nidan verb (upper class) with `yu' ending (archaic)">
<!ENTITY v2r-k "Nidan verb (upper class) with `ru' ending (archaic)">
<!ENTITY v2k-s "Nidan verb (lower class) with `ku' ending (archaic)">
<!ENTITY v2g-s "Nidan verb (lower class) with `gu' ending (archaic)">
<!ENTITY v2s-s "Nidan verb (lower class) with `su' ending (archaic)">
<!ENTITY v2z-s "Nidan verb (lower class) with `zu' ending (archaic)">
<!ENTITY v2t-s "Nidan verb (lower class) with `tsu' ending (archaic)">
<!ENTITY v2d-s "Nidan verb (lower class) with `dzu' ending (archaic)">
<!ENTITY v2n-s "Nidan verb (lower class) with `nu' ending (archaic)">
<!ENTITY v2h-s "Nidan verb (lower class) with `hu/fu' ending (archaic)">
<!ENTITY v2b-s "Nidan verb (lower class) with `bu' ending (archaic)">
<!ENTITY v2m-s "Nidan verb (lower class) with `mu' ending (archaic)">
<!ENTITY v2y-s "Nidan verb (lower class) with `yu' ending (archaic)">
<!ENTITY v2r-s "Nidan verb (lower class) with `ru' ending (archaic)">
<!ENTITY v2w-s "Nidan verb (lower class) with `u' ending and `we' conjugation (archaic)">
<!ENTITY archit "architecture term">
<!ENTITY astron "astronomy, etc. term">
<!ENTITY baseb "baseball term">
<!ENTITY biol "biology term">
<!ENTITY bot "botany term">
<!ENTITY bus "business term">
<!ENTITY econ "economics term">
<!ENTITY engr "engineering term">
<!ENTITY finc "finance term">
<!ENTITY geol "geology, etc. term">
<!ENTITY law "law, etc. term">
<!ENTITY mahj "mahjong term">
<!ENTITY med "medicine, etc. term">
<!ENTITY music "music term">
<!ENTITY Shinto "Shinto term">
<!ENTITY shogi "shogi term">
<!ENTITY sports "sports term">
<!ENTITY sumo "sumo term">
<!ENTITY zool "zoology term">
<!ENTITY joc "jocular, humorous term">
<!ENTITY anat "anatomical term">
<!ENTITY Christn "Christian term">
<!ENTITY net-sl "Internet slang">
<!ENTITY dated "dated term">
<!ENTITY hist "historical term">
<!ENTITY lit "literary or formal term">
<!ENTITY litf "literary or formal term">
<!ENTITY surname "family or surname">
<!ENTITY place "place name">
<!ENTITY unclass "unclassified name">
<!ENTITY company "company name">
<!ENTITY product "product name">
<!ENTITY work "work of art, literature, music, etc. name">
<!ENTITY person "full name of a particular person">
<!ENTITY given "given name or forename, gender not specified">
<!ENTITY station "railway station">
<!ENTITY organization "organization name">
]>

And one example entry:

<JMdict>
    <entry>
        <ent_seq>1000000</ent_seq>
        <r_ele>
            <reb>ヽ</reb>
        </r_ele>
        <sense>
            <pos>&unc;</pos>
            <xref>一の字点</xref>
            <gloss g_type="expl">repetition mark in katakana</gloss>
        </sense>
        <sense>
            <gloss xml:lang="dut">hitotsuten 一つ点: teken dat herhaling van het voorafgaande katakana-schriftteken aangeeft</gloss>
        </sense>
    </entry>

    // (...)
</JMdict>

The entry->sense->pos tag should expand &unc; into unclassified because of <!ENTITY unc "unclassified">, but in the resulting struct it comes up empty.

The underlying parser does pick up on the entities, because it will throw an Error Domain=NSXMLParserErrorDomain Code=111 "(null)" if even one of the entities used in the XML file is missing from the definitions in the DOCTYPE header.

MartinP7r commented 4 years ago

Some information about entity tags: https://www.logicbig.com/tutorials/misc/xml/xml-entity.html#:~:text=Internal%20Entities%3A%20An%20internal%20entity,defined%20in%20an%20separate%20file.

Possibly relevant Apple documentation: https://developer.apple.com/documentation/foundation/nsxmlparserdelegate/1412907-parser parser:foundUnparsedEntityDeclarationWithName:publicID:systemID:notationName:

https://developer.apple.com/documentation/foundation/nsxmlparserdelegate/1414803-parser parser:foundInternalEntityDeclarationWithName:value: This seems like it would be necessary to decode the entity shortcuts.

Relevant in the case of an external entity declaration (the entity declaration resides in another file): https://developer.apple.com/documentation/foundation/nsxmlparserdelegate/1416221-parser
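
As a rough illustration of the internal-entity callback linked above, here is a minimal sketch of a delegate that records the declarations it is told about. `EntityCollector` and `collectEntities(from:)` are hypothetical names, not part of XMLCoder, and whether XMLParser actually invokes this callback for a given document and platform is an assumption that should be verified.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML // XMLParser lives in this module on non-Apple platforms
#endif

/// Hypothetical delegate (not part of XMLCoder) that stores every internal
/// entity declaration reported through the XMLParserDelegate callback.
final class EntityCollector: NSObject, XMLParserDelegate {
    /// Maps entity name (e.g. "unc") to its replacement text.
    private(set) var entities: [String: String] = [:]

    func parser(_ parser: XMLParser,
                foundInternalEntityDeclarationWithName name: String,
                value: String?) {
        // A declaration without a value is stored as an empty string.
        entities[name] = value ?? ""
    }
}

/// Hypothetical helper: runs a parse pass only to gather declarations.
/// Assumes the parser reports them; returns an empty dictionary otherwise.
func collectEntities(from data: Data) -> [String: String] {
    let collector = EntityCollector()
    let parser = XMLParser(data: data)
    parser.delegate = collector
    _ = parser.parse()
    return collector.entities
}
```

Collecting the declarations this way only covers half of the problem: the references in the document body would still have to be substituted somewhere.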

MartinP7r commented 4 years ago

I forked your project and wrote a test case that fails as expected:

final class EntityTests: XCTestCase {

    let xml = """
    <!DOCTYPE note [
        <!ENTITY jd "John Doe">
    ]>

    <note>
        <author>&jd;</author>
    </note>
    """

    struct Note: Decodable {
        let author: String
    }

    func testEntityIsExpanded() throws {
        let decoded = try XMLDecoder().decode(Note.self,
                                              from: xml.data(using: .utf8)!)

        XCTAssertEqual(decoded.author, "John Doe")
    }

}

XCTAssertEqual failed: ("") is not equal to ("John Doe")

Note that the failure message says ("") is not equal to ("John Doe"), not ("&jd;") is not equal to ("John Doe"), so the entity reference is dropped entirely rather than passed through unexpanded.

I've only just started looking into your implementation, but if you're interested I'd try to add the feature for decoding (internal) entity definitions and make a pull request, provided it's actually implementable with XMLParserDelegate...

Edit: actually, this doesn't look too promising: https://stackoverflow.com/questions/44680734/parsing-xml-with-entities-in-swift-with-xmlparser

MartinP7r commented 4 years ago

Another (10-year-old) Stack Overflow comment and a 5-year-old radar (also https://www.mail-archive.com/cocoa-dev@lists.apple.com/msg67796.html) state that NSXMLParser doesn't pick up on entities other than the standard ones and will just remove them, or throw an error if they are not defined.

This seems to be the case.

As for my specific case, I will probably try and see if replacing them one by one before running the parser is somewhat efficient.
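
A minimal sketch of that pre-processing idea, done in a single pass rather than one replacement per entity. Both function names are hypothetical and not part of XMLCoder. It assumes every declaration has the simple double-quoted form `<!ENTITY name "value">` shown above, and that the values contain no nested entity references.

```swift
import Foundation

/// Hypothetical helper: collects declarations of the form
/// <!ENTITY name "value"> into a name-to-value dictionary.
func internalEntities(in xml: String) -> [String: String] {
    let pattern = "<!ENTITY\\s+([^\\s\"'%]+)\\s+\"([^\"]*)\"\\s*>"
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [:] }
    var entities: [String: String] = [:]
    let fullRange = NSRange(xml.startIndex..., in: xml)
    for match in regex.matches(in: xml, range: fullRange) {
        guard let nameRange = Range(match.range(at: 1), in: xml),
              let valueRange = Range(match.range(at: 2), in: xml) else { continue }
        entities[String(xml[nameRange])] = String(xml[valueRange])
    }
    return entities
}

/// Hypothetical helper: replaces every &name; reference whose name was
/// declared in the document. Unknown names (including the predefined
/// &amp;, &lt;, ... and numeric references) are left untouched.
func expandingInternalEntities(in xml: String) -> String {
    let entities = internalEntities(in: xml)
    guard !entities.isEmpty,
          let regex = try? NSRegularExpression(pattern: "&([^&;\\s]+);") else { return xml }
    var result = ""
    var cursor = xml.startIndex
    let fullRange = NSRange(xml.startIndex..., in: xml)
    for match in regex.matches(in: xml, range: fullRange) {
        guard let whole = Range(match.range, in: xml),
              let nameRange = Range(match.range(at: 1), in: xml) else { continue }
        // Copy the text before the reference, then the replacement (or the
        // original reference if the name is not a declared entity).
        result.append(contentsOf: xml[cursor..<whole.lowerBound])
        result.append(entities[String(xml[nameRange])] ?? String(xml[whole]))
        cursor = whole.upperBound
    }
    result.append(contentsOf: xml[cursor...])
    return result
}
```

The expanded string could then be converted with `Data(expanded.utf8)` and handed to `XMLDecoder().decode(_:from:)` as in the test case above. For a file the size of JMdict the whole document is held in memory twice, so this is a workaround rather than a general solution.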

Another possible solution would be to save all ENTITY definitions that get picked up by parser(parser: XMLParser, parseErrorOccurred parseError: NSError) and then check for and replace them within parser(_ parser: XMLParser, foundCharacters string: String).

Edit: I just tried a strategy of replacing within parser(_ parser: XMLParser, foundCharacters string: String), and sadly the &...; term doesn't even make it there; it seems to get replaced beforehand.