BaseXdb / basex

BaseX Main Repository.
http://basex.org
BSD 3-Clause "New" or "Revised" License

Ignore encoding from XML declaration when parsing a String value #2255

Closed GuntherRademacher closed 1 year ago

GuntherRademacher commented 1 year ago

Some time ago I was using a Java tool (can't remember what it was) that generated XML in a Java string starting with an XML declaration of

    <?xml version="1.0" encoding="UTF-16"?>...

When passing that via IO.get to the DBNode constructor,

    new DBNode(IO.get(xml))

it failed with a parsing error: the string is internally encoded as UTF-8, and the resulting byte stream is then passed to the XML parser, which decodes it according to the encoding from the XML declaration. At the time I fixed this in the application by omitting the XML declaration, but later I realized that the same problem can be reproduced with a query like

    parse-xml('<?xml version="1.0" encoding="utf-16"?><xml/>')
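
The mismatch can be illustrated without any XML parser. The following is a minimal sketch using only the JDK; the class name `DeclaredEncodingMismatch` and the method `roundTrip` are hypothetical and not part of BaseX. It encodes the string as UTF-8 and then decodes the bytes with a given charset, which is roughly what happens when a parser trusts a declaration that does not match the actual bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DeclaredEncodingMismatch {
  /** Encodes the text as UTF-8, then decodes the bytes with the given charset. */
  static String roundTrip(String text, Charset decodeWith) {
    byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
    return new String(bytes, decodeWith);
  }

  public static void main(String[] args) {
    String xml = "<?xml version=\"1.0\" encoding=\"utf-16\"?><xml/>";
    // Decoding with the charset that was actually used restores the text.
    System.out.println(roundTrip(xml, StandardCharsets.UTF_8).equals(xml));
    // Decoding with the declared charset pairs up the bytes into unrelated
    // characters, so the result no longer starts with an XML declaration.
    System.out.println(roundTrip(xml, StandardCharsets.UTF_16).startsWith("<?xml"));
  }
}
```

With UTF-16 decoding, each pair of UTF-8 bytes is read as a single code unit, so a parser working on that text sees no valid markup at all.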

This PR is a proposal for fixing this, by making IOContent supply a Reader that will decode the byte array as UTF-8, and have the XML parser ignore the encoding presented in the XML declaration.
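
The general idea of handing the parser characters instead of bytes can be sketched with the JDK's own JAXP parser. This is an illustration under assumptions, not the BaseX implementation: it does not use IOContent or the BaseX parser, and the class name `ReaderParseSketch` and the method `rootName` are hypothetical. When a SAX `InputSource` carries a character stream, the parser has no bytes to decode, so the encoding named in the XML declaration is not applied.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ReaderParseSketch {
  /** Parses UTF-8 bytes through a Reader and returns the root element name. */
  static String rootName(String xml) {
    byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
    // Decode the bytes ourselves, using the encoding they really have.
    try (Reader reader = new InputStreamReader(
        new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder().parse(new InputSource(reader));
      return doc.getDocumentElement().getNodeName();
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    // The declaration claims UTF-16, but the parser only sees characters.
    System.out.println(
        rootName("<?xml version=\"1.0\" encoding=\"utf-16\"?><xml/>"));
  }
}
```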

GuntherRademacher commented 1 year ago

This PR is a proposal for fixing this, by making IOContent supply a Reader that will decode the byte array as UTF-8, and have the XML parser ignore the encoding presented in the XML declaration.

My first attempt to fix this neglected the fact that the encoding of IOContent can be different from UTF-8, as in FetchModuleTest.binaryDoc. Also, for some reason I failed to use InputSource.setEncoding.
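
For readers unfamiliar with `InputSource.setEncoding`, here is a minimal sketch of what it does for a byte stream that is not UTF-8. It uses the JDK's JAXP parser rather than BaseX, and the class name `ExplicitEncodingSketch` and the method `rootText` are hypothetical. The input deliberately has no XML declaration; how a given parser resolves a conflict between this setting and a declaration is not covered here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ExplicitEncodingSketch {
  /** Parses a byte stream whose encoding is supplied by the caller. */
  static String rootText(byte[] bytes, String encoding) {
    try {
      InputSource source = new InputSource(new ByteArrayInputStream(bytes));
      // Tell the parser how to decode the bytes instead of letting it guess.
      source.setEncoding(encoding);
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder().parse(source);
      return doc.getDocumentElement().getTextContent();
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    // The byte 0xE9 is "é" in ISO-8859-1 but is not valid UTF-8 on its own.
    byte[] latin1 = "<xml>caf\u00e9</xml>".getBytes(StandardCharsets.ISO_8859_1);
    System.out.println(rootText(latin1, "ISO-8859-1"));
  }
}
```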

I have now replaced it by

ChristianGruen commented 1 year ago

…merged (with just some minor changes).