GateNLP / gate-core

The GATE Embedded core API and GATE Developer application
GNU Lesser General Public License v3.0

GATE xml document created on Windows cannot be loaded on Linux #100

Closed johann-petrak closed 4 years ago

johann-petrak commented 4 years ago

Using the 8.7 snapshot on both systems, with AdoptOpenJDK 13.0.1 on Windows and Java 1.8.0_232 on Linux.

I created a new document on Windows in the GUI: I entered some text and copy-pasted an emoji. Then I ran ANNIE on the document and saved it as GATE XML.

The GATE XML document was transferred to Linux using git or email. In both cases, when I load the document in the GUI I get this error:

Warning: Document remains unparsed. 

  Stack Dump: 
com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0xd83d
 at [row,col,system-id]: [21,70,"file:/data/johann/Downloads/tiny1.xml"]
    at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:606)
    at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:479)
    at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2448)
    at com.ctc.wstx.sr.StreamScanner.validateChar(StreamScanner.java:2377)
    at com.ctc.wstx.sr.StreamScanner.resolveCharEnt(StreamScanner.java:2361)
    at com.ctc.wstx.sr.StreamScanner.fullyResolveEntity(StreamScanner.java:1507)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2746)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1072)
    at gate.corpora.DocumentStaxUtils.readTextWithNodes(DocumentStaxUtils.java:485)
    at gate.corpora.DocumentStaxUtils.readGateXmlDocument(DocumentStaxUtils.java:145)
    at gate.corpora.XmlDocumentFormat.unpackGateFormatMarkup(XmlDocumentFormat.java:183)
    at gate.corpora.XmlDocumentFormat.unpackMarkup(XmlDocumentFormat.java:133)
    at gate.corpora.XmlDocumentFormat.unpackMarkup(XmlDocumentFormat.java:85)
    at gate.corpora.DocumentImpl.init(DocumentImpl.java:315)
    at gate.Factory.createResource(Factory.java:430)
    at gate.gui.NewResourceDialog$4.run(NewResourceDialog.java:270)
    at java.lang.Thread.run(Thread.java:748)

greenwoodma commented 4 years ago

That's nasty. Can you attach the XML doc to the issue (or e-mail it to me)?

johann-petrak commented 4 years ago

GitHub does not allow attaching an XML document, but it is here:

https://github.com/GateNLP/gateplugin-python/blob/f89347b48e5bae0acd94d1902775dce64c2f5369/examples/docs/tiny1.xml

johann-petrak commented 4 years ago

Oh, it is even weirder: the document cannot be reloaded on Windows either!

johann-petrak commented 4 years ago

I also tried with FastInfoset and the new BdocJson format and both formats can save/reload without problems. So this is specific to the XML serialization.

ianroberts commented 4 years ago

So the serializer is writing the supplementary character as two separate character references for the two halves of the surrogate pair, rather than one reference for the whole code point. Does it do the same thing if you don't run ANNIE first (just create the doc and save it with no annotations)?
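To make the surrogate-pair point concrete, here is a minimal sketch in plain Java. It is not the serializer's code; the class and method names are hypothetical, and the emoji is assumed to be U+1F4A9 (the one shown later in the thread). It contrasts one character reference per UTF-16 `char` (which XML parsers reject) with one reference per code point.

```java
public class CharRefDemo {
    /** Problematic: one reference per UTF-16 char, which splits surrogate pairs. */
    public static String refsPerChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append("&#x")
              .append(Integer.toHexString(s.charAt(i)).toUpperCase())
              .append(';');
        }
        return sb.toString();
    }

    /** Well-formed: one reference per Unicode code point. */
    public static String refsPerCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp ->
            sb.append("&#x")
              .append(Integer.toHexString(cp).toUpperCase())
              .append(';'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the pasted emoji is U+1F4A9, built here without
        // relying on the source file's own encoding.
        String emoji = new String(Character.toChars(0x1F4A9));
        System.out.println(refsPerChar(emoji));      // &#xD83D;&#xDCA9;
        System.out.println(refsPerCodePoint(emoji)); // &#x1F4A9;
    }
}
```

The first form starts with `0xD83D`, the same code the parser complains about in the stack trace above.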

ianroberts commented 4 years ago

And I suspect the reason we haven't spotted this before is that it is specific to the document's encoding being "windows-1252". If it were UTF-8, the supplementary character would probably have been written out literally rather than as a character reference.
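A quick way to see why the target encoding matters is to ask each charset whether it can represent the character at all. This sketch uses only `java.nio.charset`; the class name is hypothetical and the emoji is again assumed to be U+1F4A9.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodableDemo {
    /** True if every character of the text can be encoded in the given charset. */
    public static boolean canEncode(String text, Charset cs) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F4A9)); // assumed emoji
        Charset cp1252 = Charset.forName("windows-1252");

        // windows-1252 has no mapping for the emoji, so a writer targeting it
        // has to fall back to a numeric character reference.
        System.out.println(canEncode(emoji, cp1252));                 // false
        // UTF-8 can encode any code point, so no reference is needed.
        System.out.println(canEncode(emoji, StandardCharsets.UTF_8)); // true
    }
}
```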

johann-petrak commented 4 years ago

Oddly, when I save a fresh document with the same content and no ANNIE and then try to load it, I immediately get an error popup with the same error message, and I stay in the GATE document parameters dialog. In the original case that dialog was dismissed and the error was only logged to the message pane. But yes, same error.

johann-petrak commented 4 years ago

Is there a reason why the GATE xml should not always use UTF-8?

ianroberts commented 4 years ago

https://bugs.openjdk.java.net/browse/JDK-8073700 is the only reference I can find to this issue; (a) it's from 2015 and (b) it's in the JDK-internal XMLStreamWriter implementation rather than the woodstox one.

ianroberts commented 4 years ago

Is there a reason why the GATE xml should not always use UTF-8?

The writer has always respected the encoding parameter from the document when choosing the encoding to use for output - you can override it from code by creating your own XMLStreamWriter and calling DocumentStaxUtils on that but for the GUI it uses the document encoding.
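As a rough illustration of the "create your own XMLStreamWriter" route, the sketch below uses only the JDK's StAX API to write a `TextWithNodes` element as UTF-8 and read it back. The class and method names are hypothetical. The actual hand-off to `DocumentStaxUtils` is not shown because it needs GATE on the classpath; the comment marking where it would go is an assumption about how the pieces fit together, not a verified call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class Utf8WriterDemo {
    /** Writes the text inside a TextWithNodes element, always as UTF-8. */
    public static byte[] writeUtf8(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            XMLStreamWriter xsw =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            xsw.writeStartDocument("UTF-8", "1.0");
            // Assumption: with GATE available, the document would be passed to
            // DocumentStaxUtils together with this writer instead of the
            // hand-written element below.
            xsw.writeStartElement("TextWithNodes");
            xsw.writeCharacters(text);
            xsw.writeEndElement();
            xsw.writeEndDocument();
            xsw.flush();
            xsw.close();
            return out.toByteArray();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads back all character data from the XML bytes. */
    public static String readText(byte[] xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(xml));
            StringBuilder sb = new StringBuilder();
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.CHARACTERS) {
                    sb.append(r.getText());
                }
            }
            r.close();
            return sb.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "tiny " + new String(Character.toChars(0x1F4A9)) + " tiny";
        System.out.println(text.equals(readText(writeUtf8(text))));
    }
}
```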

johann-petrak commented 4 years ago

I mean using UTF-8 as the default when writing and nothing is specified. But actually I just tested this: entering utf-8 in the encoding field, then creating the content with the emoji, saving, reloading. The result is that while no exception is thrown, the wrong characters are shown where the emoji should be. When I look at the raw XML, it does have utf-8 in the XML declaration, but the text is already garbled.

This is shown in GATE after re-loading:

tiny 💩 tiny

This is shown when looking at the XML: <TextWithNodes>tiny 💩 tiny</TextWithNodes>

ianroberts commented 4 years ago

saving, reloading.

And did you specify UTF-8 when reloading as well? The encoding of a GATE document defaults to the platform-specific default in all cases, it doesn't look at the <?xml declaration.
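The effect of reloading UTF-8 bytes with a different default charset can be reproduced without GATE. This is a minimal sketch with a hypothetical class name, assuming the platform default on the loading side is windows-1252 and the emoji is U+1F4A9.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Decodes the bytes with the given charset, as a loader would. */
    public static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String original = "tiny " + new String(Character.toChars(0x1F4A9)) + " tiny";
        byte[] saved = original.getBytes(StandardCharsets.UTF_8); // saved as UTF-8

        // Reloading with the wrong charset: each of the emoji's four UTF-8
        // bytes is turned into a separate, unrelated character.
        String wrong = decode(saved, Charset.forName("windows-1252"));
        // Reloading with the charset that was used for saving restores the text.
        String right = decode(saved, StandardCharsets.UTF_8);

        System.out.println(wrong.equals(original)); // false
        System.out.println(right.equals(original)); // true
    }
}
```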

greenwoodma commented 4 years ago

What we really need is Microsoft to fix Windows and use a sensible document encoding as the default, and not the ridiculously silly windows-1252

ianroberts commented 4 years ago

We've had this argument before about what the default encoding should be and whether we should make a special case for XML and look at the XML declaration in the file, but it's a chicken and egg problem - we have already created the DocumentImpl by the time we get to choosing which DocumentFormat to parse its markup with.
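For reference, looking at the XML declaration before anything else is constructed could be sketched roughly as below. This is a hypothetical helper, not something GATE provides; it only inspects the first bytes for an `encoding="..."` pseudo-attribute, ignores byte order marks and UTF-16 input, and falls back to a caller-supplied charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlEncodingSniffer {
    // Matches the encoding pseudo-attribute of a leading XML declaration.
    private static final Pattern DECL = Pattern.compile(
        "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the declared charset if present and supported, else the fallback. */
    public static Charset sniff(byte[] head, Charset fallback) {
        int n = Math.min(head.length, 256);
        // ISO-8859-1 maps every byte to one char, so decoding never fails;
        // the declaration itself is expected to be ASCII-compatible.
        String prefix = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(prefix);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] head = "<?xml version='1.0' encoding='windows-1252'?><GateDocument>"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(head, StandardCharsets.UTF_8)); // windows-1252
    }
}
```

Whether such a check could run early enough in GATE is exactly the chicken-and-egg question raised above; the sketch only shows that the declaration is cheap to read.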

johann-petrak commented 4 years ago

And did you specify UTF-8 when reloading as well? The encoding of a GATE document defaults to the platform-specific default in all cases, it doesn't look at the <?xml declaration.

Why? If the encoding of the serialized XML can be whatever is chosen at write time, why would it then not use what is saved? The user has to know the encoding rather than GATE using the info stored with the document?

johann-petrak commented 4 years ago

OK, when I specify UTF-8 at load time I get back what I should, and it works.

johann-petrak commented 4 years ago

We've had this argument before about what the default encoding should be and whether we should make a special case for XML and look at the XML declaration in the file, but it's a chicken and egg problem - we have already created the DocumentImpl by the time we get to choosing which DocumentFormat to parse its markup with.

That is still a technical design problem though: all the poor souls who use XML have to deal with the flawed way XML handles this. But my original question was really why not just always default to UTF-8 if nothing is specified (and maybe strongly discourage ever entering anything in the encoding field, maybe even getting rid of it at some point). For GATE XML, there should be nothing that cannot be represented with UTF-8, and I do not think there are still systems in use which do not support UTF-8.

ianroberts commented 4 years ago

We definitely need to be able to load documents in other encodings but I agree there could be an argument for making UTF-8 the default (and if we're going to break backwards compatibility like that then a new major release is the time to do so).

johann-petrak commented 4 years ago

Yes, that's what I meant. The encoding field could still be around (for a while) to allow reading old documents in different encodings. But otherwise using UTF-8 should spread happiness! The bug originally raised here would then also be less critical, because users would have to explicitly choose the windows-1252 encoding when saving to make it happen.

greenwoodma commented 4 years ago

Surely defaulting to UTF-8 just makes it more awkward for Windows users as they are more likely to end up with the wrong encoding.

I agree, though, that if we do want to break things, now would be the time to do so. What I'd do is make UTF-8 the default value for the encoding param so users can see that it's being used when they load a document; i.e. it would be in the dialog box. They would then have to change it when they needed to.

ianroberts commented 4 years ago

But still leave it optional, so if you specifically want the old behaviour of taking the default for the current platform then you can do so by manually setting it to blank.

greenwoodma commented 4 years ago

Yes, that works for me. I'll add it to my list for 9.0