Closed johann-petrak closed 3 years ago
I got bitten by this recently. It turns out that even when reading XML files that are UTF-8 encoded, if you don't specify UTF-8 when first creating the document you might run into issues. The problem only happens if the XML contains a BoM. When that happens the BoM can be corrupted when the file is read in the wrong encoding (specifically Windows-1252) meaning the BoM stripping code we have can't see the BoM so it passes it to the XML parser which fails with "content before prolog". And of course even if I hard code UTF-8 inside the reader I pass to the parser, it won't fix the BoM. The only solution would be to read the document from the original URL a second time.
This will be less of an issue in GATE 9.0-SNAPSHOT or above where we now default to UTF-8 if no encoding is specified, but it's a different example of why it might be useful to avoid reading a file twice, or before we know exactly what we are doing with it.
I have code for this but not pushed yet, because I wanted to test it with a format first (for which I do not have the code yet)
Cool, let me know if you want me to test something, or want me to try implementing the new version (happy to do that for the ALTO XML plugin which is where I hit the problem described above).
OK, I pushed branch issue115.
I have quickly tested this now with a format I am working on.
To use it:
extends DocumentFormat implements BinaryDocumentFormat
public boolean shouldReadFromUrl(MimeType mimeType, URL sourceUrl)
and make it return false This just does an additional check in DocumentImpl.init() to see if the document content should be read from the URL. This is done by calling static DocumentFormat.shouldReadFromUrl(mimeType, sourceUrl) which contains all the logic for checking. Since this is about speed, that code does not read from the Url to find the docformat! Once the docformat has been found, we check if it is a BinaryDocumentFormat and if yes, ask if we should read from the URL given the set mimtetype and url
This is really cool, I do have a couple of comments/suggestions.
shouldReadFromUrl
methods confused me for ages. Given the methods are on DocumentFormat
instances I think it should return true
if the document format should read from the URL directly and false otherwise. That's the opposite of the current logic.BinaryDocumentFormat
is the best class name, especially given that the situation I want to use this with first is with an XML file. Having said that I can't think of a better option. The TrustMeIKnowBestDocumentFormat
just doesn't really trip off the tongue :DBinaryDocumentFormat
is actually overly complex. I think it should probably be more like Serializable
which doesn't have any methods but just flags that the class should be handled differently. In this case if you implement the interface then we know you can handle reading direct from the URL so we don't need the method (it would always return false, given the current logic), especially as I can't see any need to pass the mimeType or URL through to the interface method either.Does that all make sense?
I'm not sure
BinaryDocumentFormat
is the best class name, especially given that the situation I want to use this with first is with an XML file. Having said that I can't think of a better option. TheTrustMeIKnowBestDocumentFormat
just doesn't really trip off the tongue :D
How about DirectLoadingDocumentFormat
(as in "document format that loads directly from the source rather than the usual behaviour of GATE loading the content first and then passing it on").
Question: this would potentially be a backwards-incompatible change, but this mechanism would also give us the opportunity to make the standard XML document format default to respecting the encoding from the <?xml version="1.0" encoding="something"?>
declaration instead of using the platform default. Is this worth doing for version 9.0?
How about
DirectLoadingDocumentFormat
(as in "document format that loads directly from the source rather than the usual behaviour of GATE loading the content first and then passing it on").
That works well :+1:
Question: this would potentially be a backwards-incompatible change, but this mechanism would also give us the opportunity to make the standard XML document format default to respecting the encoding from the
<?xml version="1.0" encoding="something"?>
declaration instead of using the platform default. Is this worth doing for version 9.0?
I'm not viewing it as a backwards-incompatible change as such. It just means that plugins built against this won't load in older versions, which is fine. We've broken that kind of backwards compatibility before.
I think we should certainly adopt this for all the built in document formats (where it makes sense) including the Tika formats (which is where we started) as this doesn't break anything and should be fairly transparent to both the UI and the API. I'm assuming that for XML files, if we read from the URL directly then the parsers would respect the encoding attribute, so yes that is probably another document format that should be upgraded in this way.
One further question though, if the user has specified an encoding when creating the document should we pass that to whatever reader we create, or should we enforce the encoding we know to be best to allow them to overwrite in specific situations. For example, in most cases we want UTF-8 for XML files, but if someone deliberately sets the encoding to Windows-1252 should we honour that setting. My feeling is yes, if they've specifically set the encoding, as this will also mean behaviour stays consistent with previous GATE versions.
One further question though, if the user has specified an encoding when creating the document should we pass that to whatever reader we create, or should we enforce the encoding we know to be best to allow them to overwrite in specific situations. For example, in most cases we want UTF-8 for XML files, but if someone deliberately sets the encoding to Windows-1252 should we honour that setting. My feeling is yes, if they've specifically set the encoding, as this will also mean behaviour stays consistent with previous GATE versions.
Is there some kind of standard or recommendation for how to do this when both the XML header and the user specify an encoding? Because we could have all four combinations of where an encoding is specified, but the one where both specify different encodings needs some decision be made or an error/warning message be shown.
shouldReadFromUrl
on DocumentFormat should return true if the DocumentFormat should read from the URL itself. That's the opposite of how you currently interpret the result; i.e. if the format wants to read directly it needs to return false, which I just find very confusing.Is there some kind of standard or recommendation for how to do this when both the XML header and the user specify an encoding? Because we could have all four combinations of where an encoding is specified, but the one where both specify different encodings needs some decision be made or an error/warning message be shown.
Yes, I don't know. My feeling would be always to assume the user knows best and if they've set an encoding we should use it, but I don't know if there is any standard approach to doing this.
One further question though, if the user has specified an encoding when creating the document should we pass that to whatever reader we create, or should we enforce the encoding we know to be best to allow them to overwrite in specific situations. For example, in most cases we want UTF-8 for XML files, but if someone deliberately sets the encoding to Windows-1252 should we honour that setting. My feeling is yes, if they've specifically set the encoding, as this will also mean behaviour stays consistent with previous GATE versions.
If there's an encoding specified as a parameter then use that, definitely. I'm just suggesting that the default when encoding==null could now be to trust the parser rather than to assume the system default. How to implement this is a different question, because currently the translation from null to system-default happens inside DocumentImpl.getEncoding
, so there's no way for anything downstream of that to tell the difference between "not specified" vs "specified with a value that happens to match the platform default".
Right. I think the default in 9.0-SNAPSHOT is now to use UTF-8 if the user doesn't state an encoding (there was an issue for this somewhere), but yes not sure how to deal with "not specified" versus "specified" with the default.
I am fine with using DirectLoadingDocumentFormat as an indicator interface so the method shouldReadFromUrl in the DocumentFormat that implements that interface would not be required any more and we do not have any logic issues. The shouldReadFromUrl method in DocumentFormat would then return false if it finds a DirectLoadingDocumentFormat for the mimetype/url or true otherwise. (I basically did not want to take away flexibility initially (and I also wasn't sure if some of the code that runs after the document content gets created in the DocumentImpl.init() method may not sometimes be useful to allow to run if a docformat supports several mimetypes/url-patterns. But I guess if we really need that we can still refactor that coding into DocumentFormat and call it from there. )
Is there some kind of standard or recommendation for how to do this when both the XML header and the user specify an encoding?
Yes. From the XML spec:
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.
My bold - I think the encoding
parameter of DocumentImpl counts as "information provided by an external transport protocol".
The shouldReadFromUrl method in DocumentFormat would then return false if it finds a DirectLoadingDocumentFormat for the mimetype/url or true otherwise.
I think that's still backwards. What you have is a call to DocumentFormat.shouldReadFromUrl(mimetype, url)
which returns false if the DocumentFormat should read from the URL itself rather than the content being read in advance into the document content. To me that seems to be backwards; if the document format is going to read from the URL itself I'd expect that method to return true.
It's a minor thing though and as long as it's document I don't suppose it matters much, especially given the removal of the method from the interface so API users won't have to implement the method themselves.
Oh I see, so the confusion is if the method is saying the caller should read or the document format should read (I was always thinking of the caller asking and getting told what to do). I am really fine with either interpretation or we find a more unambiguous method name :)
Yes exactly. My assumption was that it should say if the format was going to read or not, but I agree you could interpret it both ways. As long as it's documented I doubt it really matters.
How about we call it DocumentFormat.willReadFromUrl(mimetype, url) and document it?
ok, have updated the branch accordingly
Merged into master so I can make use of it in the format-bdoc plugin.
Implemented and test for some time now, closing.
Currently, if a document should get read from a URL in some specific format, the document format code for reading only gets a chance to execute once a URL has been read.
For binary formats, this approach has the following problems:
It should be easy to change this behaviour in the following way:
This would already fix all important cases and document formats which need to read from a URL by themselves could simply refuse to work for creating documents from a string. However if the document gets created from a string, the document format could instead make an attempt to read from the given string using an encoding that allows to treat the string as a sequence of bytes.