Avoid unnecessary/duplicate reading of documents, depending on mime type

johann-petrak commented 4 years ago

Currently, if a document should get read from a URL in some specific format, the document format code for reading only gets a chance to execute once a URL has been read.

For binary formats, this approach has the following problems:

the file content always gets read into a String which means it must be possible to represent it as a String. However this depends on the encoding specified as a parameter (or the default used). There is no way to guarantee that the String will end up being an unambiguous representation of the original byte stream or even that reading will not fail
Therefore, some formats just throw away the data that has been read and use the sourceUrl parameter to re-read form the URL again which means that every file in such a format needs to get read twice. This will severely impact performance especially in those cases where a binary format would be desirable to improve performance
if the format is based on selecting only part of a larger file (e.g. extracting text out of a movie file), that approach would unnecessarily try to store everything in memory

It should be easy to change this behaviour in the following way:

There will be a no interface for document formats which want to read a URL themselves
the interface provides a method with parameters URL and mimetype (or all document parameters) which returns true or false, depending on whether the document format wants to read the URL itself
During document construction when creating from a URL, the document format registered for the mime type is checked if it implements that new format and if yes, the method is called with the parameters known for that document.
if the method indicates that the document format wants to read, the constructor does everything except reading from the URL and instead sets the document content to an empty content.
Once the document format code takes control the actual reading and setting of the document content happens in whatever way necessary

This would already fix all important cases and document formats which need to read from a URL by themselves could simply refuse to work for creating documents from a string. However if the document gets created from a string, the document format could instead make an attempt to read from the given string using an encoding that allows to treat the string as a sequence of bytes.

greenwoodma commented 4 years ago

I got bitten by this recently. It turns out that even when reading XML files that are UTF-8 encoded, if you don't specify UTF-8 when first creating the document you might run into issues. The problem only happens if the XML contains a BoM. When that happens the BoM can be corrupted when the file is read in the wrong encoding (specifically Windows-1252) meaning the BoM stripping code we have can't see the BoM so it passes it to the XML parser which fails with "content before prolog". And of course even if I hard code UTF-8 inside the reader I pass to the parser, it won't fix the BoM. The only solution would be to read the document from the original URL a second time.

This will be less of an issue in GATE 9.0-SNAPSHOT or above where we now default to UTF-8 if no encoding is specified, but it's a different example of why it might be useful to avoid reading a file twice, or before we know exactly what we are doing with it.

johann-petrak commented 4 years ago

I have code for this but not pushed yet, because I wanted to test it with a format first (for which I do not have the code yet)

greenwoodma commented 4 years ago

Cool, let me know if you want me to test something, or want me to try implementing the new version (happy to do that for the ALTO XML plugin which is where I hit the problem described above).

johann-petrak commented 4 years ago

OK, I pushed branch issue115.

I have quickly tested this now with a format I am working on.

To use it:

make the docformat class extends DocumentFormat implements BinaryDocumentFormat
implement method public boolean shouldReadFromUrl(MimeType mimeType, URL sourceUrl) and make it return false
docformat will gate a document with empty DocumentContent and can use sourceUrl to read directly

johann-petrak commented 4 years ago

This just does an additional check in DocumentImpl.init() to see if the document content should be read from the URL. This is done by calling static DocumentFormat.shouldReadFromUrl(mimeType, sourceUrl) which contains all the logic for checking. Since this is about speed, that code does not read from the Url to find the docformat! Once the docformat has been found, we check if it is a BinaryDocumentFormat and if yes, ask if we should read from the URL given the set mimtetype and url

greenwoodma commented 4 years ago

This is really cool, I do have a couple of comments/suggestions.

the logic of the shouldReadFromUrl methods confused me for ages. Given the methods are on DocumentFormat instances I think it should return true if the document format should read from the URL directly and false otherwise. That's the opposite of the current logic.
I'm not sure BinaryDocumentFormat is the best class name, especially given that the situation I want to use this with first is with an XML file. Having said that I can't think of a better option. The TrustMeIKnowBestDocumentFormat just doesn't really trip off the tongue :D
I think that the BinaryDocumentFormat is actually overly complex. I think it should probably be more like Serializable which doesn't have any methods but just flags that the class should be handled differently. In this case if you implement the interface then we know you can handle reading direct from the URL so we don't need the method (it would always return false, given the current logic), especially as I can't see any need to pass the mimeType or URL through to the interface method either.

Does that all make sense?

ianroberts commented 4 years ago

I'm not sure BinaryDocumentFormat is the best class name, especially given that the situation I want to use this with first is with an XML file. Having said that I can't think of a better option. The TrustMeIKnowBestDocumentFormat just doesn't really trip off the tongue :D

How about DirectLoadingDocumentFormat (as in "document format that loads directly from the source rather than the usual behaviour of GATE loading the content first and then passing it on").

Question: this would potentially be a backwards-incompatible change, but this mechanism would also give us the opportunity to make the standard XML document format default to respecting the encoding from the <?xml version="1.0" encoding="something"?> declaration instead of using the platform default. Is this worth doing for version 9.0?

greenwoodma commented 4 years ago

How about DirectLoadingDocumentFormat (as in "document format that loads directly from the source rather than the usual behaviour of GATE loading the content first and then passing it on").

That works well :+1:

Question: this would potentially be a backwards-incompatible change, but this mechanism would also give us the opportunity to make the standard XML document format default to respecting the encoding from the <?xml version="1.0" encoding="something"?> declaration instead of using the platform default. Is this worth doing for version 9.0?

I'm not viewing it as a backwards-incompatible change as such. It just means that plugins built against this won't load in older versions, which is fine. We've broken that kind of backwards compatibility before.

I think we should certainly adopt this for all the built in document formats (where it makes sense) including the Tika formats (which is where we started) as this doesn't break anything and should be fairly transparent to both the UI and the API. I'm assuming that for XML files, if we read from the URL directly then the parsers would respect the encoding attribute, so yes that is probably another document format that should be upgraded in this way.

One further question though, if the user has specified an encoding when creating the document should we pass that to whatever reader we create, or should we enforce the encoding we know to be best to allow them to overwrite in specific situations. For example, in most cases we want UTF-8 for XML files, but if someone deliberately sets the encoding to Windows-1252 should we honour that setting. My feeling is yes, if they've specifically set the encoding, as this will also mean behaviour stays consistent with previous GATE versions.

johann-petrak commented 4 years ago

I am not sure I understand what you mean -- the static shouldReadFromUrl method has been implemented just now for the purpose of figuring out if reading should be delayed and it returns whatever the BinaryDocumentFormat says should be done for that mimetype/url or it returns true to stay backwards compatible if there is no binaryDocumentFormat we can find.
i am fine with DirectLoadingDocumentFormat or whatever you like better :)
I think just implementing the shouldReadFromUrl method is not that complex, but it adds a lot of flexibility and allows the document format to be the one to make the actual decision about when to read the content because if the format-agnostic code finally finds our document format to be the one responsible for the url or mime type, it may still be ok to directly read into the content (since a docformat can have many extensions and mimetimes registered). On the other hand, if a document format can do everything that may happen now after direct reading as well, then we can just use an indicator interface and call that code (I was not sure about that, e.g. the collect repositioning info stuff etc)

johann-petrak commented 4 years ago

One further question though, if the user has specified an encoding when creating the document should we pass that to whatever reader we create, or should we enforce the encoding we know to be best to allow them to overwrite in specific situations. For example, in most cases we want UTF-8 for XML files, but if someone deliberately sets the encoding to Windows-1252 should we honour that setting. My feeling is yes, if they've specifically set the encoding, as this will also mean behaviour stays consistent with previous GATE versions.

Is there some kind of standard or recommendation for how to do this when both the XML header and the user specify an encoding? Because we could have all four combinations of where an encoding is specified, but the one where both specify different encodings needs some decision be made or an error/warning message be shown.

greenwoodma commented 4 years ago

I think the confusion is that a method that's called shouldReadFromUrl on DocumentFormat should return true if the DocumentFormat should read from the URL itself. That's the opposite of how you currently interpret the result; i.e. if the format wants to read directly it needs to return false, which I just find very confusing.
I think my point was, you are only calling this method once you've determined that it can handle the specific mimeType so passing that through seems redundant. I guess the only occasion where you would need this would be if a document format supports multiple mimeTypes but for some reason can't read from a URL for one of them. That seems highly unlikely. If the code can read from a URL for one mime type I can't see why it would use the existing document content for others. In theory, for performance reasons alone, it would be nice to shift to this essentially being the default approach, so I think forcing a document format to support this for all mimeTypes it supports is probably a good way of nudging people in the right direction, at which point the method becomes redundant because if you implement the new interface you will always want to read from the URL, otherwise you wouldn't implement this interface.

greenwoodma commented 4 years ago

Is there some kind of standard or recommendation for how to do this when both the XML header and the user specify an encoding? Because we could have all four combinations of where an encoding is specified, but the one where both specify different encodings needs some decision be made or an error/warning message be shown.

Yes, I don't know. My feeling would be always to assume the user knows best and if they've set an encoding we should use it, but I don't know if there is any standard approach to doing this.

ianroberts commented 4 years ago

One further question though, if the user has specified an encoding when creating the document should we pass that to whatever reader we create, or should we enforce the encoding we know to be best to allow them to overwrite in specific situations. For example, in most cases we want UTF-8 for XML files, but if someone deliberately sets the encoding to Windows-1252 should we honour that setting. My feeling is yes, if they've specifically set the encoding, as this will also mean behaviour stays consistent with previous GATE versions.

If there's an encoding specified as a parameter then use that, definitely. I'm just suggesting that the default when encoding==null could now be to trust the parser rather than to assume the system default. How to implement this is a different question, because currently the translation from null to system-default happens inside DocumentImpl.getEncoding, so there's no way for anything downstream of that to tell the difference between "not specified" vs "specified with a value that happens to match the platform default".

greenwoodma commented 4 years ago

Right. I think the default in 9.0-SNAPSHOT is now to use UTF-8 if the user doesn't state an encoding (there was an issue for this somewhere), but yes not sure how to deal with "not specified" versus "specified" with the default.

johann-petrak commented 4 years ago

I am fine with using DirectLoadingDocumentFormat as an indicator interface so the method shouldReadFromUrl in the DocumentFormat that implements that interface would not be required any more and we do not have any logic issues. The shouldReadFromUrl method in DocumentFormat would then return false if it finds a DirectLoadingDocumentFormat for the mimetype/url or true otherwise. (I basically did not want to take away flexibility initially (and I also wasn't sure if some of the code that runs after the document content gets created in the DocumentImpl.init() method may not sometimes be useful to allow to run if a docformat supports several mimetypes/url-patterns. But I guess if we really need that we can still refactor that coding into DocumentFormat and call it from there. )

ianroberts commented 4 years ago

Is there some kind of standard or recommendation for how to do this when both the XML header and the user specify an encoding?

Yes. From the XML spec:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.

My bold - I think the encoding parameter of DocumentImpl counts as "information provided by an external transport protocol".

greenwoodma commented 4 years ago

The shouldReadFromUrl method in DocumentFormat would then return false if it finds a DirectLoadingDocumentFormat for the mimetype/url or true otherwise.

I think that's still backwards. What you have is a call to DocumentFormat.shouldReadFromUrl(mimetype, url) which returns false if the DocumentFormat should read from the URL itself rather than the content being read in advance into the document content. To me that seems to be backwards; if the document format is going to read from the URL itself I'd expect that method to return true.

It's a minor thing though and as long as it's document I don't suppose it matters much, especially given the removal of the method from the interface so API users won't have to implement the method themselves.

johann-petrak commented 4 years ago

Oh I see, so the confusion is if the method is saying the caller should read or the document format should read (I was always thinking of the caller asking and getting told what to do). I am really fine with either interpretation or we find a more unambiguous method name :)

greenwoodma commented 4 years ago

Yes exactly. My assumption was that it should say if the format was going to read or not, but I agree you could interpret it both ways. As long as it's documented I doubt it really matters.

johann-petrak commented 4 years ago

How about we call it DocumentFormat.willReadFromUrl(mimetype, url) and document it?

johann-petrak commented 4 years ago

ok, have updated the branch accordingly

johann-petrak commented 4 years ago

Merged into master so I can make use of it in the format-bdoc plugin.

johann-petrak commented 3 years ago

Implemented and test for some time now, closing.

GateNLP / gate-core

Avoid unnecessary/duplicate reading of documents, depending on mime type #115