GateNLP / gate-core

The GATE Embedded core API and GATE Developer application
GNU Lesser General Public License v3.0

Use document formats registered for a mime type or extension for saving through the api #116

Closed johann-petrak closed 4 years ago

johann-petrak commented 4 years ago

Currently, a document format can register itself to get invoked for a mime type and/or extension. If a document is read from a URL with the given extension or mime-type set, the document format is invoked automatically, no matter if the reading is done through the GUI or using the standard GATE API.

However, this is not the case when saving a document. The API offers no way to save a document automatically with the correct document format code based on its extension or a provided mime type. Instead, the client code needs to know which plugin and document format are required in order to save the document and invoke the correct method. All of these differ from the standard way of saving a document in GATE's own XML format. In the GUI, the association exists only through entries added to the action menus.

It would be much better if document formats could register themselves for mime types and extensions for saving as well, and if there were one API method, SomeClass.save(doc, URL, mimetype) or similar, that could always be used to save the document, whatever the format, including GATE's original XML format.
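To make the proposal concrete, here is a minimal sketch of what such a single entry point could look like. Everything in it is an assumption for illustration: `GenericSaveSketch`, `register` and `save` are hypothetical names, not part of the GATE API, and the "document" is reduced to a plain string so the example stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: these names are illustrative and are not part of the GATE API.
final class GenericSaveSketch {
    // One serialiser per key; keys are mime types ("text/plain") or extensions ("txt").
    private static final Map<String, UnaryOperator<String>> BY_MIME = new HashMap<>();
    private static final Map<String, UnaryOperator<String>> BY_EXTENSION = new HashMap<>();

    // A format registers itself for saving under a mime type and an extension.
    static void register(String mimeType, String extension, UnaryOperator<String> serialiser) {
        BY_MIME.put(mimeType, serialiser);
        BY_EXTENSION.put(extension, serialiser);
    }

    // Picks a serialiser by mime type first, then falls back to the file extension.
    // A real method would write the result to the target instead of returning it.
    static String save(String documentText, String fileName, String mimeType) {
        UnaryOperator<String> serialiser = null;
        if (mimeType != null) {
            serialiser = BY_MIME.get(mimeType);
        }
        if (serialiser == null) {
            int dot = fileName.lastIndexOf('.');
            String extension = dot < 0 ? "" : fileName.substring(dot + 1);
            serialiser = BY_EXTENSION.get(extension);
        }
        if (serialiser == null) {
            throw new IllegalArgumentException("No format registered for " + fileName);
        }
        return serialiser.apply(documentText);
    }
}
```

The caller only names the target and, optionally, the mime type; which code does the writing depends entirely on what has been registered beforehand.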

greenwoodma commented 4 years ago

I've just had a look at doing this and almost got a complete implementation done, before realising that it fails even just with the two default exporters available in GATE.

The problem is that, as with document formats for loading, you can easily have multiple exporters that write to files with the same extension and mimetype but which produce very different output. For example, the two default exporters (GATE XML and Inline XML) both use xml and text/xml for the extension and mimetype respectively. That's perfectly valid, as both do write XML files, but it means that you can't tell them apart by the extension or mimetype. We have a number of other formats that probably clash in the same way. Whilst I could add something that works when it can only find one possible exporter given the extension or mimetype, that seems dangerous, as it would behave differently depending on which plugins you had loaded.
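The clash can be illustrated with a small sketch. It assumes a hypothetical registry (`ExporterRegistrySketch` is not a GATE class) that keeps every exporter registered for a mime type and resolves a lookup only when exactly one candidate exists, which is the "works only when unambiguous" behaviour described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the registry and its methods are illustrative, not GATE API.
final class ExporterRegistrySketch {
    // Every exporter name registered for a mime type, in registration order.
    private final Map<String, List<String>> exportersByMime = new HashMap<>();

    void register(String mimeType, String exporterName) {
        exportersByMime.computeIfAbsent(mimeType, key -> new ArrayList<>()).add(exporterName);
    }

    List<String> candidates(String mimeType) {
        return exportersByMime.getOrDefault(mimeType, Collections.emptyList());
    }

    // Succeeds only if exactly one exporter is registered for the mime type.
    String resolve(String mimeType) {
        List<String> found = candidates(mimeType);
        if (found.size() != 1) {
            throw new IllegalStateException(found.size() + " exporters registered for " + mimeType);
        }
        return found.get(0);
    }
}
```

With only "GATE XML" registered for text/xml, `resolve` succeeds; as soon as "Inline XML" is registered for the same mime type, the same call fails. The outcome therefore depends on which plugins happen to be loaded.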

The current approach of code specifying the unambiguous classname to the exporter you want to use is analogous to the way the factory works for creating GATE resources and seems the logical compromise; API users are already used to using the classname to create resources, so using it to fetch the relevant exporter shouldn't be an issue.

For now I'm going to close the issue (mostly as I'm trying to clean up and figure out what's left for the next release). If anyone can come up with a sensible approach that works in a logical fashion regardless of how many plugins offer exporters for the same extension or mimetype then I'd be happy to revisit this to make the API easier for people to use.

johann-petrak commented 4 years ago

We can revisit this when there is time, no urgency. Writing this for now to save my thoughts to the cloud:

My argument would have gone the other way round: client code knows it wants to write using mime type/extension something/whatever, so it uses (eventually) the generic method save(document, file, "something/whatever"), which uses whatever plugin is responsible for that mime type. Which plugin is responsible depends on e.g. a pipeline that has been loaded earlier, an action by the user, or the plugin being loaded either by that client code or by some other code (so that the client code does not need to know). If the client code really NEEDS complete control, it can still use the API of the required class directly; that would not get abandoned.

I just think that the main use case for a simple means of loading and saving documents is one where the correct plugins for that situation are getting loaded by some means earlier anyway. Loading and saving documents from code appears to be extremely complex, using code that is hard to discover and that exposes implementation details, but I think 99% of the time a simple generic utility method that takes the URL and mimetype should work.

What I do not understand is: you write that " like with document formats for loading, you can easily have multiple ... " so we may have the same situation when loading but still support using the registered class no? So how is writing different from loading?

greenwoodma commented 4 years ago

What I do not understand is: you write that " like with document formats for loading, you can easily have multiple ... " so we may have the same situation when loading but still support using the registered class no? So how is writing different from loading?

Yes, with loading the situation is even worse, as once you have more than one plugin register for the same mimetype you can only use the last one that was loaded. The earlier ones become inaccessible, which I feel is really horrid.
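The "last one loaded wins" behaviour is what you get from any registry that stores a single entry per mime type. The sketch below is a generic illustration of that effect under this assumption; `LastWinsSketch` is a hypothetical class and does not reproduce GATE's actual registration code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of last-registration-wins; not GATE's actual registry code.
final class LastWinsSketch {
    // Exactly one format name per mime type, so a later entry replaces an earlier one.
    private final Map<String, String> formatByMime = new HashMap<>();

    // Returns the previously registered format that this call displaced, or null.
    String register(String mimeType, String formatName) {
        return formatByMime.put(mimeType, formatName);
    }

    // Only the most recently registered format is reachable here.
    String lookup(String mimeType) {
        return formatByMime.get(mimeType);
    }
}
```

After two registrations for the same mime type, `lookup` can only ever return the second; the first is no longer reachable through the registry.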

I think the other problem, which I didn't touch on, is that most of the document formats (other than those designed for round tripping) need parameters specifying which annotation sets and annotations should be saved, etc. You can't do that via a simple method either, as you still need to know the exact params to set, just like when you create a resource.

As I said, I think we should view finding the right exporter as being analogous to creating an instance of a resource rather than loading a document. In fact, you could create your own instance of the exporter if you wanted; I just set up the code originally to treat them like tools, where one is auto-created, so that we could support the GUI actions and reduce the overhead (i.e. one instance, not many).