GateNLP / gate-core

The GATE Embedded core API and GATE Developer application
GNU Lesser General Public License v3.0

Severe performance degradation when loading GATE XML documents #122

Closed johann-petrak closed 3 years ago

johann-petrak commented 4 years ago

Loading (in the GUI) a corpus of 2015 documents (about 1.4G) with many annotations and features after starting a fresh GATE with -Xmx20G -Xms20G -tmp:

Testing if this may be related to the changed xstream version used:

So it appears this is somehow related to the xstream version we use.

ianroberts commented 4 years ago

We definitely can't roll back to 1.4.7 as that dates back to 2014 and there are several serious CVEs that have been fixed since then relating to exposure of filesystem data and arbitrary code execution.

greenwoodma commented 4 years ago

My guess is that this is probably down to the extra security checks xstream does to prevent code execution etc. (if you look at the change log it's mostly just new check after new check) and may just be something we have to swallow.

johann-petrak commented 4 years ago

Yes, I was thinking the same, but once I have more info I might check whether somebody has opened an issue about this in the xstream repo and, if not, open one. A more than 5-fold slowdown is really quite annoying and maybe not really necessary, even with all those checks :)

greenwoodma commented 4 years ago

It may be worth looking to see if any of the xstream dependencies themselves have changed version as it might not be a change in xstream itself. I have a feeling I had to mess with another related dependency recently.

johann-petrak commented 4 years ago

We made a big jump from 1.4.7 to 1.4.11.1 so I also checked with other xstream versions:

So there was some gradual slowdown to version 1.4.9, but the severe slowdown happened from version 1.4.9 to 1.4.10.

greenwoodma commented 4 years ago

Interestingly 1.4.10 includes a fix for a performance issue: https://github.com/x-stream/xstream/issues/61

Maybe there is a related, but unfixed, issue. Might be a good starting point if you want to go digging about.

johann-petrak commented 4 years ago

I have created https://github.com/x-stream/xstream/issues/200. The answer points to the FAQ: http://x-stream.github.io/faq.html#Scalability_Performance. When time allows, we can take a closer look at how we actually do this, especially the advice to keep an initialized instance around, which could be very important for the population task.

greenwoodma commented 4 years ago

Cool. A singleton instance might make configuring the security side easier as well; it should certainly make it easier for us to expose an API for gate users to further configure the xstream security.
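To illustrate the "keep an initialized instance around" advice discussed above, here is a minimal sketch of the difference between constructing a serializer per document and sharing one instance. It deliberately does not use XStream itself: `ExpensiveSerializer`, `SharedSerializer` and `SerializerReuseDemo` are hypothetical names invented for this example, and the stand-in class only counts how often its setup runs. It is not the code that was committed to gate-core.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a serializer whose setup is costly (assumption, not XStream). */
final class ExpensiveSerializer {
    /** Counts constructions so the demo can show how often setup runs. */
    static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();

    ExpensiveSerializer() {
        // A real serializer would register converters, aliases and
        // security permissions here; that setup is the costly part.
        CONSTRUCTIONS.incrementAndGet();
    }

    /** Placeholder for deserialization; just returns the trimmed input. */
    String read(String xml) {
        return xml.trim();
    }
}

/** Holds one lazily created, shared serializer instance. */
final class SharedSerializer {
    private SharedSerializer() {}

    // Initialization-on-demand holder idiom: the JVM initializes Holder
    // (and therefore INSTANCE) exactly once, on first access, thread-safely.
    private static final class Holder {
        static final ExpensiveSerializer INSTANCE = new ExpensiveSerializer();
    }

    static ExpensiveSerializer get() {
        return Holder.INSTANCE;
    }
}

final class SerializerReuseDemo {
    /** Builds a new serializer per document; returns how many were constructed. */
    static int loadWithFreshInstances(int documents) {
        int before = ExpensiveSerializer.CONSTRUCTIONS.get();
        for (int i = 0; i < documents; i++) {
            new ExpensiveSerializer().read("<doc/>");
        }
        return ExpensiveSerializer.CONSTRUCTIONS.get() - before;
    }

    /** Reuses the shared serializer; returns how many were constructed. */
    static int loadWithSharedInstance(int documents) {
        int before = ExpensiveSerializer.CONSTRUCTIONS.get();
        for (int i = 0; i < documents; i++) {
            SharedSerializer.get().read("<doc/>");
        }
        return ExpensiveSerializer.CONSTRUCTIONS.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("fresh instances constructed:  " + loadWithFreshInstances(2015));
        System.out.println("shared instances constructed: " + loadWithSharedInstance(2015));
    }
}
```

Run on its own, this should report one construction per document for the first loop and at most one for the second. A single shared instance also gives one obvious place to apply security configuration, which is the point made in the comment above.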

johann-petrak commented 4 years ago

Yes, I had been thinking the same, especially since they apparently want to completely prevent use without the security handling in place from version 1.5. I think that for all our loading and saving purposes we do not even need to worry about multithreading, because (de)serialization should always happen outside of any duplication (off the top of my head I do not know about GCP handlers, though).

greenwoodma commented 3 years ago

@johann-petrak I've switched to using static instances of XStream which should fix this I hope. Do you still have the dataset you used last time to see if this has helped?

johann-petrak commented 3 years ago

Cool, I will try to dig this up or re-run a new benchmark with both versions to check this as soon as I find the time!

johann-petrak commented 3 years ago

OK I did re-run this, timing the loading of all documents:

I would call that fixed!
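For anyone wanting to repeat this kind of comparison, the following is a minimal sketch of timing a load loop. `LoadTimer` and `timeMillis` are hypothetical names, and the loader passed in `main` is a placeholder rather than a real GATE document load. A single wall-clock measurement like this is only a rough indicator, since it ignores JIT warm-up and garbage collection.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper for timing how long it takes to load a list of items. */
final class LoadTimer {
    private LoadTimer() {}

    /** Applies the loader to every item and returns the elapsed wall-clock time in milliseconds. */
    static <T> long timeMillis(List<T> items, Consumer<T> loader) {
        long start = System.nanoTime();
        for (T item : items) {
            loader.accept(item);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Placeholder file names; a real run would list the corpus directory.
        List<String> files = Arrays.asList("doc1.xml", "doc2.xml", "doc3.xml");
        // Placeholder loader; a real run would load each document here.
        long elapsed = timeMillis(files, name -> name.length());
        System.out.println("loaded " + files.size() + " items in " + elapsed + " ms");
    }
}
```

Running the same loop once per library version, each in a fresh JVM with identical heap settings, keeps the two measurements comparable.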