Closed johann-petrak closed 5 years ago
To be fair, while this might be a bug, we usually tell people that if they have more than a few tens of documents then they should be putting them into a datastore and not simply loading them all into a normal corpus. I'd certainly never recommend having more than 100 documents (even tiny ones) open at once. If nothing else it makes moving around the resource tree a nightmare.
By the way, @johann-petrak, is the corpus checked in somewhere in case I want to try to replicate the issue?
Yes, to reproduce:
So I can reproduce this, although on my machine it does eventually recover (after about four minutes) and all the documents are closed. The problem is that for each document a new thread is created to run the close action on the EDT, which also causes the creation of some other internal AWT classes etc. So not only do you have, in your example, the overhead of creating the objects associated with 12,543 documents (at least three instances per document), but the threads are created in a loop and dispatched to the EDT using SwingUtilities.invokeLater, meaning that they all get pushed onto the EDT in quick succession. I'm guessing that means normal drawing events etc. get delayed, which causes the GUI to appear locked.
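The queueing effect described above can be illustrated with a toy model. This is only a sketch: `EdtQueueSketch`, its `invokeLater` method, and the batch size of 100 are made up for the illustration. It stands in for the EDT with a plain FIFO queue and is not GATE code, nor necessarily how any fix in GATE works. It counts how many "close" tasks run before a "repaint" that was requested right after the close was triggered.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy stand-in for the EDT: a FIFO queue of tasks drained by a single thread.
// All names here are hypothetical; this is not Swing's or GATE's implementation.
class EdtQueueSketch {
    private final Queue<Runnable> queue = new ArrayDeque<>();
    private final List<String> log = new ArrayList<>();

    // Mimics the queueing behaviour of SwingUtilities.invokeLater (append to the end).
    void invokeLater(Runnable task) {
        queue.add(task);
    }

    // Runs queued tasks in order until the queue is empty.
    void drain() {
        for (Runnable r = queue.poll(); r != null; r = queue.poll()) {
            r.run();
        }
    }

    // Closes at most `batch` documents, then re-queues itself for the rest.
    private void closeBatch(int remaining, int batch) {
        int n = Math.min(remaining, batch);
        for (int i = 0; i < n; i++) {
            log.add("close");
        }
        int left = remaining - n;
        if (left > 0) {
            invokeLater(() -> closeBatch(left, batch));
        }
    }

    // One queued task per document, all queued before the repaint request.
    static int closesBeforeRepaintFlooded(int docs) {
        EdtQueueSketch edt = new EdtQueueSketch();
        for (int i = 0; i < docs; i++) {
            edt.invokeLater(() -> edt.log.add("close"));
        }
        edt.invokeLater(() -> edt.log.add("repaint"));
        edt.drain();
        return edt.log.indexOf("repaint");
    }

    // One self-rescheduling task; the repaint request slots in after the first batch.
    static int closesBeforeRepaintBatched(int docs, int batch) {
        EdtQueueSketch edt = new EdtQueueSketch();
        edt.invokeLater(() -> edt.closeBatch(docs, batch));
        edt.invokeLater(() -> edt.log.add("repaint"));
        edt.drain();
        return edt.log.indexOf("repaint");
    }

    public static void main(String[] args) {
        System.out.println(closesBeforeRepaintFlooded(12543));      // prints 12543
        System.out.println(closesBeforeRepaintBatched(12543, 100)); // prints 100
    }
}
```

In the flooded case the repaint waits behind every close task. In the batched case it waits behind only one batch, which is one way a long-running job can leave room for drawing events.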
I've got a couple of ideas for ways this might be changed, but in essence my original comment stands: don't open this many documents at once and expect the GUI to behave properly; it won't.
I do not really have strong feelings on this. I am just saying that modern computers easily have 16G of RAM, and loading a few tens of thousands of sentences should not be an issue: I am loading a few dozen million into Python/pandas straight into RAM. The other thing is that this is really the only way to avoid using datastores, and datastores are just terrible: whenever there is an exception in a pipeline, some document is damaged and the datastore essentially becomes unusable. And there is currently no standard alternative for directory-to-directory processing from the GATE GUI.
Yes, it's not a RAM issue; it's more an issue with the GUI and Swing. There is no problem doing what you describe from the API, where you can happily load the 12,543 docs into an in-memory corpus, but expecting Swing to handle that is a different matter entirely.
I've just pushed a fix which should allow the GUI to stay responsive, and shouldn't require quite such a large amount of RAM to close the docs. It still eats some RAM, as the Swing classes etc. are created as we close the docs and update the tree, which is CPU intensive, and so they aren't instantly GC'd. So it doesn't truly fix the problem, but it should hide it away to a certain extent.
@johann-petrak could you try it and check whether it seems more responsive to you?
Brilliant, yes, the GUI now stays interactive while it goes about removing the documents. Loading the corpus takes 80 seconds on my machine and closing those documents then takes 350 seconds.
This refers to GATE 8.5.1 and 8.6-SNAPSHOT.
I have populated a corpus with 12543 tiny documents (each just one sentence, with only token and sentence annotations). Populating the corpus takes a bit less than 3 minutes on my system.
But when I select all those documents and then choose "Close", the GUI becomes unresponsive. Nothing had changed after 30 minutes, at which point I killed the process.
When the Corpus view is active in the window, which initially shows the first page of the 12543 documents, this view gets cleared after just one or two minutes, but the GUI stays unresponsive and the list of documents in the resources tree stays unchanged, with the documents still selected.
When checking the heap memory for the process, the used memory slowly increases, and the GATE process uses more than 100% CPU on my dual-core desktop. When forcing a GC, there is a significant decrease in used heap memory after a short while, but after this the memory keeps increasing at the same rate, and a GC brings us back to exactly the same amount each time.
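For anyone wanting to repeat the heap observation from inside a JVM rather than with an external monitoring tool, here is a minimal sketch using the standard `java.lang.management` API. The class name `HeapSampler` and its methods are hypothetical and not part of GATE. Note that `System.gc()` is only a request, so the JVM may ignore it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper for sampling heap usage; not part of GATE.
class HeapSampler {

    // Returns the heap currently in use, in bytes, as reported by the JVM.
    static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    // Requests a garbage collection (a hint only) and samples the heap again.
    static long usedHeapAfterGcRequest() {
        System.gc();
        return usedHeapBytes();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        long after = usedHeapAfterGcRequest();
        System.out.printf("used before: %d KiB, after GC request: %d KiB%n",
                before / 1024, after / 1024);
    }
}
```

Sampling this periodically while the documents are being closed would show the same pattern described above: steady growth, then a drop back to a stable baseline after each collection.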