dkpro / dkpro-jwktl

Java Wiktionary Library
http://dkpro.org/dkpro-jwktl/
Apache License 2.0

Extreme memory consumption while iterating over multiple entries. #5

Closed chmeyer closed 9 years ago

chmeyer commented 9 years ago

Originally reported on Google Code with ID 5

Trying to iterate over a whole Wiktionary database, or even a small portion of it, fails
with an OutOfMemoryError (Java heap space). This occurs even when just writing some of
the data to the console (so there is no memory consumption besides the library usage).
It seems like there is some severe memory leak somewhere, either in JWKTL or in BerkeleyDB.

Of course, a quick fix would be to increase the memory available to the application
(using the VM options -Xms and -Xmx). But first of all, the OutOfMemoryError occurs
after iterating over only a small portion of a Wiktionary, which suggests an unreasonably
high memory requirement for iterating over a whole one. Secondly, it still looks like
some kind of memory leak, so it should be possible to iterate over the whole Wiktionary
without increasing the heap space. Or is there any possibility in JWKTL to clear cached
data while iterating?

What steps will reproduce the problem?
1. Here is an example test method that tries to extract all German example sentences
from the German and the English Wiktionary.

    @Test
    public void testGetAllExampleSentences() throws Exception {
        int counter = 0;
        // 'german' and 'english' point to the parsed Wiktionary dumps
        IWiktionary wkt = JWKTL.openCollection(german, english);
        IWiktionaryIterator<IWiktionaryEntry> allEntries = wkt.getAllEntries();
        for (IWiktionaryEntry entry : allEntries) {
            ILanguage language = entry.getWordLanguage();
            if (language != null && language.getName().equals("German")) {
                List<IWikiString> examples = entry.getExamples();
                for (IWikiString example : examples) {
                    String plainText = example.getPlainText();
                    System.out.println(plainText);
                    counter++;
                }
            }
        }
        wkt.close();
        System.out.println(counter);
    }

What is the expected output? What do you see instead?

After 3300 sentences, the output slows down extremely, and shortly afterwards the
OutOfMemoryError mentioned above is thrown.

What version of the product are you using? On what operating system?

I use the official JWKTL version 1.0.0 from the Central Maven Repository, included via
this dependency:
        <dependency>
            <groupId>de.tudarmstadt.ukp.jwktl</groupId>
            <artifactId>jwktl</artifactId>
            <version>1.0.0</version>
        </dependency>

Thank you for your help. Besides that, thank you very much for providing this library.
It helps tremendously! Please continue providing such great libraries and tools.
Best wishes,
Andreas

Reported by andreas.schulz.de on 2013-12-04 11:54:41

chmeyer commented 9 years ago
I don't know the exact cause. At least, I did similar stuff previously and didn't have
this problem. Try:

(1) query the English and German Wiktionary editions separately using 

IWiktionaryEdition edition = JWKTL.openEdition(...);

and check if the error is the same or if there's a problem in the collection code (haven't
used that very often recently).

(2) Set an explicit cache size, e.g., 

wiktionary = JWKTL.openEdition(wiktionaryPath, 500 * 1024 * 1024L);

Sometimes I have the impression that BerkeleyDB does strange things with the
available memory...
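
Roughly, combining (1) and (2) would look something like the following (untested sketch;
germanDumpDir is just a placeholder for the directory of the parsed German dump):

    // (1) + (2): open a single edition directly, with an explicit 500 MB BerkeleyDB cache
    IWiktionaryEdition edition = JWKTL.openEdition(germanDumpDir, 500 * 1024 * 1024L);
    try {
        for (IWiktionaryEntry entry : edition.getAllEntries()) {
            // same filtering and printing as in your test method
        }
    } finally {
        edition.close();
    }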

Reported by chmeyer.de on 2013-12-04 15:56:53

chmeyer commented 9 years ago
I tried both of your approaches in my specific scenario:

(1) Doesn't work: when I open two separate Wiktionary editions via JWKTL.openEdition(...)
instead of openCollection(...) but switch between them after each entry, the OutOfMemoryError
can still be provoked. Reading the editions in sequential order works, but doesn't fit my problem.

(2) With 500 MB or even 200 MB, the OutOfMemoryError still occurs. I was able to iterate
over the two Wiktionaries with an ideal value of 50 MB. I also tested 100 MB, 1 MB, and
even less; all of these cache sizes worked for me, but the 50 MB configuration runs fastest
in my current setup. [Does anyone have other experiences or more values to add?]

In short: the problem still exists, but it can be avoided by setting a sufficiently small
explicit cache size.
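
For reference, this is roughly how I set the cache size now (sketch only; germanDumpDir and
englishDumpDir are placeholders for my parsed dump directories):

    // 50 MB cache per edition; 200 MB and 500 MB still ran out of memory for me
    long cacheSize = 50 * 1024 * 1024L;
    IWiktionaryEdition germanEdition = JWKTL.openEdition(germanDumpDir, cacheSize);
    IWiktionaryEdition englishEdition = JWKTL.openEdition(englishDumpDir, cacheSize);
    // ... iterate over both editions (switching between them) as in my test method ...
    germanEdition.close();
    englishEdition.close();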

Reported by andreas.schulz.de on 2013-12-10 16:44:42