coder-sa opened this issue 6 years ago
Interesting discussion here: https://lists.apache.org/thread.html/a2faaab8199d6abad443beddcdc421c34253045c93e7869aa7ad449d@%3Cuser.tika.apache.org%3E
It seems that upgrading from Tika 1.16 to 1.18 is causing some memory leaks.
Interesting! I was using the FSCrawler 2.4 version for these tests - doesn't that come with Tika 1.16?
Thanks @coder-sa. I thought you tried 2.5-SNAPSHOT.
You're right 2.4 comes with Tika 1.16: https://github.com/dadoonet/fscrawler/blob/fscrawler-2.4/pom.xml#L28
I'm going to tell that to the Tika team.
BTW I'm not sure whether I changed anything in 2.5 in the meantime that would "fix" the problem, but would you have a chance to test it with 2.5?
Sure, I will run the same tests with 2.5, but it may take a while to run all the tests again. I will post the results as soon as I am done with testing.
@coder-sa Did you run new tests by any chance? If not, would you like to use 2.6-SNAPSHOT for that and update?
No, I haven't run any further tests. I will take a look at the 2.6-SNAPSHOT and let you know, but it may take some time before I can share results.
Thanks, Sachin
I changed the label of this issue: with 2.7, so many things have changed in the meantime that we would need to run the tests again.
Every time I set up a new environment, this memory problem appears after it has been running for a while.
[739094.503s][warning][gc,alloc] grizzly-http-server-12: Retried waiting for GCLocker too often allocating 1931533 words
15:48:08,106 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [44743ac843d5c95aa2bd05efeba7877f.pdf]: Java heap space
[739991.826s][warning][gc,alloc] grizzly-http-server-28: Retried waiting for GCLocker too often allocating 5355002 words
16:03:05,428 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [937ed647ed305dd33e25023261dd279a.pdf]: Java heap space
16:16:10,424 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [8b8a0e88081322524b568bb68014f43f.pdf]: Java heap space
[742082.148s][warning][gc,alloc] grizzly-http-server-5: Retried waiting for GCLocker too often allocating 5803787 words
16:37:55,750 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [de9b70c5feadb3e76eacceb319f8027b.pdf]: Java heap space
[742565.802s][warning][gc,alloc] grizzly-http-server-38: Retried waiting for GCLocker too often allocating 3128787 words
16:45:59,404 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [593afa293a246b1a0a8a9f8e490d9ae7.pdf]: Java heap space
[742920.529s][warning][gc,alloc] grizzly-http-server-47: Retried waiting for GCLocker too often allocating 3298157 words
16:51:54,135 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [13c100669462038857c000b53ced1332.pdf]: Java heap space
[744036.148s][warning][gc,alloc] grizzly-http-server-17: Retried waiting for GCLocker too often allocating 7486759 words
17:10:29,750 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [8b8a0e88081322524b568bb68014f43f.pdf]: Java heap space
My guess is that memory is not released after each file upload, causing the heap to fill up after a while. I am using the 2.10-SNAPSHOT version.
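For reference, a minimal sketch of the kind of upload loop that can surface this behaviour, assuming FSCrawler runs in REST mode with its default endpoint (http://127.0.0.1:8080/fscrawler) and that the test documents sit in a local ./docs folder (both are assumptions, not values from this issue):

```python
# Hedged reproduction sketch: repeatedly upload documents to FSCrawler's REST
# _upload endpoint while watching heap usage / GC logs on the FSCrawler side.
import os
import requests

FSCRAWLER_URL = "http://127.0.0.1:8080/fscrawler/_upload"  # assumed default REST endpoint
DOCS_DIR = "docs"  # hypothetical folder containing the test PDFs

for name in sorted(os.listdir(DOCS_DIR)):
    path = os.path.join(DOCS_DIR, name)
    if not os.path.isfile(path):
        continue
    with open(path, "rb") as f:
        # Each document is sent as a multipart upload, mirroring
        # curl -F "file=@doc.pdf" http://127.0.0.1:8080/fscrawler/_upload
        response = requests.post(FSCRAWLER_URL, files={"file": (name, f)})
    print(name, response.status_code)
```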
What value did you set for the memory? Could you share the very first lines of the fscrawler logs?
While performing sizing tests to check how big a file can be ingested, it was noticed that anything above a 10MB file size does not go through. Even when ingestion into Elasticsearch works fine for a 100MB file, searching within that file will either slow down the UI or cause timeout issues. Below are the file size test results.
This requires some optimization so that large files can be ingested without performance anomalies.
EDIT (by @dadoonet):
To generate such big files, you can use the following script:
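The original script is not included in this excerpt; as a stand-in, here is a minimal Python sketch that writes a plain-text file of a chosen size (the output paths and sizes below are arbitrary examples, not values from the issue):

```python
# Minimal sketch for generating large test files for the sizing test.
import os

def generate_big_file(path: str, size_mb: int) -> None:
    """Write roughly `size_mb` megabytes of repeated text to `path`."""
    line = ("The quick brown fox jumps over the lazy dog. " * 10 + "\n").encode("utf-8")
    target = size_mb * 1024 * 1024
    written = 0
    with open(path, "wb") as out:
        while written < target:
            out.write(line)
            written += len(line)
    print(f"{path}: {os.path.getsize(path)} bytes")

if __name__ == "__main__":
    # e.g. produce 10MB, 100MB and 300MB samples
    for mb in (10, 100, 300):
        generate_big_file(f"sample_{mb}mb.txt", mb)
```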
Source for this issue: https://discuss.elastic.co/t/default-value-http-content-content-length-does-not-restricts-ingestion-of-large-documents/138853/