dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Ingestion of single files larger than 10MB fails #566

Open coder-sa opened 6 years ago

coder-sa commented 6 years ago

While performing sizing tests to check how big a file can be ingested, we noticed that anything above a 10MB file size does not go through. Even when ingestion of a 100MB file into Elasticsearch succeeds, searching within that file either slows down the UI or times out. Below are the results of the file size tests.

| Test Case # | File Size | FSCrawler Java Heap | Result | Comments |
|---|---|---|---|---|
| 1 | 500MB | 512MB | Failed | Failed with "Out of Memory" |
| 2 | 500MB | 1GB | Failed | Failed with "Out of Memory" |
| 3 | 500MB | 1.5GB | Failed | Failed with "Out of Memory" |
| 4 | 500MB | 2GB | Failed | Failed with "Out of Memory" |
| 5 | 250MB | 512MB | Failed | Failed with "Out of Memory" |
| 6 | 250MB | 1GB | Failed | Failed with "Out of Memory" |
| 7 | 250MB | 1.5GB | Failed | Failed with "Out of Memory" |
| 8 | 250MB | 2GB | Failed | Failed with "Out of Memory" |
| 9 | 100MB | 512MB | Failed | Failed with "Out of Memory" |
| 10 | 100MB | 1GB | Failed | Failed with "Out of Memory" |
| 11 | 100MB | 1.5GB | Failed | Failed with "Out of Memory" |
| 12 | 100MB | 2GB | Success | Ingestion goes through, but it takes a couple of minutes before the content shows up in Kibana's Discover tab because all the content of this big file is pulled in. Searching within this big file is very slow and most likely times out after 30 seconds (30,000 ms). Searches against other documents in the same index are not affected. |
| 13 | 80MB | 512MB | Failed | Failed with "Out of Memory" |
| 14 | 80MB | 1GB | Failed | Failed with "Out of Memory" |
| 15 | 80MB | 1.5GB | Success | Same behavior as test case 12. |
| 16 | 80MB | 2GB | Success | Same behavior as test case 12. |
| 17 | 50MB | 512MB | Failed | Failed with "Out of Memory" |
| 18 | 50MB | 1GB | Success | Viewing via the Kibana UI (Discover) is still time consuming; it took a minute before anything showed up. |
| 19 | 25MB | 512MB | Success | Slowness observed when searching and when pulling content via the Discover tab in the Kibana UI (about 30-40 seconds). |
| 20 | 10MB | 512MB | Success | Slight slowness still observed; searching content in the file took about 15-20 seconds. |

This requires some optimization so that large files can be ingested without performance anomalies.
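
A related knob worth noting: FSCrawler's documentation describes an `fs.indexed_chars` setting that caps how much extracted text is indexed per file, which bounds both the memory used during extraction and the size of the indexed content field. The snippet below is only a minimal Tika-level sketch of what bounded extraction looks like, not FSCrawler's actual code; it assumes Apache Tika is on the classpath, and the 100,000-character limit is an arbitrary example value.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class BoundedExtraction {
    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        // Arbitrary cap: stop collecting text after 100,000 characters so a
        // single huge file cannot blow up the heap with extracted content.
        BodyContentHandler handler = new BodyContentHandler(100_000);
        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, new Metadata(), new ParseContext());
        } catch (SAXException e) {
            // With a write limit, Tika signals "limit reached" via a SAXException;
            // the text gathered so far is still available in the handler.
        }
        System.out.println("Extracted " + handler.toString().length() + " characters");
    }
}
```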

EDIT (by @dadoonet):

To generate such big files on Windows (cmd), you can use the following script, which appends the file's contents to itself 14 times:

echo "This is just a sample line appended to create a big file.. " > dummy.txt
for /L %i in (1,1,14) do type dummy.txt >> dummy.txt
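
If you are not on Windows, an equivalent can be written in plain Java. This is just a sketch under the assumption that any large text file will do for the sizing tests; the file name and target size below are arbitrary.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MakeDummyFile {
    public static void main(String[] args) throws IOException {
        Path target = Path.of("dummy.txt");        // output file name is arbitrary
        long targetBytes = 100L * 1024 * 1024;     // ~100MB; adjust per test case
        byte[] line = "This is just a sample line appended to create a big file.. \n"
                .getBytes(StandardCharsets.UTF_8);

        // Write the sample line repeatedly until the file reaches the target size.
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            long written = 0;
            while (written < targetBytes) {
                out.write(line);
                written += line.length;
            }
        }
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```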

Source for this issue: https://discuss.elastic.co/t/default-value-http-content-content-length-does-not-restricts-ingestion-of-large-documents/138853/

dadoonet commented 6 years ago

Interesting discussion here: https://lists.apache.org/thread.html/a2faaab8199d6abad443beddcdc421c34253045c93e7869aa7ad449d@%3Cuser.tika.apache.org%3E

It seems that upgrading from Tika 1.16 to 1.18 introduced some memory leaks.

coder-sa commented 6 years ago

Interesting! I was using FSCrawler 2.4 for these tests - doesn't that come with Tika 1.16?

dadoonet commented 6 years ago

Thanks @coder-sa. I thought you tried 2.5-SNAPSHOT.

You're right, 2.4 comes with Tika 1.16: https://github.com/dadoonet/fscrawler/blob/fscrawler-2.4/pom.xml#L28

I'm going to tell that to the Tika team.

BTW, I'm not sure whether anything I changed in the meantime for 2.5 would "fix" the problem, but would you have a chance to test it with 2.5?

coder-sa commented 6 years ago

Sure, I will run the same tests with 2.5, but it may take a while to run all the tests again. I will post the results as soon as I am done with testing.

dadoonet commented 5 years ago

@coder-sa Did you run new tests by any chance? If not, would you like to use 2.6-SNAPSHOT for that and update?

coder-sa commented 5 years ago

No, I haven't run any further tests. I will take a look at the 2.6-SNAPSHOT and let you know, but it may take some time before I can share results.

Thanks Sachin

dadoonet commented 3 years ago

I changed the label of this issue: with 2.7, so many things have changed in the meantime that we would need to run the tests again.

muxiaobai commented 4 months ago

Every time I run it in a new environment, this memory problem appears after it has been running for a while.

[739094.503s][warning][gc,alloc] grizzly-http-server-12: Retried waiting for GCLocker too often allocating 1931533 words
15:48:08,106 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [44743ac843d5c95aa2bd05efeba7877f.pdf]: Java heap space
[739991.826s][warning][gc,alloc] grizzly-http-server-28: Retried waiting for GCLocker too often allocating 5355002 words
16:03:05,428 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [937ed647ed305dd33e25023261dd279a.pdf]: Java heap space
16:16:10,424 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [8b8a0e88081322524b568bb68014f43f.pdf]: Java heap space
[742082.148s][warning][gc,alloc] grizzly-http-server-5: Retried waiting for GCLocker too often allocating 5803787 words
16:37:55,750 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [de9b70c5feadb3e76eacceb319f8027b.pdf]: Java heap space
[742565.802s][warning][gc,alloc] grizzly-http-server-38: Retried waiting for GCLocker too often allocating 3128787 words
16:45:59,404 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [593afa293a246b1a0a8a9f8e490d9ae7.pdf]: Java heap space
[742920.529s][warning][gc,alloc] grizzly-http-server-47: Retried waiting for GCLocker too often allocating 3298157 words
16:51:54,135 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [13c100669462038857c000b53ced1332.pdf]: Java heap space
[744036.148s][warning][gc,alloc] grizzly-http-server-17: Retried waiting for GCLocker too often allocating 7486759 words
17:10:29,750 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [10000] characters of text for [8b8a0e88081322524b568bb68014f43f.pdf]: Java heap space

My guess is that memory is not released after each file is uploaded, so the heap fills up after a while.
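
One way to check that guess (purely a diagnostic sketch, not something FSCrawler provides) is to sample the used heap between files and see whether it keeps climbing even after a GC hint:

```java
// Diagnostic sketch: sample used heap between file parses to see whether it
// keeps growing. System.gc() is only a hint to the JVM, so treat the numbers
// as a rough trend rather than an exact measurement.
public final class HeapProbe {

    public static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // best-effort hint before sampling
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Used heap: " + usedHeapMb() + " MB");
    }
}
```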

muxiaobai commented 4 months ago

This is with the 2.10-SNAPSHOT version.

dadoonet commented 4 months ago

What value did you set for the memory? Could you share the very first lines of the fscrawler logs?