KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0

Memory Leak on extracting text from files #138

Open raghav-axero opened 4 years ago

raghav-axero commented 4 years ago

We noticed that every time we extract text with TikaOnDotNet, memory usage grows and is not released after the extraction has finished.

The code is as simple as the one in your samples:

new TextExtractor().Extract(filePhysicalPath);

We are already using the latest DLLs:

TikaOnDotNet.dll (version 1.17.1.0)

TikaOnDotNet.TextExtraction.dll (version 1.17)

IIS version we are testing on: 10

Target Framework: 4.7.2

Memory leak detection by ANTS Profiler:

The first snapshot is the baseline, taken before any extraction was started. The second was taken after the extraction had completed.

The second snapshot confirms that memory usage increased and stayed at that level even after the extraction had completed.

[Screenshots: ANTS Profiler memory snapshots before and after extraction]

The screenshots above show that live "LinkedHashMap + Entry" objects from "java.util" are still in memory even after the extraction has completed.

Here is the PDF (200 MB) that you can use to reproduce the test: https://drive.google.com/file/d/1DWdWfkHebS9aLpqiLAbaRwwiSamGw8Ym/view?usp=sharing

EDIT:

If I run the following code before and after the Tika extraction, memory usage returns to normal levels:

    // Force a full, blocking GC to reclaim the memory left behind by Tika
    GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
    GC.WaitForPendingFinalizers();
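
For anyone who wants to try the same workaround, here is a minimal sketch of how the two GC calls could be wrapped around the extraction. The `GcScope` class and its `RunWithForcedGc` method are hypothetical names made up for this example and are not part of TikaOnDotNet; only `GC.Collect` and `GC.WaitForPendingFinalizers` are real .NET calls. This is a workaround under those assumptions, not a fix for the underlying retention.

```csharp
using System;

// Hypothetical helper, not part of TikaOnDotNet.
public static class GcScope
{
    // Runs the given work with a forced full collection before and after it.
    // The "after" collection runs even if the work throws.
    public static T RunWithForcedGc<T>(Func<T> work)
    {
        ForceFullCollection();
        try
        {
            return work();
        }
        finally
        {
            ForceFullCollection();
        }
    }

    private static void ForceFullCollection()
    {
        // Same two calls as in the snippet above.
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
        GC.WaitForPendingFinalizers();
    }
}
```

With this in place the call from the top of this issue would become `GcScope.RunWithForcedGc(() => new TextExtractor().Extract(filePhysicalPath));`. Forcing a full collection blocks the calling thread, so on a busy IIS site it may be worth measuring the cost before leaving it in.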