cantaloupe-project / cantaloupe

High-performance dynamic image server in Java
https://cantaloupe-project.github.io/
Other
265 stars 107 forks source link

PdfBox Processor memory issue #198

Open FPaquette opened 6 years ago

FPaquette commented 6 years ago

Hi,

I’m having issues with some of our PDF using PdfBox. For reasons that I’m trying to figure out, they are taking quite a lot of memory, some directly from the server itself. This has for effect that after multiple requests to Cantaloupe, I’m getting a ‘out of memory’. Either the OOM killer is triggered by the server or Cantaloupe just crash. The only way to release the memory to the server is to restart the application server.

Here’s an example of a PDF very problematic.

https://www.dropbox.com/s/er4oblu57bigzhk/97521.pdf?dl=0

Looking closer what is going on memory wise, I’m seeing those PDF filled up the young generation. So I added more memory to the server, the JVM can use up to 30 G, which help but didn’t solve the problem. I then started tuning the garbage collector (the G1 seems to be the better choice for the situation), which also helped but the problem is still there.

I was wondering if you have seen a similar problem before?

Thanks

adolski commented 6 years ago

Hi @FrancoisPaquette,

When I try that PDF using PdfBoxProcessor in 3.4.1, all appears normal (although this is in standalone mode with Java 9):

http://localhost:8182/iiif/2/github-198%2F97521.pdf/full/full/0/default.jpg

11:02:32.838 [main] INFO  e.i.l.c.ApplicationContextListener - Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 9.0.1 / mixed mode
11:02:32.842 [main] INFO  e.i.l.c.ApplicationContextListener - Java home: /Library/Java/JavaVirtualMachines/jdk-9.0.1.jdk/Contents/Home
11:02:32.844 [main] INFO  e.i.l.c.ApplicationContextListener - 8 available processor cores
11:02:32.844 [main] INFO  e.i.l.c.ApplicationContextListener - Heap total: 256MB; max: 4096MB
11:02:32.848 [main] INFO  e.i.l.c.ApplicationContextListener - Effective temp directory: /Users/alexd/NoBackup/Data/cantaloupe/tmp
11:02:32.850 [main] INFO  e.i.l.c.ApplicationContextListener - 🍈 Starting Cantaloupe 3.4.1
11:02:32.863 [main] DEBUG e.i.l.c.ApplicationContextListener - The delegate script is disabled (delegate_script.enabled = false)

,,,

11:03:06.685 [qtp1277009227-18] INFO  e.i.l.c.p.ProcessorConnector - FileResolver -> FileProcessor connection between FilesystemResolver and PdfBoxProcessor
11:03:06.909 [qtp1277009227-18] DEBUG e.i.l.c.c.InfoService - readInfo(): read github-198/97521.pdf from PdfBoxProcessor in 223 msec
11:03:06.910 [qtp1277009227-18] DEBUG e.i.l.c.c.InfoService - getOrReadInfo(): read info of github-198/97521.pdf from PdfBoxProcessor in 224 msec
11:03:06.912 [cl-lp-0095140562] DEBUG e.i.l.c.c.InfoService - putInObjectCache(): ObjectCache is disabled; doing nothing
11:03:06.914 [qtp1277009227-18] DEBUG e.i.l.c.a.Authorizer - Delegate script is disabled; allowing.
11:03:06.918 [qtp1277009227-18] DEBUG e.i.l.c.o.OperationList - applyNonEndpointMutations(): redactions are disabled; skipping.
11:03:06.923 [qtp1277009227-18] DEBUG e.i.l.c.o.OperationList - applyNonEndpointMutations(): delegate script is disabled; skipping overlay.
11:03:06.928 [qtp1277009227-18] DEBUG e.i.l.c.r.AbstractResource - Public identifier requested: github-198%2F97521.pdf -> decoded: github-198/97521.pdf -> slashes substituted: github-198/97521.pdf | X-IIIF-ID header: null
11:03:06.929 [qtp1277009227-18] DEBUG e.i.l.c.r.AbstractResource - Base URI assembled from request: http://localhost:8182
11:03:06.945 [qtp1277009227-18] DEBUG e.i.l.c.r.ImageRepresentation - Derivative cache not available; writing directly to the response
11:03:06.948 [qtp1277009227-18] DEBUG e.i.l.c.p.PdfBoxProcessor - DPI: 150.0
11:03:08.025 [qtp1277009227-18] DEBUG e.i.l.c.p.i.JPEGImageWriter - ImageIO plugin preferences: none
11:03:08.026 [qtp1277009227-18] DEBUG e.i.l.c.p.i.JPEGImageWriter - Using com.sun.imageio.plugins.jpeg.JPEGImageWriter
11:03:08.027 [qtp1277009227-18] DEBUG e.i.l.c.p.i.JPEGImageWriter - Quality: 80; progressive: true
11:03:08.310 [qtp1277009227-18] DEBUG e.i.l.c.r.ImageRepresentation - PdfBoxProcessor processed in 1364 msec: github-198/97521.pdf_encode:jpg_UNDEFINED_80_interlace_#000000_null.jpg
FPaquette commented 6 years ago

There is indeed a difference using the standalone mode, it uses less memory and it looks like fewer processes are involved also.

If I request the images through the web browser it works fine. What I forgot to tell is that I’m getting the memory issue when I send the requests with a small program made to pre-generate thumbnails. Is there maybe something to consider when sending request to Cantaloupe?

Also, out of curiosity, I tried using ImageMagick and it’s not better. For some reason, it looks like the connexion closes when I start getting the response.

Thanks

adolski commented 6 years ago

I wonder if the pre-generation program is failing to close its connections? Especially if other processors are having the same issue, and the same requests work in a web browser.

I'm only guessing, but if connections don't get closed, it's possible that stuff doesn't get garbage-collected...

FPaquette commented 6 years ago

I tested the pre-generation using Jmeter with the same PDF files and I'm seeing the same issue although it takes longer before it happens. I’m not seeing any connections left open in Jmeter.

About the web browser, I wasn't seeing clearly last Friday ; the issue is there also. Sorry about that.

Thanks

adolski commented 6 years ago

I created a few hundred PDFs, about 1.5MB each, and wrote a simple script to request them all in sequence:

#!/bin/bash
for i in {1..500}
do
  curl http://localhost:8182/iiif/2/page-$i.pdf/full/full/0/default.jpg > /dev/null
done

Here's what 10 minutes of jconsole monitoring looks like for me with Cantaloupe 3.4.1, PdfBoxProcessor, JDK 9.0.1, standalone mode, processor.dpi = 150, -Xmx=512m:

results

Looks pretty normal to me. I also tried with copies of the PDF you shared with similar results.

One thing that could cause different results is if there were a large number parallel requests... but, to fill up a 30GB heap would require a huge number of parallel requests, and/or some huge images.

FPaquette commented 6 years ago

Hmmm. I'm not running the script from the server itself but from my computer. I'll give it a try from the server and see what happens, I'll get back to you.

Thanks a lot

FPaquette commented 6 years ago

Not much difference running it from the server as I was expecting. I was able to crash it in about 45 minutes. If you look at the column 'Err', you'll see the number of errors growing.

image

You can see how much memory was used.

image

That test was done using the version 3.4 in a Tomcat with JDK 8.0.91

We are getting those PDF from many different government agencies so we have not control over them and how they are created. There is probably something weird in some of them causing the issue.

If you feel like it, I could you send you a bunch of them? If not, that's totally fine :)

Thanks

adolski commented 6 years ago

Yes, send me some PDFs!

Also I'm interested in the script you're using, if it's different enough from my little bash script...

FPaquette commented 6 years ago

Just to let know that I'll bet out of the office next week, I'll get back to in 2 weeks.

FPaquette commented 6 years ago

Hi,

In the zip file you will have 100 PDF, I could send you more if needed. But I think it should be enough if you do not turn on the cache and use them in a loop until to see the issue. I needed about 800 the last time.

https://www.dropbox.com/s/f9ehokdakqt842o/PDF.rar?dl=0

Also, you will see the file ‘thumbnail.rb, which is the script that I have made to send the request to the server to generate the thumbnails (it was done using the version 2.4.2 of Ruby). Feel free to modify or improve as you wish but basically you will only have to change the name of my server by yours.

I do not know if this script could be useful to other people but I would gladly share it if it is. If you think it could be, you could place it on your Github. It’s up to you.

I also have a Jmeter test file that I created to test rapidly your hypothesis about the connection that I could send you.

You will also see the file ‘ids.txt’ which contains the IDs of all the PDF. The script will read those IDs and use them to send the request to the server.

That’s about it, let me know if you have any question.

Thanks!

adolski commented 6 years ago

Sorry for the delay -- I think there are a couple of things going on here:

  1. Some of the PDFs don't load at all, or if they do, they take many minutes:

    • A334vu0TFjpSVeK4_NgZNw
    • jkyvWTxIhsUU8kdjUxkn7g
    • 6Z6waZdFhzMnStw-iesLLw
    • DHgbqXNR6DTZh_XD1qWDQQ

    And many more. PDFBox is using lots of CPU & memory trying to load them, though. Likely a PDFBox issue, because other readers I've tried can read them OK.

  2. In the thumbnails.rb script, Net::HTTP is timing out. I don't know why, but in order to get it to respect the read timeout, I had to change this:

    Net::HTTP.start(uri.host, uri.port, :read_timeout => 30) do |http|
      response = Net::HTTP.get_response(uri)
      return response
    end

    to this:

    http = Net::HTTP.new(uri.host, uri.port)
    http.read_timeout = 9999
    response = http.get(uri.path)
    return response

    The problem with a short read timeout is that when the client disconnects while the server is still processing, PDFBox doesn't actually stop loading, so the server can get overwhelmed with concurrent requests. (Note that this applies only to PdfBoxProcessor; all others will stop.) I don't think the PDFBox API even offers a way to cancel a load.

    But if you increase the read timeout, then the script will hang when it encounters one of the PDFs above that PDFBox can't load. So, you are out of luck either way.

Can I share one of these non-loading PDFs with the PDFBox developers?

FPaquette commented 6 years ago

No worries, thanks for looking into it.

Sure, you can share a PDF with the PDFBOx developers. Let me know the Jira number, I could maybe help out.

Regarding the read timeout, I was in fact using it to keep the connection open. Keeping it open for 300 seconds was enough to be able to process most of the PDFs. I didn't look at the value in the library but I think it's around 60 seconds or so. I think I just remove a 0, thus the 30 you are seeing, because I didn't want to keep it open longer anymore, it was to toubleshoot.

I may be wrong but I did a few tests with ImageMagick and I believe the problem was the same. For some reason the processor is not working anymore on my test server, I'll looking into it (I had a bit of problem making it work) and test it again.

FPaquette commented 6 years ago

I did a few tests to see if your last commit has any positive effect and I'm not seeing any difference, just to let you know.

ImageMagick is definitely working a lot better as it indeed stop loading if the connexion closes as you were saying. Some of the PDF are detected as PNG by IM, it probably doesn't help PDFBox.

adolski commented 6 years ago

Sorry, this fell off my radar. Here is my support request: https://lists.apache.org/thread.html/1f05908dba47cbb2b60c45ec3e433855e073394cabdd85b9c4b6bcac@%3Cusers.pdfbox.apache.org%3E

FPaquette commented 6 years ago

No worries. Let me know if you need more PDFs.

cmhdave commented 4 years ago

Not sure if it's related but we are seeing issues with PDFs as well. Not nearly at this level but we do see blocking threads so I dug a little deeper. It looks as though PDFBox is calling a synchronized method that is blocking our processing. We have one PDF page that we're trying to render that takes 6 seconds. For 4 of those seconds it's blocked. Here is the thread dump of the block:

org.apache.pdfbox.pdmodel.graphics.color.PDDeviceCMYK.toRGBImageAWT(PDDeviceCMYK.java:178)
java.awt.color.ICC_ColorSpace.toCIEXYZ(ICC_ColorSpace.java:387)
sun.java2d.cmm.lcms.LCMSTransform.colorConvert(LCMSTransform.java:624)
sun.java2d.cmm.lcms.LCMSTransform.doTransform(LCMSTransform.java:135)

The LCMSTransform.doTransform is the synchronized method.

This is with this set in PDFBoxProcessor

System.setProperty("org.apache.pdfbox.rendering.UsePureJavaCMYKConversion", "true");

I found a StackOverflow article that addresses a very similar issue: https://stackoverflow.com/a/35264713/1258283

I'm still looking but I can see in our thread dump for a single PDF image PDFBox ends up calling that synchronized method 7 different times, several of which become blocked.