TomRoush / PdfBox-Android

The Apache PdfBox project ported to work on Android
Apache License 2.0

PDF document load using PDDocument.load takes longer time #257

Open huntext17 opened 3 years ago

huntext17 commented 3 years ago

Hello there,

I am using PdfBox in my Android app to extract text from PDF documents, and it works really well for most documents, especially when they are small and don't have a lot of content. However, the PDDocument.load method takes 2 to 4 minutes for PDF files of 10 MB or more, even on a device with a reasonably good hardware configuration such as a Note 10, which has 8 GB of RAM and an octa-core processor. On devices with less memory and a slower processor, it simply fails.

This is my code snippet (attached file: TCS Second Interim Dividend for 2009-10 - Copy (10).pdf):

document = PDDocument.load(file);
PDFTextStripper pdfStripper = new PDFTextStripper();
text = pdfStripper.getText(document);

I have attached the PDF that is failing.

This functionality is critical, and one client is currently waiting for the fix. Can someone please help me with this?

huntext17 commented 3 years ago

I am also seeing this in the logs. It goes into a loop, runs out of memory, and finally the app crashes. Please let me know if I am doing something wrong here. Any help is greatly appreciated!

tpdf D/PdfBox-Android: parsed=COSObject{265261, 0}
2021-03-27 23:32:05.212 28838-28838/com.msystems.testpdf D/PdfBox-Android: parsed=COSObject{265262, 0}
2021-03-27 23:32:05.212 28838-28838/com.msystems.testpdf D/PdfBox-Android: parsed=COSObject{265263, 0}
2021-03-27 23:32:05.212 28838-28838/com.msystems.testpdf D/PdfBox-Android: parsed=COSObject{265264, 0}
2021-03-27 23:32:05.212 28838-28838/com.msystems.testpdf I/systems.testpd: Waiting for a blocking GC Alloc
2021-03-27 23:32:05.251 28838-28838/com.msystems.testpdf I/systems.testpd: WaitForGcToComplete blocked Alloc on HeapTrim for 38.534ms

TomRoush commented 3 years ago

There are a lot of pages in this PDF, which is why it is using so much memory. I was able to reproduce the issue on a lower-end device. I will see if there are any memory leaks that can be closed or ways to make the memory usage more efficient, but if possible, the quickest solution would be to use a PDF with fewer pages.

huntext17 commented 3 years ago

Thank you so much, @TomRoush, for taking this up. I really appreciate it! Please let me know if I can be involved or contribute in any way; I would be more than happy to do that. Unfortunately, we can't know which PDFs the end users will use, so we have to support all PDFs.

TomRoush commented 3 years ago

Sharing problem PDFs is always appreciated, as long as you have permission to do so. If you're able to narrow down what's using memory, that would also be helpful.
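One rough way to narrow down which step uses the memory is to record the used heap and elapsed time around each call. The following is only a minimal sketch using the standard `Runtime` API; `HeapProbe`, `usedHeapBytes`, and `measure` are hypothetical names that are not part of PdfBox-Android, and the reported delta is approximate because the garbage collector can run at any point.

```java
/** Hypothetical helper (not part of PdfBox-Android) for rough heap measurements. */
final class HeapProbe {

    /** Approximate heap currently in use, in bytes (allocated heap minus its free part). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Runs the given step, prints how long it took and how much the used heap
     * changed, and returns the change in bytes. The value can be negative if a
     * garbage collection happens while the step is running.
     */
    static long measure(String label, Runnable step) {
        long before = usedHeapBytes();
        long start = System.nanoTime();
        step.run();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long delta = usedHeapBytes() - before;
        System.out.println(label + ": " + elapsedMs + " ms, heap delta ~" + (delta / 1024) + " KiB");
        return delta;
    }
}
```

Wrapping the load call and the text extraction call in separate `measure` steps would give a first indication of which of the two dominates. A heap dump from a profiler would be more precise than this.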

Apsalogics commented 2 years ago

Using BackgroundTask solves this problem, because the loading is done in the background. I am using this to solve my problem.
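The comment above does not show code, so the following is only a sketch of the general idea, written with the standard `java.util.concurrent` classes and not with any specific Android API. `BackgroundExtractor` and `extractAsync` are hypothetical names; the supplier passed in is where the `PDDocument.load` and `PDFTextStripper.getText` calls from the original snippet would go. Note that moving the work off the main thread keeps the UI responsive, but it does not by itself reduce the parsing time or the memory the document needs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Hypothetical helper that runs a slow extraction job off the calling thread. */
final class BackgroundExtractor {

    // A single worker thread, so only one large document is processed at a time.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    /**
     * Starts the job on the worker thread and returns immediately.
     * The returned future completes with the job's result, or exceptionally
     * if the job throws.
     */
    CompletableFuture<String> extractAsync(Supplier<String> job) {
        return CompletableFuture.supplyAsync(job, executor);
    }

    /** Stops accepting new jobs; an already submitted job still runs to completion. */
    void shutdown() {
        executor.shutdown();
    }
}
```

A caller would typically attach a callback, for example with `thenAccept`, and avoid blocking on the result from the UI thread. Delivering the result back to the UI thread is platform-specific and is left out of this sketch.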