bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

Pass PDF document for extraction in RAM #2

Closed wwaites closed 5 years ago

lpla commented 5 years ago

As far as I can tell from the current master code, pdf-extract cannot yet take PDFs that are held in RAM (without writing them to disk). To support that, whether through a future daemon mode or by making pdf-extract instantiable from other programs or VMs (like JPype), this function (and its batch-file counterpart of the same name, for consistency's sake) should accept File objects instead of path String objects as input parameters.

This change would let Bitextor pass the WARC payloads (content) detected as PDF to the extract function as Java File objects. And since JPype starts a VM once and keeps it up, this should be far more efficient than starting a VM every time you want to convert a PDF, which is what we currently do in Bitextor because of the way WARCs are processed.

dionwiggins commented 5 years ago

Hi, I just talked to Philipp about this request. We will look into setting it up as a web service. There is quite a bit of startup overhead, so we want to load the objects once and keep them in RAM with a pool of threads. Then processing can start quickly without the load overhead. I will discuss this with Pon and work out how best to put it together for performance. But the basic premise is that you can POST a PDF and it will return clean HTML as the response.

wwaites commented 5 years ago

I think a web service would be a great thing to have. A good way to build one would be to first have a Java API that takes File or Stream objects. That way @lpla can also call it directly from Python (repeatedly, amortising the startup overhead), which would be a better fit for the HPC workflow.
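
A minimal sketch of that layering, assuming a hypothetical `PdfExtractor` interface (the names here are illustrative, not pdf-extract's actual API). The stream-based method is the primitive; the file-based one only opens streams and delegates. The placeholder implementation reports the input size where the real conversion would run.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical interface, for illustration only; not pdf-extract's real API.
interface PdfExtractor {
    // Primitive operation: read PDF bytes from `pdf`, write HTML to `html`.
    void extract(InputStream pdf, OutputStream html) throws IOException;

    // Convenience overload for callers that do have files on disk.
    default void extract(Path pdfFile, Path htmlFile) throws IOException {
        try (InputStream in = Files.newInputStream(pdfFile);
             OutputStream out = Files.newOutputStream(htmlFile)) {
            extract(in, out);
        }
    }
}

// Placeholder implementation: reports the input size instead of parsing the PDF.
class ByteCountingExtractor implements PdfExtractor {
    @Override
    public void extract(InputStream pdf, OutputStream html) throws IOException {
        int size = pdf.readAllBytes().length;  // Java 9+
        String page = "<html><body>" + size + " bytes received</body></html>";
        html.write(page.getBytes(StandardCharsets.UTF_8));
    }
}

class InMemoryDemo {
    // Converts a PDF held in RAM without touching the disk.
    static String convert(PdfExtractor extractor, byte[] pdfBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        extractor.extract(new ByteArrayInputStream(pdfBytes), out);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] fakePdf = "%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII);
        System.out.println(convert(new ByteCountingExtractor(), fakePdf));
    }
}
```

With this shape a web service handler and a JPype caller would both go through the same stream-based method, so the JVM and any loaded models are started once and reused.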

dionwiggins commented 5 years ago

Yes. That is the exact intent. It will take in a file object and return a JSON response with the clean HTML. We will be writing the spec over the next day or so for the developer and will then add this implementation. We will likely set up a thread pool that can be configured in size for concurrent processing. The challenge will be to make sure the server it is on does not run out of RAM. If the PDFs submitted have large images, that may cause some issues. While the tool will work fine, my concern is running out of RAM. So we may need to put some sort of safeguard that monitors RAM before accepting new requests. For PDFs with text embedded, this is unlikely to be an issue, but some of the big PDFs that I have seen can be huge due to images. One possible solution is that on the calling side, images could be removed from the PDF before submitting. We can look at a function for that also.
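
As a rough illustration of the safeguard idea, here is a sketch of a fixed-size pool that refuses new jobs when heap headroom is low. `GuardedPool` and its threshold parameter are hypothetical names, and the check only looks at the JVM heap (not total system RAM), so treat it as an approximation rather than a guarantee.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical sketch: a fixed-size worker pool that refuses new jobs
// when the JVM's remaining heap headroom is below a configured threshold.
class GuardedPool implements AutoCloseable {
    private final ExecutorService workers;
    private final long minFreeBytes;

    GuardedPool(int threads, long minFreeBytes) {
        this.workers = Executors.newFixedThreadPool(threads);
        this.minFreeBytes = minFreeBytes;
    }

    // Heap the JVM could still hand out: max heap minus what is currently in use.
    static long availableHeap() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    // Queues the job, or rejects it up front if memory is too tight.
    <T> Future<T> submit(Callable<T> job) {
        if (availableHeap() < minFreeBytes) {
            throw new RejectedExecutionException("low memory, retry later");
        }
        return workers.submit(job);
    }

    // Stops accepting jobs; already queued jobs still run to completion.
    @Override
    public void close() {
        workers.shutdown();
    }
}
```

A web service built on this could map the rejection to an HTTP 503 so that clients back off and retry. Because the check happens at submission time, a large PDF can still exhaust memory later while it is being processed.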

Something like the below could run on the client before the call and then the documents would be much lighter. Other processes for PDF may want to consider running the same to reduce memory and processing overhead also.

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import java.io.IOException;

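public class Main {
    // Written against the PDFBox 1.x API; PDFBox 2.x changed these calls.
    public static void main(String[] argv) throws COSVisitorException, InvalidPasswordException, CryptographyException, IOException {
        PDDocument document = PDDocument.load("input.pdf");
        try {
            // Encrypted PDFs must be decrypted (here with an empty password) before editing.
            if (document.isEncrypted()) {
                document.decrypt("");
            }

            // Clear the images registered in each page's resources.
            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (Object pageObj : catalog.getAllPages()) {
                PDPage page = (PDPage) pageObj;
                PDResources resources = page.findResources();
                if (resources != null) {
                    resources.getImages().clear();
                }
            }

            document.save("strippedOfImages.pdf");
        } finally {
            // Release the document's resources even if something above throws.
            document.close();
        }
    }
}
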
wwaites commented 5 years ago

Excellent. And removing the images is a good idea: I noticed the problem when I passed my PhD thesis through it (which also seems to trigger a bug, #3). But why return HTML wrapped in JSON rather than just HTML?

dionwiggins commented 5 years ago

We could return raw HTML, but I was thinking of returning more information than just the HTML, for example the success status and perhaps some additional performance metadata. If JSON is returned, it can carry the HTML in one of its fields and still leave room for more details. If the response is HTML only, it is limited to the HTML body.
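
To make the trade-off concrete, here is a sketch of such an envelope. The class and the field names (`success`, `elapsedMs`, `html`) are assumptions for illustration, not a published spec, and a real service would more likely use a JSON library than hand-rolled escaping.

```java
// Hypothetical response envelope; the field names are illustrative assumptions.
class ExtractResponse {
    final boolean success;
    final long elapsedMs;
    final String html;

    ExtractResponse(boolean success, long elapsedMs, String html) {
        this.success = success;
        this.elapsedMs = elapsedMs;
        this.html = html;
    }

    // Escapes a string so it can sit inside a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become 4-digit hex escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Serialises the status, the timing metadata, and the HTML payload.
    String toJson() {
        return "{\"success\":" + success
             + ",\"elapsedMs\":" + elapsedMs
             + ",\"html\":\"" + escape(html) + "\"}";
    }
}
```

The cost of the envelope is that every quote and newline in the HTML has to be escaped and then unescaped by the client. A plain HTML body avoids that, but status and timing would then have to travel in HTTP status codes and headers.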

dionwiggins commented 5 years ago

We are implementing the above image removal code with stream or file input/output so that you can call it in the client-side workflow to remove images early. This will reduce memory use in all related processes and make transmission between the client and the PDFExtract web service much faster. We are also looking at what other useless content, such as watermarks, might be stripped, although we may just do that on the web service side.

dionwiggins commented 5 years ago

Resolved. Streaming is now supported. We have it streaming in a Java wrapper test app and are trying to test with JPype, but are not so familiar with Python. Once working, will add sample code for JPype. But the streaming version has now been committed. See docs for details.

lpla commented 5 years ago

I will start working on it. I am working on a Python3 wrapper: https://github.com/bitextor/python-pdfextract

dionwiggins commented 5 years ago

Excellent. Thanks Leo.

Regards,

Dion Wiggins Founder and CTO Omniscien Technologies

lpla commented 5 years ago

Would it be possible to replace the "InputStream" class from the "extract" function with a "ByteArrayInputStream"? Looks like I can't make JPype work without this change because JPype can't extend classes, so I can't use/extend a ByteArrayInputStream (what I have in memory in Python) as an InputStream in JVM: https://jpype.readthedocs.io/en/latest/quickguide.html#implements-and-extension

lpla commented 5 years ago

Also, I guess the OutputStream parameter would need to become a ByteArrayOutputStream as well.
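
On the Java side, the request amounts to something like the sketch below: overloads with the concrete in-memory stream types in the signature, delegating to the general stream method. `StreamingExtractor` is a hypothetical class, not the code committed to pdf-extract, and the placeholder body just copies bytes where the real conversion would run.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical class for illustration; not the code committed to pdf-extract.
class StreamingExtractor {
    // General stream-based entry point. Placeholder body: copies input to
    // output unchanged where the real PDF-to-HTML conversion would run.
    void extract(InputStream pdf, OutputStream html) throws IOException {
        pdf.transferTo(html);  // Java 9+
    }

    // Same operation with concrete stream types in the signature, so a
    // caller can pass instances it constructed itself.
    void extract(ByteArrayInputStream pdf, ByteArrayOutputStream html) {
        try {
            // The casts select the general overload above.
            extract((InputStream) pdf, (OutputStream) html);
        } catch (IOException e) {
            // Not expected for in-memory streams; rethrown unchecked just in case.
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: bytes in, bytes out.
    byte[] extract(byte[] pdfBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        extract(new ByteArrayInputStream(pdfBytes), out);
        return out.toByteArray();
    }
}
```

Keeping the general `InputStream`/`OutputStream` method alongside the new overloads means Java callers that stream from files or sockets are not affected by the change.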

dionwiggins commented 5 years ago

Noted. We will make the adjustments ASAP.

dionwiggins commented 5 years ago

Change made as requested. Please try again.

lpla commented 5 years ago

python-pdfextract ready and integrated in bitextor. Doing some tests right now to check performance. Thanks!

dionwiggins commented 5 years ago

Great. There are certainly performance issues to address, but functionally it is working well based on our testing. We will work on increasing speed soon.
