MuckRock / documentcloud-frontend

DocumentCloud's front end source code - Please report bugs, issues and feature requests to info@documentcloud.org
https://www.documentcloud.org
GNU Affero General Public License v3.0

Explore options for handling giant PDFs #844

Open eyeseast opened 3 days ago

allanlasser commented 2 days ago

The one performance benefit of rendering PDF pages as images, instead of downloading the PDF and rendering it with PDF.js, is that very large files can be incrementally downloaded and rendered. In the majority of cases, downloading the PDF and rendering it with PDF.js is faster than rendering an image for each page.

It appears to me that PDF.js supports partial rendering of linearized PDFs by default. In this demo, we can see in the network tab how each page is being independently fetched and rendered with 206 Partial Content responses. Linearizing a PDF requires additional steps during PDF creation (in Acrobat, there's a separate checkbox for "Fast Web View," which is a user-friendly name for linearization).
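To make the `206 Partial Content` behavior concrete, here is a minimal sketch of a probe that asks a server for the first byte of a file and checks whether it answers with a partial response. The function names are hypothetical, and this only tests the server side; it says nothing about how PDF.js itself decides to issue range requests.

```python
import urllib.error
import urllib.request


def is_partial_response(status: int, headers) -> bool:
    """Return True when a response looks like an honored Range request."""
    content_range = headers.get("Content-Range", "") or ""
    return status == 206 and content_range.lower().startswith("bytes ")


def supports_range_requests(url: str, timeout: float = 10.0) -> bool:
    """Request only the first byte and see whether the server replies with 206."""
    request = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return is_partial_response(response.status, response.headers)
    except urllib.error.HTTPError:
        # Error statuses (for example 416 Range Not Satisfiable) are raised
        # as exceptions; treat them as "no usable range support".
        return False
```

A server that ignores the `Range` header will typically answer `200` with the full body, which this probe reports as `False`. Running `supports_range_requests(...)` against one of our stored PDF URLs (not shown here) would be one way to see how the storage layer responds.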

From a quick look at DocumentCloud's processing pipeline, it doesn't appear to me that we linearize PDFs at any step. While QPDF and pikepdf both support linearization, it's an opt-in feature that's disabled by default. It's also unclear to me whether the S3 bucket needs any special configuration to support partial requests, or whether HTTP Range Requests are supported by default.
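One way to check whether files already in storage are linearized is to look for the linearization parameter dictionary, which a linearized PDF places at the very start of the file. The sketch below is a heuristic rather than a full PDF parse: the function names are hypothetical, and the 1024-byte window is my reading of the PDF specification's requirement for where that dictionary must sit.

```python
import re

# A linearized PDF starts with a dictionary containing a /Linearized key
# followed by a version number, e.g. "<< /Linearized 1 /L 1234 ... >>".
_LINEARIZED = re.compile(rb"/Linearized\s+\d")

# Assumption: the linearization dictionary sits within the first 1024 bytes.
HEAD_BYTES = 1024


def looks_linearized(head: bytes) -> bool:
    """Heuristically detect linearization from the first bytes of a PDF."""
    return _LINEARIZED.search(head[:HEAD_BYTES]) is not None


def file_looks_linearized(path: str) -> bool:
    """Read only the head of a local file and apply the same heuristic."""
    with open(path, "rb") as f:
        return looks_linearized(f.read(HEAD_BYTES))
```

Because only the head of the file is needed, the same check could be run against a ranged download instead of a local file. If we do decide to linearize during processing, the opt-in switches are `linearize=True` on pikepdf's `Pdf.save()` and `--linearize` on the `qpdf` command line.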

(Another optimization suggested by the PDF.js FAQ is to virtualize large PDFs to reduce memory overhead. It doesn't look like we're virtualizing our ResultsList either, so a virtualization pass could be a win across the board. I filed this as a separate issue, #849.)