MuckRock / documentcloud-frontend

DocumentCloud's front end source code - Please report bugs, issues and feature requests to info@documentcloud.org
https://www.documentcloud.org
GNU Affero General Public License v3.0

DocumentCloud not properly processing PDF annotations #194

Open eyeseast opened 1 year ago

eyeseast commented 1 year ago

Summary of the problem

PDFium does not pick up some PDF annotations. It drops them instead, which removes content from the uploaded document.

Stack Overflow users suggested two possible fixes:

In PDFium there is an FPDF_ANNOT flag that can be passed to the various FPDF_RenderPage* methods. It's possible the PDFiumViewer code provides the same flag somewhere.

or

doc = PDDocument.load(FilePath);
PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get(pageNum);
int rotPD = page.findRotation();
PDRectangle pageBound = page.findCropBox();
PDRectangle rect = ModifyRectAccordingToRotation(rectangle, rotPD, pageBound);
PDAnnotationLink txtLink = new PDAnnotationLink();
PDBorderStyleDictionary borderULine = new PDBorderStyleDictionary();
borderULine.setStyle(PDBorderStyleDictionary.STYLE_UNDERLINE);
borderULine.setWidth(0);
txtLink.setBorderStyle(borderULine);
PDActionRemoteGoTo remoteGoto = new PDActionRemoteGoTo();
PDComplexFileSpecification fileDesc = new PDComplexFileSpecification();
fileDesc.setFile(System.IO.Path.GetFileName(path));
remoteGoto.setOpenInNewWindow(true);
remoteGoto.setFile(fileDesc);
txtLink.setAction(remoteGoto);
txtLink.setRectangle(rect);
page.getAnnotations().add(txtLink);
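
For reference, the first suggestion comes down to OR-ing `FPDF_ANNOT` into the `flags` argument of whichever `FPDF_RenderPage*` call is used. Below is a minimal sketch of that bitmask in Python. The constant values mirror PDFium's public `fpdfview.h` header; the helper `build_render_flags` is hypothetical, and this issue does not show how DocumentCloud's processing code actually invokes PDFium. As far as I know, form-field widgets are drawn by a separate form-fill call (`FPDF_FFLDraw`), so this flag alone may not cover the filled-in-form case.

```python
# Render flag values as defined in PDFium's public header fpdfview.h.
FPDF_ANNOT = 0x01      # render annotations along with page content
FPDF_LCD_TEXT = 0x02   # LCD-optimized text rendering
FPDF_GRAYSCALE = 0x08  # grayscale output


def build_render_flags(draw_annotations=True, grayscale=False):
    """Hypothetical helper: compose the `flags` argument for FPDF_RenderPage*."""
    flags = 0
    if draw_annotations:
        flags |= FPDF_ANNOT
    if grayscale:
        flags |= FPDF_GRAYSCALE
    return flags


flags = build_render_flags(draw_annotations=True)
print(f"annotations enabled: {bool(flags & FPDF_ANNOT)}")
```

If the render call currently passes `0` for `flags`, annotations are skipped, which would match the missing content described here. That is an assumption to verify against the actual processing code.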

Steps to reproduce the bug

Upload this document.

[appellatecourtorder.pdf](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/08f287c4-0bce-49f1-914a-e656488e6ae1/appellatecourtorder.pdf)

Compare page two from the original:

![Page two of the original PDF](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/b81ae403-20e9-4232-974e-ccb17274762d/Screen_Shot_2021-08-05_at_5.00.58_PM.png)

with page two from the version now hosted on DocumentCloud:

![Page two as rendered on DocumentCloud](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/99602f6b-dce6-4025-b96a-e6760642a479/Screen_Shot_2021-08-05_at_5.02.17_PM.png)

The URL you were on:

https://www.documentcloud.org/documents/20417920-appellatecourtorder

Other Examples:

Example 2

[From Matt DeLong](https://muckrock.zendesk.com/agent/tickets/79767): ([matt.delong@startribune.com](mailto:matt.delong@startribune.com))

We uploaded a court decision to Document Cloud yesterday that had a watermark and file stamp on every page, and the judge's signature on page 3. I'm guessing that stuff must be added electronically in an additional layer in the PDF, because none of it renders in Document Cloud's viewer. You can see the differences between the documents in [this link](https://s3.documentcloud.org/documents/22083491/mcro_62-cv-19-3868_order-other_2022-07-11_20220711121948.pdf) and [this link](https://www.documentcloud.org/documents/22083491-mcro_62-cv-19-3868_order-other_2022-07-11_20220711121948).

I'm curious if you have any thoughts on how to fix this or if this is something you might look at resolving in the future. I think we're going to print the whole file and re-scan it for now. Thanks a lot.

As long as we’re diving into the world of PDF specs, this might be worth revisiting.

[mcro_62-cv-19-3868_order-other_2022-07-11_20220711121948.pdf](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/0e1bbf5a-e63f-401e-a3b7-bb38c1558b96/mcro_62-cv-19-3868_order-other_2022-07-11_20220711121948.pdf)

Example 3

Source: https://muckrock.zendesk.com/agent/tickets/37335

I uploaded some PDFs that had been filled in and saved. The uploaded versions on Doc Cloud had all of the fillable info removed, so they were blank forms again.

Example doc from user:

[sample PDF for doccloud (1).pdf](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/47ec9849-02e6-48b5-8e30-26dc1741cfbb/sample_PDF_fordoccloud(1).pdf)
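
To triage reports like these, it may help to check whether an uploaded file carries annotations or form fields at all. Here is a rough, standard-library-only sketch that scans the raw bytes for the relevant PDF name tokens. The function name `find_annotation_markers` and the marker list are my own, not part of DocumentCloud. It is only a heuristic: objects stored in compressed object streams will not show up in a raw scan, so a negative result proves nothing.

```python
import re

# PDF name tokens suggesting content that lives outside the page content
# stream. Heuristic only: compressed object streams hide these tokens.
MARKERS = {
    "annotations": rb"/Annots\b",
    "form_fields": rb"/AcroForm\b",
    "stamps": rb"/Subtype\s*/Stamp\b",
    "widgets": rb"/Subtype\s*/Widget\b",
}


def find_annotation_markers(pdf_bytes):
    """Return which marker tokens appear anywhere in the raw PDF bytes."""
    return {
        name: re.search(pattern, pdf_bytes) is not None
        for name, pattern in MARKERS.items()
    }


# Tiny synthetic fragment (not a complete PDF), just to exercise the scan.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Page /Annots [2 0 R] >>\nendobj\n"
    b"2 0 obj\n<< /Type /Annot /Subtype /Stamp >>\nendobj\n"
)
print(find_annotation_markers(sample))
```

For a real file, read it with `open(path, "rb").read()` and pass the bytes in. Any `True` value suggests the document is worth comparing against its rendered version.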

Example 4

This one is interesting because the agency did a poor redaction, and DocumentCloud just strips that out entirely. The user flagged it when they asked why the bad redaction checker didn’t work, and were told it was because DocumentCloud didn’t see the bad redactions.

eyeseast commented 1 year ago

appellatecourtorder.pdf

eyeseast commented 2 months ago

This looks like it's fixed on the sveltekit branch but I'm going to leave it open for now.