PDFium does not render some PDF annotations. It drops them instead, which removes content from the uploaded document.
StackOverflow users suggested two possible fixes:
In PDFium there is an FPDF_ANNOT flag that can be passed to the various FPDF_RenderPage* methods. It's possible the PDFiumViewer code provides the same flag somewhere.
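For reference, here is a minimal sketch of what passing that flag could look like from Python. The flag values are the ones defined in PDFium's `fpdfview.h`. The names `RenderFlag` and `render_page`, and the `pdfium` ctypes handle, are hypothetical and only for illustration. Loading the library and creating the bitmap and page handles are not shown.

```python
from enum import IntFlag


class RenderFlag(IntFlag):
    """Subset of the page rendering flags defined in PDFium's fpdfview.h."""

    ANNOT = 0x01      # FPDF_ANNOT: render annotations
    LCD_TEXT = 0x02   # FPDF_LCD_TEXT
    GRAYSCALE = 0x08  # FPDF_GRAYSCALE
    PRINTING = 0x800  # FPDF_PRINTING


def render_page(pdfium, bitmap, page, width, height, flags=RenderFlag.ANNOT):
    """Render `page` into `bitmap`, including annotations by default.

    `pdfium` is assumed to be a ctypes handle to the PDFium shared library
    (hypothetical setup, not shown here).
    """
    # C signature: FPDF_RenderPageBitmap(bitmap, page, start_x, start_y,
    #                                    size_x, size_y, rotate, flags)
    pdfium.FPDF_RenderPageBitmap(bitmap, page, 0, 0, width, height, 0, int(flags))
```

One caveat worth checking: the comments in `fpdfview.h` indicate that `FPDF_ANNOT` covers annotations other than widget and popup annotations, and that form fields are drawn separately through `FPDF_FFLDraw`. If that holds for our version, this flag alone may not bring back filled-in form values.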
or

    // Load the document and look up the target page.
    doc = PDDocument.load(FilePath);
    PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get(pageNum);

    // Adjust the target rectangle for the page rotation and crop box.
    int rotPD = page.findRotation();
    PDRectangle pageBound = page.findCropBox();
    PDRectangle rect = ModifyRectAccordingToRotation(rectangle, rotPD, pageBound);

    // Build a link annotation with a zero-width underline border.
    PDAnnotationLink txtLink = new PDAnnotationLink();
    PDBorderStyleDictionary borderULine = new PDBorderStyleDictionary();
    borderULine.setStyle(PDBorderStyleDictionary.STYLE_UNDERLINE);
    borderULine.setWidth(0);
    txtLink.setBorderStyle(borderULine);

    // Point the link at an external file, opened in a new window.
    PDActionRemoteGoTo remoteGoto = new PDActionRemoteGoTo();
    PDComplexFileSpecification fileDesc = new PDComplexFileSpecification();
    fileDesc.setFile(System.IO.Path.GetFileName(path));
    remoteGoto.setOpenInNewWindow(true);
    remoteGoto.setFile(fileDesc);
    txtLink.setAction(remoteGoto);

    // Attach the annotation to the page.
    txtLink.setRectangle(rect);
    page.getAnnotations().add(txtLink);

I'm curious if you have any thoughts on how to fix this or if this is something you might look at resolving in the future. I think we're going to print the whole file and re-scan it for now. Thanks a lot.
Summary of the problem
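PDFium does not render some PDF annotations. It drops them instead, which removes content from the uploaded document.
StackOverflow users suggested two possible fixes: passing the FPDF_ANNOT flag to the FPDF_RenderPage* methods, or building the annotation with PDFBox, as in the snippet quoted at the top of this ticket.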
Steps to reproduce the bug
Upload this document.
[appellatecourtorder.pdf](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/08f287c4-0bce-49f1-914a-e656488e6ae1/appellatecourtorder.pdf)
Compare page two of the original:
![Page two of the original](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/b81ae403-20e9-4232-974e-ccb17274762d/Screen_Shot_2021-08-05_at_5.00.58_PM.png)
with page two of the version now hosted on DocumentCloud:
![Page two of the version hosted on DocumentCloud](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/99602f6b-dce6-4025-b96a-e6760642a479/Screen_Shot_2021-08-05_at_5.02.17_PM.png)
The URL you were on:
https://www.documentcloud.org/documents/20417920-appellatecourtorder
Other Examples:
Example 2
[From Matt DeLong](https://muckrock.zendesk.com/agent/tickets/79767): ([matt.delong@startribune.com](mailto:matt.delong@startribune.com))
As long as we’re diving into the world of PDF specs, might be worth revisiting.
[mcro_62-cv-19-3868_order-other_2022-07-11_20220711121948.pdf](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/0e1bbf5a-e63f-401e-a3b7-bb38c1558b96/mcro_62-cv-19-3868_order-other_2022-07-11_20220711121948.pdf)
Example 3
Source: https://muckrock.zendesk.com/agent/tickets/37335
I uploaded some PDFs that had been filled in and saved. The uploaded versions on Doc Cloud had all of the fillable info removed, so they were blank forms again.
Example doc from user:
[sample PDF for doccloud (1).pdf](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/47ec9849-02e6-48b5-8e30-26dc1741cfbb/sample_PDF_fordoccloud(1).pdf)
Example 4
This one is interesting because the agency did a poor redaction, and DocumentCloud strips it out entirely. The user flagged it when they asked why the bad redaction checker didn't work. They were told it was because DocumentCloud didn't see the bad redactions.
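To triage which uploads might be affected (filled-in forms as in Example 3, or annotation-based redactions as in Example 4), a rough pre-check could scan the raw file for annotation markers. This is only a heuristic sketch: the subtype names come from the PDF specification, `scan_annotation_markers` is a hypothetical helper, and a byte scan will miss anything stored inside compressed object streams, so an empty result does not prove a file is safe.

```python
import re
from collections import Counter

# Annotation subtypes of interest (names from the PDF specification).
ANNOTATION_SUBTYPES = {
    b"Widget", b"FreeText", b"Square", b"Highlight", b"Stamp", b"Redact", b"Ink",
}

_SUBTYPE_RE = re.compile(rb"/Subtype\s*/([A-Za-z]+)")


def scan_annotation_markers(pdf_bytes):
    """Count annotation-related markers found in the raw bytes of a PDF."""
    counts = Counter()
    # An /AcroForm entry signals that the document declares an interactive form.
    if b"/AcroForm" in pdf_bytes:
        counts["AcroForm"] = 1
    for match in _SUBTYPE_RE.finditer(pdf_bytes):
        name = match.group(1)
        # Skip non-annotation subtypes such as /Image or /Type1.
        if name in ANNOTATION_SUBTYPES:
            counts[name.decode("ascii")] += 1
    return counts
```

Usage would be something like `scan_annotation_markers(open(path, "rb").read())`, where `path` points at one of the example files above. Any non-empty result marks the upload as worth comparing against the original by eye.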