danfickle / openhtmltopdf

An HTML to PDF library for the JVM. Based on Flying Saucer and Apache PDF-BOX 2. With SVG image support. Now also with accessible PDF support (WCAG, Section 508, PDF/UA)!
https://danfickle.github.io/pdf-templates/index.html

Caching and Reuse of PDFormXObject (embedded pdf via img tag) #782

Open ganomi opened 2 years ago

ganomi commented 2 years ago

Hi,

We are rendering PDFs with multiple pages, where the header of each page contains a logo. The logo comes from another PDF file. Embedding this logo.pdf via an img tag works well, but the resulting PDF file grows considerably with each page, because the logo is embedded as a new XObject instance every time.

When I looked at the code, I understood that the PDFormXObject is not manipulated any further between its creation and the point where it is added to the PDF dictionary.

Therefore I added a small cache so that PDFBox receives the same PDFormXObject instance for a given src URL + page. PDFBox then recognizes that this object is already part of the PDF dictionary and reuses it.

Here is my code example. It works for me and the file size is much smaller. Does anyone know of a reason why this might not be a good idea?

This code is in com.openhtmltopdf.pdfboxout.PdfBoxPDFReplacedElement. See the use of the pdfCache map.

private static Map<String, PDFormXObject> pdfCache = new HashMap<>();

  public static PdfBoxPDFReplacedElement create(PDDocument target, byte[] pdfBytes, Element e, Box box, CssContext ctx, SharedContext shared) {
      try (PDDocument srcDocument = PDDocument.load(pdfBytes)){
          int pageNo = parsePage(e);
          if (pageNo >= srcDocument.getNumberOfPages()) {
              XRLog.log(Level.WARNING, LogMessageId.LogMessageId0Param.LOAD_PAGE_DOES_NOT_EXIST_FOR_PDF_IN_IMG_TAG);
              return null;
          }

          PDPage page = srcDocument.getPage(pageNo);
          float conversion = 96f / 72f;
          float width = page.getMediaBox().getWidth() * shared.getDotsPerPixel() * conversion;
          float height = page.getMediaBox().getHeight() * shared.getDotsPerPixel() * conversion;

          LayerUtility util = new LayerUtility(target);

          String cacheKey = e.getAttribute("src") + pageNo;

          PDFormXObject formXObject = pdfCache.get(cacheKey);

          if (formXObject == null){
              formXObject = util.importPageAsForm(srcDocument, page);
              pdfCache.put(cacheKey, formXObject);
          }

          return new PdfBoxPDFReplacedElement(formXObject, e, box, ctx, shared, width, height);
      } catch (InvalidPasswordException e1) {
          XRLog.log(Level.WARNING, LogMessageId.LogMessageId0Param.EXCEPTION_TRIED_TO_OPEN_A_PASSWORD_PROTECTED_DOCUMENT_AS_SRC_FOR_IMG, e1);
      } catch (IOException e1) {
          XRLog.log(Level.WARNING, LogMessageId.LogMessageId0Param.EXCEPTION_COULD_NOT_READ_PDF_AS_SRC_FOR_IMG, e1);
      }

      return null;
  }

And a question on the side: Why are normal images like PNG automatically reused?

ganomi commented 2 years ago

My main question regarding this change is whether it might lead to

If it is only the second point, we could also work with an HTML attribute like ohtml-reuse-pdf-instance="true" to mark the PDF images whose instances should be reused, because the user knows that reuse will be unproblematic since all pages have the same settings.

ganomi commented 2 years ago

I can see now that the static map is a little naive: on a second rendering it causes issues, because the objects in it are already closed. So this kind of cache would need to be per render instance.
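
A per-render cache could look roughly like the following. This is only a minimal sketch: PdfSourceKey and PerRenderCache are hypothetical names, not part of openhtmltopdf, and the type parameter V stands in for PDFormXObject so that the example has no PDFBox dependency. The idea is to create one cache per PDF run and discard it together with the target document.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Supplier;

/** Hypothetical key: the img src plus the page number inside that PDF. */
final class PdfSourceKey {
    private final String src;
    private final int page;

    PdfSourceKey(String src, int page) {
        this.src = Objects.requireNonNull(src);
        this.page = page;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PdfSourceKey)) return false;
        PdfSourceKey other = (PdfSourceKey) o;
        return page == other.page && src.equals(other.src);
    }

    @Override
    public int hashCode() {
        return Objects.hash(src, page);
    }
}

/**
 * Hypothetical per-render cache: one instance per PDF run, so no cached
 * object outlives the document it was imported into.
 * V stands in for PDFormXObject (assumption made to avoid PDFBox imports).
 */
final class PerRenderCache<V> {
    private final Map<PdfSourceKey, V> entries = new HashMap<>();

    /** Returns the cached value, or calls the importer once and caches its result. */
    V getOrImport(String src, int page, Supplier<V> importer) {
        return entries.computeIfAbsent(new PdfSourceKey(src, page), k -> importer.get());
    }

    int size() {
        return entries.size();
    }
}
```

In real code the importer would wrap the importPageAsForm call, which throws IOException, so the checked exception would have to be handled or rethrown inside the supplier.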

ganomi commented 2 years ago

Just wanted to share my current solution for making embedded PDF objects reusable when the same src is referenced multiple times, e.g. in headers.

With AspectJ I intercept the instantiation of PdfBoxPDFReplacedElement inside com.openhtmltopdf.pdfboxout.PdfBoxPDFReplacedElement#create.

Then I do some equality checks against my ThreadLocal cache and either reuse elements from it or populate the cache. I only do this for elements marked with the attribute data-cms-flags="reusecache", since I am still not sure about PDF standard conformance.

But at least in my use cases, VeraPDF does not report additional errors when the objects are reused, and Adobe, Foxit and Firefox display the PDF properly.


public class MyAspect {

    // withInitial is a static factory, so call it on the class rather than on a throwaway instance.
    public ThreadLocal<Map<EmbeddedPdfCacheKey, SoftReference<PDFormXObject>>> pdfCache = ThreadLocal.withInitial(HashMap::new);

    @Around("execution(com.openhtmltopdf.pdfboxout.PdfBoxPDFReplacedElement.new(..)) && within(com.openhtmltopdf.pdfboxout.PdfBoxPDFReplacedElement) && args(formXObject, e, box, ctx, shared, width, height)")
    public Object create(ProceedingJoinPoint joinPoint, PDFormXObject formXObject, Element e, Box box, CssContext ctx, SharedContext shared, float width, float height) throws Throwable {

        boolean useCacheForThisElem = StringUtils.containsIgnoreCase(e.getAttribute("data-cms-flags"), "reusecache");
        if (!useCacheForThisElem) {
            return joinPoint.proceed();
        }

        Map<EmbeddedPdfCacheKey, SoftReference<PDFormXObject>> cache = pdfCache.get();

        Set<EmbeddedPdfCacheKey> keys = cache.keySet();

        if (keys.size() > 0 && keys.stream().anyMatch(key -> shared != key.shared)) {
            logger.info("PDFCACHE Clearing cache for new shared context");
            cache.clear();
        }

        EmbeddedPdfCacheKey
            cacheKey = new EmbeddedPdfCacheKey(e.getAttribute("src"), parsePage(e), width, height, ctx, shared);

        SoftReference<PDFormXObject> formXObjectFromCacheSoft = cache.get(cacheKey);

        PDFormXObject formXObjectFromCache = null;

        if (formXObjectFromCacheSoft != null) {
            PDFormXObject formXObjectFromCacheHard = formXObjectFromCacheSoft.get();
            if (formXObjectFromCacheHard != null) {
                formXObjectFromCache = formXObjectFromCacheHard;
            } else {
                logger.info("PDFCACHE Clearing cache for soft reference of resource " + cacheKey.url);
                cache.remove(cacheKey);
            }
        }

        if (formXObjectFromCache != null) {
            logger.info("PDFCACHE Reusing cached PDF for " + e.getAttribute("src") + " with Object " + formXObjectFromCache);
            return joinPoint.proceed(new Object[]{formXObjectFromCache, e, box, ctx, shared, width, height});
        } else {
            logger.info("PDFCACHE Saving PDF to cache for " + e.getAttribute("src") + " with Object " + formXObject);
            pdfCache.get().put(cacheKey, new SoftReference<>(formXObject));
            return joinPoint.proceed();
        }
    }
}

In another aspect I also clear the cache explicitly, since it should not be reused for another PDF run anyway.

    @Around("execution(void com.openhtmltopdf.render.displaylist.PagedBoxCollector.collect(com.openhtmltopdf.css.style.CssContext, com.openhtmltopdf.layout.Layer))")
    public void interceptFinalLayout(ProceedingJoinPoint joinPoint) throws Throwable {

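        // The cache was filled earlier in this run; clear it as soon as possible, which is here.
        if (pdfCache.get() != null && pdfCache.get().size() > 0){
            logger.info("PDFCACHE Clearing pdf cache with size {}", pdfCache.get().size());
            pdfCache.get().clear();
        }

        // @Around advice has to invoke the intercepted collect(..) call explicitly.
        joinPoint.proceed();
    }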

And for completeness, here is the cache key class:

public class EmbeddedPdfCacheKey {

    String url;
    int page;
    float width;
    float height;
    CssContext ctx;
    SharedContext shared;

    public EmbeddedPdfCacheKey(String url, int page, float width, float height, CssContext ctx,
                               SharedContext shared) {
        this.url = url;
        this.page = page;
        this.width = width;
        this.height = height;
        this.ctx = ctx;
        this.shared = shared;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof EmbeddedPdfCacheKey)) {
            return false;
        }
        EmbeddedPdfCacheKey cacheKey = (EmbeddedPdfCacheKey) o;
        return page == cacheKey.page && Float.compare(cacheKey.width, width) == 0 &&
            Float.compare(cacheKey.height, height) == 0 && url.equalsIgnoreCase(cacheKey.url) &&
            ctx.equals(cacheKey.ctx) && shared.equals(cacheKey.shared);
    }

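    @Override
    public int hashCode() {
        // equals() compares url case-insensitively, so hash a case-normalized url;
        // otherwise two equal keys could end up with different hash codes.
        return Objects.hash(url.toLowerCase(java.util.Locale.ROOT), page, width, height, ctx, shared);
    }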
}
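
Because equals above compares the URL with equalsIgnoreCase, a HashMap lookup only finds such a key reliably if hashCode ignores case as well. A minimal sketch with a hypothetical, reduced key (UrlPageKey, holding just url and page; neither class below is part of openhtmltopdf) shows the pairing:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;

/** Hypothetical reduced key: only url and page, with the url compared case-insensitively. */
final class UrlPageKey {
    final String url;
    final int page;

    UrlPageKey(String url, int page) {
        this.url = Objects.requireNonNull(url);
        this.page = page;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof UrlPageKey)) return false;
        UrlPageKey other = (UrlPageKey) o;
        return page == other.page && url.equalsIgnoreCase(other.url);
    }

    @Override
    public int hashCode() {
        // Hash the lower-cased url so keys that equals() treats as equal share a hash code.
        return Objects.hash(url.toLowerCase(Locale.ROOT), page);
    }
}

final class UrlPageKeyDemo {
    /** Stores a value under one spelling of the url and looks it up under another. */
    static boolean lookupIgnoresCase() {
        Map<UrlPageKey, String> cache = new HashMap<>();
        cache.put(new UrlPageKey("Logo.pdf", 0), "form");
        return "form".equals(cache.get(new UrlPageKey("logo.PDF", 0)));
    }
}
```

If URLs should be treated as case-sensitive instead, using plain equals on the url in both methods would be the simpler choice.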