Altinn / app-template-dotnet

Altinn Studio application template
BSD 3-Clause "New" or "Revised" License
2 stars 8 forks source link

Analysis - Challenges with the current PDF service #214

Open bjosttveit opened 5 months ago

bjosttveit commented 5 months ago

Description

There are some areas that could be improved related to the PDF generation process that are directly tied to the PDF service itself, browserless (v1).

In scope

Out of scope

Additional Information

No response

Analysis

Identified issues related to the current PDF service

  1. Browserless v1 seems to no longer get updates with new versions of chrome/puppeteer. Recent versions of chrome solves some accessibility issues we currently have.
  2. Browserless v1 captures the entire PDF in memory before sending it, as opposed to using puppeteers streaming API.
  3. Browserless seems to launch a fresh chrome instance for every request, this is a lot of overhead per request and probably not necessary.
  4. Chrome uses a lot of memory, and until recently the PDF-generator pods had very little memory allocated. Realistically, we probably cannot handle more than 2 concurrent requests if even that. Browserless does have a queue-system that can be configured to reduce the max number of concurrent requests before ending up in a queue. I am not sure if we currently set this, but the default seems to be 10 which is way too much for the resources we allocate.
  5. AFAIK we currently dont do any request caching. Loading the app-frontend bundle (and other assets from CDN) for every request is needlessly slow when it could be cached. Not sure how well the userDataDir solution in browserless would work.
  6. Browserless' waitFor option is quite limited. Ideally, we would like for app-frontend to be able to signal if loading the page failed, so we can quickly respond with an error. Today, if app frontend shows the "unknown error" page, that page will get printed. The alternative is waiting 30 seconds for it to time out if the #readyForPrint element is not found.
    • We currently do not allow the selector timeout to be configured in the app, although this is possible with the current browserless pdf API. It is not however possible to configure a second "error"-selector that will cause it to fail fast.
  7. The PDF generator seems to fail way to often. Its difficult for me to say exactly why, as I do not have access to logs, but crashing due to using too much memory from (2.) and possibly (4.) could be releated. However, there are also cases where the browser does not crash but the page simply not loading properly (as was recently reported). The cause of this is even harder to tell, it could be something in app-frontend or possibly related to the PDF service itself.

Possible solutions:

Upgrading browserless to v2
  1. Would probably solve (1.) as I assume this is kept more up-to-date with recent chrome/puppeteer versions.
  2. Would also solve (2.) as mentioned in their migration guide
  3. I have not tested this version, so not sure if it still lauches a new instance per request or not.
  4. Probably the same situation for (4.)
  5. Probably the same situation for (5.)
  6. Although the waitFor API has been extended to support more puppeteer waitFor-methods, there is still no possibility of specifying a selector that causes fast failure.
  7. Would need to test extensively to know if this has improved.
Creating our own chrome-based PDF generator

First of all, browserless seems to be a relatively thin wrapper around puppeteer (with a lot more functionality we dont need), so creating something similar just for generating PDFs seems to be relatively straight forward.

As for the issues:

  1. We would maintain dependencies ourselves, and puppeteer gets support for updated chrome versions relatively quickly.
  2. Using PDF streaming in for example puppeteer is trivial.
  3. We would choose ourselves how we manage browsers, contexts, and pages.
  4. We should still definitely queue requests to prevent many concurrent chrome pages
  5. We could trivially implement a fast memory-cache for CDN-assets, which could speed up generation significantly compared with no caching.
  6. We would have full control of when to print and when to error, based on for example puppeteers built in methods.
  7. Would still require extensive testing to identify and mitigate failure.

Conclusion

Even though browserless v2 should be an improvement in several areas, accessibility and possibly stability(?), its waitFor support is still very limited and not optimal to our use case. Upgrading browserless to v2 is probably the easiest, as the API is not very different. However, if we create our own, we could tailor it more to suit our needs where browserless falls short.

bjosttveit commented 3 months ago

Another option is using Playwright with C#, what all of these options have in common is that they all use the Chrome Dev-Tools Protocol (CDP) under the hood anyway, so in a sense they are all the same, and only differ in how much functionality and stability each tool can provide out-of-the-box. One thing that Playwright is missing is built in functionality for returning PDFs as streams, and only supports the "old" base64 option. However, I am pretty sure it is possible to make direct CDP-calls to the browser anyway, so this could probably be solved by implementing a function for handling this ourselves.

Since Playwright is maintained by Microsoft, it will probably have great support for the .NET libraries for the forseable future.

adamhaeger commented 3 months ago

Really good analysis @bjosttveit ! Another issue to consider is auth, we should have the option for machine to machine auth somehow, will be important for watermark functionality and other things :)

bjosttveit commented 3 months ago

Absolutely! Although I don't think the PDF service itself needs to concern itself with this. As long as the app is able to use a machine-token instead of the users token to access the required data it can just pass this on to the PDF-service the same way we do with the users token today, I think!

Unless I misunderstood the issue, how is machine auth related to watermarking the PDF?

Magnusrm commented 1 month ago

Based on the analysis done on this issue we need a discussion resulting in a decision about how to move forward with implementation.

bjosttveit commented 2 weeks ago

Update: It looks like the API documentation for browserless v1 has been removed, been getting a 404 for a couple of days now: https://chrome.browserless.io/docs/#/Browser%20API/post%20pdf