Identify alternatives and decide on a PDF generator

OddArneSaetervik commented 2 years ago

Naturally, this will depend on what the HTML documents end up looking like, that is, how rich they may become. We assume this will be clarified in #13.

Identified candidates are:

https://itextpdf.com/en/products/itext-7/convert-html-css-to-pdf-pdfhtml
https://www.puppeteersharp.com/
Find out which library/application Altinn 2 uses. Potentially use the Altinn 2 PDF service?
This issue also leaves room to look for other candidates.

Finally, we must decide which alternative to run a PoC on.

Analysis

Licensing

Open source software always comes with a license. We must evaluate each license if we decide on an open source product. Some open source products also have paid versions we could consider.

Rendering

There are primarily two types of products. The first kind has its own HTML renderer. The other kind uses an integrated (or external) browser to perform the rendering, combined with a "save as PDF" type of mechanism. This second kind will probably always render something closer to what the user would see in a browser.

Input types

The tools take an HTML document, a URL, or both as input.

PDF/A and PDF/UA

For tools that don't support PDF/A, it might be possible to post-process the output with a third-party tool such as Ghostscript. Another important standard is PDF/UA (universal accessibility).
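As a rough illustration of the Ghostscript idea, the Python sketch below only builds the command line for a PDF/A conversion; it does not run it. The function name and file names are hypothetical. The flags are taken from Ghostscript's pdfwrite documentation as far as we know it, and a real conversion normally also needs a PDFA_def.ps file with an ICC profile, so treat this as a starting point rather than a verified recipe.

```python
def ghostscript_pdfa_command(source_pdf: str, target_pdf: str, pdfa_level: int = 3) -> list[str]:
    """Build (but do not run) a Ghostscript command that re-writes a PDF as PDF/A.

    Hypothetical helper for illustration; the output still has to be checked
    with a PDF/A validator.
    """
    if pdfa_level not in (1, 2, 3):
        raise ValueError("pdfa_level must be 1, 2 or 3")
    return [
        "gs",
        f"-dPDFA={pdfa_level}",           # requested PDF/A conformance level
        "-dBATCH",                        # exit when the input is processed
        "-dNOPAUSE",                      # do not prompt between pages
        "-sDEVICE=pdfwrite",              # write a new PDF
        "-sColorConversionStrategy=RGB",  # PDF/A needs device-independent colour handling
        "-dPDFACompatibilityPolicy=1",    # drop features that would break conformance
        f"-sOutputFile={target_pdf}",
        source_pdf,
    ]


command = ghostscript_pdfa_command("generated.pdf", "generated-pdfa.pdf")
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where Ghostscript is installed.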

Weighting

Criterion	Weight
Output must be PDF/A-3 and PDF/UA	60%
Implementation effort	20%
Cost	20%
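To make the weighting concrete, here is a minimal Python sketch of how per-criterion scores could be combined into one number. Only the 60/20/20 weights come from this issue; the tool names and the 0-10 scores are invented purely for illustration and are not an assessment of any real candidate.

```python
# Weights from the weighting above; they sum to 1.0.
WEIGHTS = {"pdfa3_ua": 0.6, "implementation_effort": 0.2, "cost": 0.2}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10, higher is better) into one number."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example_candidates = {
    "tool_a": {"pdfa3_ua": 10, "implementation_effort": 5, "cost": 5},
    "tool_b": {"pdfa3_ua": 5, "implementation_effort": 10, "cost": 10},
}

# Highest weighted score first.
ranking = sorted(
    example_candidates,
    key=lambda name: weighted_score(example_candidates[name]),
    reverse=True,
)
```

With these made-up numbers the tool that is stronger on PDF/A-3 and PDF/UA ranks first even though it is weaker on the other two criteria, which is the effect the 60% weight is meant to have.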

Candidates

Tool	License	Rendering	Input	PDF/A	PDF/UA	Comment
iText Paid	Paid	Custom?	HTML	Yes	Yes	€ 6540,-/1 year for 60 000 PDF. Need quote for higher volumes.
Puppeteer Sharp	MIT	Browser	URL	No	?	Small "private" project. Active, with release 7.0 in April.
Syncfusion	Paid	Chromium/ QtWebkit	Both	Yes	Yes	Developer packs too small (5). Need quote for project license.
IronPdf	Paid	Chromium	Both	No	Yes	Developer licenses. $ 3000,- No limit, perpetual. Support $ 400,- per year.
PDFReactor	Paid	Custom?	URL	Yes	Yes	Licensed per CPU core, which corresponds to the number of simultaneous PDF conversions. If we have multiple machines we would need multiple licenses. Available as a cloud service and as a Java library.
ExpertPdf	Paid	Custom?	Both	No?	?	Unable to find any information about accessibility support.

Discarded candidates

Tool	Comment
iText OS	The open source variant of this library has an aggressive license. The AGPL license makes it hard to use, as it could force us to adopt the AGPL license as well.
EO.Pdf	Can't find any recent documentation indicating that EO supports accessibility or can produce tags for a screen reader. No support?
SelectPdf	There is a free community edition, but it appears to be limited to 5 pages. Requires a full version of Windows, as it relies on system libraries not available in Windows Server Core or Nano. We want a lightweight OS for our containers.
dompdf	PHP application. Can load HTML and output a PDF. Does not support adding tags (accessibility).
wkhtmltopdf	Command line tool. Might be a dying project. Excluded because of its uncertain future.

bengtfredh commented 2 years ago

A couple of tools that could be included in the evaluation:

https://github.com/dompdf/dompdf (HTML to PDF converter)
https://wkhtmltopdf.org/ (command line tools to render HTML into PDF)

tba76 commented 2 years ago

Altinn 2 uses the PDF solution from Essential Objects. https://www.essentialobjects.com/#Pdf

olemartinorg commented 2 years ago

I can mention that I have had good experiences with wkhtmltopdf in the past. In principle it is a fairly simple tool, and the result is roughly the same as printing a web page from Chrome. At the very least it gives a lot of flexibility without having to learn a separate library specifically for creating PDFs.

elsand commented 2 years ago

SelectPDF is Windows-only and uses the System.Drawing APIs. These are not available on Windows Server Core or Windows Nano (https://docs.microsoft.com/en-us/dotnet/api/system.drawing?view=net-6.0), so I don't think this one is relevant.

elsand commented 2 years ago

https://docs.microsoft.com/en-us/dotnet/core/compatibility/core-libraries/6.0/system-drawing-common-windows-only

There may be other .NET-based candidates on the list that use System.Drawing, which in practice means Windows-only. Use of libgdiplus as a workaround on Linux is removed in .NET 7. (SelectPDF also has bindings to kernel32.dll.)

Also worth noting is what is mentioned at https://wkhtmltopdf.org/status.html, namely that the tool should not be used on "untrusted HTML". This probably applies to some degree to all libraries that take HTML/CSS/JS as input. If this is intended as a central service, it will require control over whether the apps can directly influence how the HTML is generated, if they should be able to at all. This must then be balanced against the apps' need for flexibility in how the PDFs are generated.

elsand commented 2 years ago

I have tested IronPDF a bit. By default it also uses Chromium behind the scenes. Out of the box it downloads Chrome and other dependencies at runtime(!), which was a bit of a hassle to get working with permissions etc. in Docker. However, you can pull in a special version of the NuGet package, IronPdf.Linux, together with the renderer you choose, e.g. IronPdf.Native.Chrome.Linux, which ensures that everything is fetched at build time. This adds a runtimes folder to the output containing binaries, including Chrome. /app/runtimes/linux-x64/native is 411 MB in the image.

Things I noted:

Alpine is NOT supported because musl is not supported by Chromium. I used the -focal tag on the images instead, which is Ubuntu 20.
Every attempt to install assets at runtime fails, and not always with meaningful messages in Insights. This must be disabled, see https://ironpdf.com/docs/questions/docker-linux/ (set IronPdf.Installation.LinuxAndDockerDependenciesAutoConfig=false;)
The Dockerfile must contain RUN chmod 755 runtimes/linux-x64/native/IronCefSubprocess
I am unsure how thread-safe and scalable this is, since a subprocess is used, but it seems to be quite fast.

Other useful info:

SandGrainOne commented 2 years ago

So far, IronPdf and iText are perhaps the ones that stand out. Very good that you are testing IronPdf.

SandGrainOne commented 2 years ago

Quote HansO

Server-side HTML to PDF is a little monster. I am skeptical of an app developer suddenly being able to take over the platform cluster through a vulnerability in the PDF generator. We don't have thaaat much trust in them.

altinnadmin commented 2 years ago

My thoughts...

Principles for PDF generation

PDF generation should "just work"

When new UI components, widgets, page layouts, process steps, dynamics, translations or other new app features are introduced, PDFs should be generated without requiring additional effort.

How to achieve this:

The apps must produce the HTML that is to be converted to PDF
This is the same HTML that is produced when a user opens an app in a browser, and it opens up the possibility of generating PDFs for all kinds of app pages
PDF generation must use a widely used browser engine for rendering.
Chromium is the most popular engine in the world by far.

PDFs should look good

When creating an app with a nice responsive user interface, the generated PDF should automatically reflect this interface.

How to achieve this:

We use print CSS to optimize print and PDF
App developers use our grid system to optimize the responsive layout for various sizes, including print/PDF.
PDF generation must use a modern browser engine for rendering. Chromium-based browsers have been in the lead for a long time.

PDFs should be accessible

The generated PDFs must be accessible and follow WCAG and the requirements for PDF/UA.

How to achieve this:

Focus on WCAG 2.1 support in the app front-end. This is the foundation.
Generate tagged PDFs from the app HTML. Headless Chrome supports this.
Embed fonts in PDF. This is also a central requirement for PDF/A.

PDF generation should be scalable and robust

Generating PDFs can be heavy, typically consuming a lot of compute and memory. When load increases, PDF generation needs to scale automatically, and scaling should only impact the relevant app owner.

How to achieve this:

We run the PDF generator in containers running in each app's cluster, like KubernetesWrapper
We should not run PDF generation as part of the apps
When load increases, pods and nodes for only the relevant app owner can scale automatically, independent of Platform
This means cost related to PDF-generation is directly linked to the cluster of the app owner

PDF generation should be secure

Rendering HTML and running JavaScript server-side in a browser is a security challenge, since we can't trust the HTML. This challenge needs to be handled.

How to achieve this:

We run the PDF generator in containers running in each app's cluster, isolating rendering of HTML from Platform
Chromium implements a sandbox
A container is also a kind of sandbox that can be hardened
Patching of Chromium needs to be done continuously
We should limit the allowed URL input
Authentication needs to be handled
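The point about limiting the allowed URL input could be sketched roughly as follows in Python. This is an illustration under assumptions, not a design decision from this issue: the host name in the allow-list is made up, and the function name is hypothetical. The idea is simply to accept only https URLs that point at explicitly allowed hosts, and to reject everything else (other schemes, unknown hosts, URLs with embedded credentials).

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; real host names would come from configuration.
ALLOWED_HOSTS = frozenset({"ttd.apps.example.com"})


def is_allowed_pdf_source(url: str, allowed_hosts: frozenset[str] = ALLOWED_HOSTS) -> bool:
    """Return True only for https URLs that point at an explicitly allowed host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (for example a broken IPv6 literal) are rejected.
        return False
    if parts.scheme != "https":
        return False
    # Reject "user@host" style URLs, which can hide the real target host.
    if parts.username is not None or parts.password is not None:
        return False
    host = parts.hostname  # already lower-cased by urlsplit
    return host is not None and host in allowed_hosts
```

A check like this would run in the API in front of the browser, before any URL is handed to Chromium.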

PDF generation should follow the Altinn 3 architecture principles

Altinn 3 is following a set of architecture principles, PDF generation should do that as well.

Relevant Altinn 3 principles:

Free and open-source software
- Chromium is open source
Isolation. See chapter above.
Security in depth. See chapter above.

How this could be implemented

We create a new container with headless Chromium and a simple API for returning a PDF based on a URL.
- We could use the open source browserless container image for a first proof of concept.
- Here is also some Go-based inspiration
The API is used by app backend, in a similar way as app backend today calls altinn-pdf, except that the input is an app instance URL and cookie.
App backend stores the PDF in Storage, just as today. The PDF container should not have access to Platform APIs.
To be able to store everything in a PDF we need to create a new "everything-view" in app-frontend that concatenates all pages and applies all dynamics.
- This is something the Studio team needs to do anyway to be able to answer the "what did I submit as a user?" question for archived apps.
- Other pages like Summary, Receipt, Payment, etc. can also be converted to PDF in the same way; all we need is a URL
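A rough Python sketch of what the call from app backend to such a PDF container could look like is shown below. It only builds the HTTP request and does not send it. It assumes a browserless-style `/pdf` endpoint that accepts a JSON body with `url`, `options` and `cookies`; that request shape, the function name and all parameter names are assumptions for illustration and would have to be checked against the image actually used.

```python
import json
from urllib.request import Request


def build_pdf_request(
    generator_base_url: str,
    instance_url: str,
    cookie_name: str,
    cookie_value: str,
    cookie_domain: str,
) -> Request:
    """Build (but do not send) a POST request asking the PDF container to render a URL.

    Hypothetical helper; the payload layout is an assumed browserless-style API.
    """
    payload = {
        "url": instance_url,
        # Options handed on to the browser's print-to-PDF function.
        "options": {"format": "A4", "printBackground": True},
        # The user's session cookie, so the browser can open the app instance.
        "cookies": [
            {"name": cookie_name, "value": cookie_value, "domain": cookie_domain}
        ],
    }
    return Request(
        url=f"{generator_base_url.rstrip('/')}/pdf",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

App backend would send the request (for example with `urllib.request.urlopen`), read the PDF bytes from the response and store them in Storage itself, so the PDF container never needs access to Platform APIs.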

Risks

Performance and time used for rendering. How fast is headless Chromium?
Security, isolation between user sessions in Chrome
Stability, Chrome can be a hog and a beast

ivarne commented 2 years ago

@altinnadmin Great writeup!

We considered the performance and stability issues when solving the same problem at UDI, and our architect used ServiceBus to ensure that we could run a series of async operations on different Azure Functions with proper retry, with the possibility of saying OK to the user and processing the PDF generation backlog at a later time.
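The retry-and-backlog idea can be illustrated with a small Python sketch. It uses an in-memory queue as a stand-in for a real message broker such as ServiceBus, and the function and parameter names are hypothetical; it only shows the control flow of retrying failed jobs and setting aside those that keep failing.

```python
from collections import deque
from typing import Callable


def drain_backlog(
    jobs: list[str],
    generate: Callable[[str], bytes],
    max_attempts: int = 3,
) -> tuple[dict[str, bytes], list[str]]:
    """Process queued PDF jobs, re-queueing failures until max_attempts is reached."""
    queue = deque((job, 1) for job in jobs)
    done: dict[str, bytes] = {}
    dead_letter: list[str] = []
    while queue:
        job, attempt = queue.popleft()
        try:
            done[job] = generate(job)
        except Exception:
            if attempt < max_attempts:
                # Put the job at the back of the queue and try again later.
                queue.append((job, attempt + 1))
            else:
                # Give up and leave the job for manual follow-up.
                dead_letter.append(job)
    return done, dead_letter
```

Because the user has already received an OK when the job was queued, a slow or temporarily failing renderer delays the PDF instead of failing the submission.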

TheTechArch commented 2 years ago

I believe that some orgs would want to create their PDF view not based on an automatic merging of all pages. There are probably lots of services that want a different "custom view". Should this be covered?

altinnadmin commented 2 years ago

There are probably lots of services that want a different "custom view". Should this be covered?

Yes, if that's a need we could add config, so that for example the Summary page could be used instead, or perhaps in addition. Remember that the user probably has a need to be able to see everything regardless of the need of the app owner.

Ref. bullet 4 above: "Other pages like Summary, Receipt, Payment, etc. can also be converted to PDF in the same way; all we need is a URL"

ivarne commented 2 years ago

Yes, if that's a need we could add config

Or a reasonable extension point where app devs can add their own logic for generating artefacts at appropriate times. Config can come later, when we want to make it possible to do this in Altinn Studio.

ddrune commented 2 years ago

@SandGrainOne, @altinnadmin I know of some good alternatives that offer sandboxing through containerization, file-based ingress/egress, or use through APIs. The well-known and extensive pandoc framework uses xelatex (i.e. XeTeX), but via the LaTeX format and files (UNIX philosophy). Interestingly, pandoc also uses Prince, which is free for non-commercial use, while governmental use carries a fee. They are also governed by Norwegian license law. They offer excellent PDF/UA capabilities as well, and file-based ingress/egress.

altinnadmin commented 2 years ago

Thanks @ddrune . I had a look at the wikipedia page for Prince:

CSS Grid Layout (css-grid-1) is not yet present in Prince 14.

and

Prince supports most of ECMAScript 5th edition, but not strict mode. Later editions of ECMAScript are largely not supported.

I don't think Prince is able to fulfill our first principle, PDF generation should "just work", given that our apps are modern React applications. A lot has happened since the 5th edition of ECMAScript was released way back in 2009.

SandGrainOne commented 1 year ago

We eventually landed on using the latest version of browserless/chrome from browserless.io. This will be hosted together with the application owner's apps in the app owner specific AKS. I suspect that we will eventually want to replace it with something over which we have more direct control, but for now this issue is being closed as completed.

Altinn / altinn-pdf