Altinn / altinn-pdf

Altinn platform microservice for generating PDFs

Identify alternatives and decide on a PDF generator #16

Closed OddArneSaetervik closed 1 year ago

OddArneSaetervik commented 2 years ago

This will of course depend on what the HTML documents end up looking like, that is, how rich they may become. We assume this will be clarified in #13.

Identified candidates are:

Finally, we must decide which alternative to run a PoC on.

Analysis

Licensing

Open source always comes with a license. If we decide on an open source product, we must consider its license carefully. Some open source products also have paid versions that we could consider.

Rendering

There are primarily two types of products. The first kind has its own HTML renderer. The other kind uses an integrated (or external) browser to perform the rendering, combined with a "save as PDF" type of mechanism. This second kind will probably always render something closer to what the user would see in a browser.

Input types

The tools take an HTML document, a URL, or both as input.

PDF/A and PDF/UA

For tools that don't support PDF/A, it might be possible to use a third-party tool such as Ghostscript. Another important standard is PDF/UA (universal accessibility).

Weighting

| Criterion | Weight |
| --- | --- |
| Output must be PDF/A-3 and UA | 60% |
| Implementation effort | 20% |
| Cost | 20% |
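
One way to apply the weights listed above is a plain weighted sum. The sketch below is only an illustration: `weighted_score` is a hypothetical helper, and the 0 to 1 scores in the two examples are made-up placeholders, not results from this evaluation.

```python
# Weights come from the weighting above; everything else is illustrative.
WEIGHTS = {"pdfa3_ua": 0.60, "implementation_effort": 0.20, "cost": 0.20}


def weighted_score(scores: dict[str, float]) -> float:
    """Return the weighted sum of per-criterion scores (each in the range 0..1)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, NOT measurements from this issue.
example_a = {"pdfa3_ua": 1.0, "implementation_effort": 0.5, "cost": 0.25}
example_b = {"pdfa3_ua": 0.5, "implementation_effort": 1.0, "cost": 1.0}

print(round(weighted_score(example_a), 2))
print(round(weighted_score(example_b), 2))
```

With a 60% weight on standards support, a tool that fully meets PDF/A-3 and UA can outrank one that is cheaper and easier to integrate but only partly compliant.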

Candidates

| Tool | License | Rendering | Input | PDF/A | PDF/UA | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| iText Paid | Paid | Custom? | HTML | Yes | Yes | € 6540,- per year for 60 000 PDFs. Need quote for higher volumes. |
| Puppeteer Sharp | MIT | Browser | URL | No | ? | Small "private" project. Active, with release 7.0 in April. |
| Syncfusion | Paid | Chromium / QtWebkit | Both | Yes | Yes | Developer packs too small (5). Need quote for project license. |
| IronPdf | Paid | Chromium | Both | No | Yes | Developer licenses. $ 3000,-, no limit, perpetual. Support $ 400,- per year. |
| PDFReactor | Paid | Custom? | URL | Yes | Yes | Licensed per CPU core, which means per simultaneous PDF conversion. If we have multiple machines we would need multiple licenses. Available as a cloud service and as a Java library. |
| ExpertPdf | Paid | Custom? | Both | No? | ? | Unable to find any information about accessibility support. |

Discarded candidates

| Tool | Comment |
| --- | --- |
| iText OS | The open source variant of this library has an aggressive license. The AGPL license makes it hard to use, as it could force us to use the AGPL license as well. |
| EO.Pdf | Can't find any recent documentation showing that EO supports accessibility or can produce tags for a screen reader. No support? |
| SelectPdf | There is a free community edition, but it appears to be limited to 5 pages. Requires a full version of Windows, as it relies on system libraries not available in Core or Nano. We want a lightweight OS for our containers. |
| dompdf | PHP application. Can load HTML and output a PDF. Does not support adding tags (accessibility). |
| wkhtmltopdf | Command line tool. Might be a dying project. Excluded because of the uncertain future. |

bengtfredh commented 2 years ago

A couple of tools that could be included in the evaluation: https://github.com/dompdf/dompdf # HTML to PDF converter https://wkhtmltopdf.org/ # command line tools to render HTML into PDF

tba76 commented 2 years ago

Altinn 2 uses Essential Objects' PDF solution. https://www.essentialobjects.com/#Pdf

olemartinorg commented 2 years ago

I can mention that I have had good experiences with wkhtmltopdf in the past. In principle it is a fairly simple tool, and using it is roughly the same as printing a web page from Chrome. At the very least it gives a lot of flexibility without having to learn a separate library specifically for creating PDFs.

elsand commented 2 years ago

SelectPDF is Windows-only and uses the System.Drawing APIs. These are not available on Windows Server Core or Windows Nano (https://docs.microsoft.com/en-us/dotnet/api/system.drawing?view=net-6.0), so I don't think this one is relevant.

elsand commented 2 years ago

https://docs.microsoft.com/en-us/dotnet/core/compatibility/core-libraries/6.0/system-drawing-common-windows-only

There may be other .NET-based tools on the list that use System.Drawing, which in practice means Windows-only. The use of libgdiplus as a workaround on Linux is removed in .NET 7. (SelectPDF also has bindings to kernel32.dll.)

It is also worth noting what is mentioned at https://wkhtmltopdf.org/status.html, namely that it cannot be used on "untrusted HTML". This probably applies to some degree to all libraries that take HTML/CSS/JS as input. If this is intended as a central service, it will require control over whether the apps can directly influence how the HTML is generated, if they should be able to at all. This must then be balanced against the apps' need for flexibility in how the PDFs are generated.

elsand commented 2 years ago

I have tested IronPDF a bit. By default it also uses Chromium behind the scenes. Out of the box it downloads Chrome and other dependencies at runtime(!), which was a bit of a hassle to get working with permissions etc. in Docker. But you can pull in a special version of the NuGet package, IronPdf.Linux, together with the renderer you choose, e.g. IronPdf.Native.Chrome.Linux, which makes sure everything is fetched at build time. This adds a runtimes folder to the output, containing binaries including Chrome. /app/runtimes/linux-x64/native is 411 MB in the image.

Things I noted:

Otherwise useful info:

SandGrainOne commented 2 years ago

So far, IronPdf and iText are perhaps the ones that stand out. Very good that you are testing IronPdf.

SandGrainOne commented 2 years ago

Quoting HansO:

HTML to PDF on the server side is a little monster. I am skeptical about an app developer suddenly being able to take over the platform cluster through a vulnerability in the PDF generator. We don't have thaaat much trust in them.

altinnadmin commented 2 years ago

My thoughts...

Principles for PDF generation

PDF generation should "just work"

When new UI components, widgets, page layouts, process steps, dynamics, translations or other new app features are introduced, PDFs should be generated without requiring additional effort.

How to achieve this:

PDFs should look good

When creating an app with a nice responsive user interface, the generated PDF should automatically reflect this interface.

How to achieve this:

PDFs should be accessible

The generated PDFs must be accessible and follow WCAG and the requirements for PDF/UA.

How to achieve this:

PDF generation should be scalable and robust

Generating PDFs can be heavy, typically consuming a lot of compute and memory. When load increases, PDF generation needs to scale automatically, and scaling should only impact the relevant app owner.

How to achieve this:

PDF generation should be secure

Rendering HTML and running JavaScript server side in a browser is a security challenge, since we can't trust the HTML. This challenge needs to be handled.

How to achieve this:

PDF generation should follow the Altinn 3 architecture principles

Altinn 3 follows a set of architecture principles, and PDF generation should do so as well.

Relevant Altinn 3 principles:

How this could be implemented

  1. We create a new container with headless Chromium and a simple API for returning a PDF based on a URL.
  2. The API is used by app backend, in a similar way as app backend today calls altinn-pdf, except that the input is an app instance URL and cookie.
  3. App backend stores the PDF in Storage, just as today. The PDF container should not have access to Platform APIs.
  4. To be able to store everything in a PDF we need to create a new "everything-view" in app-frontend that concatenates all pages and applies all dynamics.
    • This is something the Studio team needs to do anyway to be able to answer the "what did I submit as a user?" question for archived apps.
    • Other pages like Summary, Receipt, Payment, etc. can also be converted to PDF in the same way; all we need is a URL.
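
Steps 1 and 2 above could look roughly like this from the app backend's side. This is a sketch under assumptions: the JSON field names (`url`, `cookies`, `options`), the cookie layout and the allow-list check are hypothetical and not taken from this issue, and no specific PDF service API is implied. The helper only validates the URL and builds the request body; sending it is left out.

```python
import json
from urllib.parse import urlsplit


def build_pdf_request(instance_url: str, cookie_name: str, cookie_value: str,
                      allowed_hosts: set[str]) -> str:
    """Build a JSON request body for a hypothetical URL-to-PDF container.

    The URL is checked against an allow-list first, because the PDF
    container will render whatever it is pointed at.
    """
    parts = urlsplit(instance_url)
    if parts.scheme != "https" or parts.hostname not in allowed_hosts:
        raise ValueError(f"refusing to render untrusted URL: {instance_url}")

    body = {
        "url": instance_url,
        # The user's session cookie lets the headless browser load the
        # instance as the user sees it (field layout is an assumption).
        "cookies": [{"name": cookie_name, "value": cookie_value,
                     "domain": parts.hostname}],
        # Assumed print options; a real service may name these differently.
        "options": {"format": "A4", "printBackground": True},
    }
    return json.dumps(body)
```

Because the input is just a URL plus a cookie, the same helper would cover the "everything-view" as well as pages like Summary or Receipt without any page-specific logic.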

Risks

ivarne commented 2 years ago

@altinnadmin Great writeup!

We considered the performance and stability issues when solving the same problem at UDI. Our architect used ServiceBus so that we could run a series of async operations on different Azure Functions with proper retry, and also have the option of saying OK to the user and processing the PDF generation backlog at a later time.
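
The retry-and-backlog idea can be illustrated without any message broker. The sketch below is an in-memory stand-in, not ServiceBus or Azure Functions code: `process_backlog`, `generate_pdf` and `max_attempts` are hypothetical names, and a real broker would handle delivery, delays and dead-lettering itself.

```python
from collections import deque
from typing import Callable


def process_backlog(jobs: list[str], generate_pdf: Callable[[str], bytes],
                    max_attempts: int = 3) -> tuple[dict[str, bytes], list[str]]:
    """Drain a backlog of PDF jobs, re-queueing failures up to max_attempts.

    Returns the generated PDFs keyed by job id, plus the ids that were given
    up on (a real broker would move those to a dead-letter queue).
    """
    queue = deque((job, 1) for job in jobs)
    done: dict[str, bytes] = {}
    dead: list[str] = []
    while queue:
        job, attempt = queue.popleft()
        try:
            done[job] = generate_pdf(job)
        except Exception:
            if attempt < max_attempts:
                # Put the job at the back so other jobs are not blocked.
                queue.append((job, attempt + 1))
            else:
                dead.append(job)
    return done, dead
```

The user can be told "OK" as soon as the job id is queued; the backlog is then drained later, and only jobs that keep failing need manual attention.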

TheTechArch commented 2 years ago

I believe that some orgs would want to create their PDF view not based on an automatic merging of all pages. There are probably lots of services that want a different "custom view". Should this be covered?

altinnadmin commented 2 years ago

There are probably lots of services that want a different "custom view". Should this be covered?

Yes, if that's a need we could add config, so that for example the Summary page could be used instead, or perhaps in addition. Remember that the user probably has a need to be able to see everything regardless of the need of the app owner.

Ref. bullet 4 above: "Other pages like Summary, Receipt, Payment, etc. can also be converted to PDF in the same way; all we need is a URL."

ivarne commented 2 years ago

Yes, if that's a need we could add config

Or a reasonable extension point where app devs can add their own logic for generating artefacts at appropriate times. Config can come later, when we want to make this possible in Altinn Studio.

ddrune commented 2 years ago

@SandGrainOne, @altinnadmin I know of some good alternatives that offer sandboxing through containerization, file-based ingress/egress, or use through APIs. The well-known and extensive pandoc framework uses xelatex (i.e. XeTeX), but via the LaTeX format and files (UNIX philosophy). Interestingly, pandoc can also use Prince, which is free for non-commercial use, while governmental use carries a fee. They are also governed by Norwegian license law. They offer excellent PDF/UA capabilities as well, and file-based ingress/egress.

altinnadmin commented 2 years ago

Thanks @ddrune. I had a look at the Wikipedia page for Prince:

CSS Grid Layout (css-grid-1) is not yet present in Prince 14.

and

Prince supports most of ECMAScript 5th edition, but not strict mode. Later editions of ECMAScript are largely not supported.

I don't think Prince is able to fulfill our first principle, PDF generation should "just work", given that our apps are modern React applications. A lot has happened since the 5th edition of ECMAScript was released way back in 2009.

SandGrainOne commented 1 year ago

We eventually landed on using the latest version of browserless/chrome from browserless.io. This will be hosted together with the application owner's apps in the app owner specific AKS. I suspect that we will eventually want to replace it with something where we have more direct control, but for now this issue is being closed as completed.