CDLUC3 / dmptool

DMPTool version of the DMPRoadmap codebase
https://dmptool.org
MIT License

Replace wkhtmltopdf for PDF generation #604

Open briri opened 3 weeks ago

briri commented 3 weeks ago

The DMPTool currently uses the wkhtmltopdf C package to handle PDF generation, but it is no longer maintained and is preventing us from migrating that system to Amazon Linux 2023.

We now find that we are unable to upgrade node on the DMPTool system due to its being stuck on Amazon Linux 2. This is preventing us from deploying at the moment ☹️

The new DMPTool rebuild will also require the creation of a service that can transform HTML and/or a JSON object into a PDF document.

Some options:

Our PDFs are pretty simple. We need to support fonts, CSS, tables, lists, and non-UTF-8 charsets.

briri commented 2 weeks ago

Thanks @jupiter007 for sharing the link to this article which has a good overview of the options out there for generating PDF docs.

Our use of TinyMCE means that all of a DMP's questions and answers are stored in the DB as HTML snippets. For example:

<p><strong>Gaining Consent</strong></p>
<p><em>Informed Consent</em><strong><em>:</em></strong> All participants and their guardians will provide informed consent for the collection, preservation, and sharing of data. Consent forms will clearly outline the purpose of the study, the types of data collected, how the data will be used, and the measures taken to protect privacy and confidentiality.</p>

This is what drove the original decision to use WickedPDF (which uses wkhtmltopdf under the hood ... the basis of our problem). Because we are storing the HTML it is very easy to just send it as is to an HTML -> PDF converter.
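As a rough illustration of "sending it as is", the stored snippets only need to be wrapped in a complete HTML page before they go to a converter. The sketch below is an assumption about how that wrapping could look, not existing DMPTool code: `buildDocument`, `escapeHtml`, and the `{ question, answerHtml }` shape are hypothetical names, and it assumes the answer HTML was already sanitized when it was saved.

```javascript
// Hypothetical helper: escapes plain text (titles, question text) for use in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: wraps stored TinyMCE snippets in a full HTML page.
// `sections` is an assumed shape: [{ question: 'plain text', answerHtml: '<p>...</p>' }].
// answerHtml is inserted unchanged, on the assumption it was sanitized on save.
function buildDocument(title, sections, css = '') {
  const body = sections
    .map(({ question, answerHtml }) =>
      `<section><h2>${escapeHtml(question)}</h2>${answerHtml}</section>`)
    .join('\n');

  return [
    '<!DOCTYPE html>',
    '<html lang="en">',
    '<head>',
    '<meta charset="utf-8">', // declare the charset so non-ASCII answers render correctly
    `<title>${escapeHtml(title)}</title>`,
    `<style>${css}</style>`,
    '</head>',
    '<body>',
    body,
    '</body>',
    '</html>',
  ].join('\n');
}
```

The resulting string is what would be handed to whichever HTML -> PDF converter is chosen.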

Here is an analysis of some options available to us:

I am currently leaning towards Puppeteer and will start exploring it in more detail tomorrow. It is JS based and well maintained, so it will be usable in the current DMPTool as well as the rebuild.

HTML -> PDF conversion approach

HTML to PDF converters pose several problems overall: they struggle to provide PDF-specific features like a ToC, headers, and footers, and their output is generally not accessible.

The ones I reviewed are:

There are also options like DocRaptor and Prince but they are paid services and quite expensive. We currently generate around 6,000 PDFs a month.

While this would be the easiest and fastest approach for the current issue, it may not be the best option longer term.

Direct PDF generation

This approach is likely better in the long run. While these libraries can be quite complex and have a bit of a learning curve, the PDFs we generate are very simple. Our PDFs would also potentially benefit from some of the features above that the HTML to PDF options do not support, like headers and footers, and we could produce accessible PDFs.

- Prawn (Ruby) - Core branch was last updated a few months ago, but the last official release was in 2020. Apparently has very limited styling options but looking at the docs it
- PDF Lib (JS) - This one is well maintained and popular. I think the biggest issue will be handling tables in user answers.

Some things that we will need to thoroughly investigate are support for accurately rendering users' answers in their DMPs that contain things like non-UTF-8 characters, tables, lists, etc.

We will have to build code to handle tables properly if we use this approach.
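To give a sense of what that table handling might involve, here is a minimal sketch that flattens a simple HTML table from a user's answer into rows of plain-text cells, which a direct PDF library could then draw cell by cell. `tableToRows` is a hypothetical name, and the regex approach is an assumption that only holds for small, well-formed, non-nested tables; a real implementation would more likely use a proper HTML parser and would also need to decode entities and handle `colspan`/`rowspan`.

```javascript
// Hypothetical helper: converts one simple <table> snippet into an array of rows,
// where each row is an array of plain-text cell strings.
// Assumes well-formed markup with no nested tables; entities are not decoded.
function tableToRows(tableHtml) {
  const rowPattern = /<tr\b[^>]*>([\s\S]*?)<\/tr>/gi;
  const cellPattern = /<t[hd]\b[^>]*>([\s\S]*?)<\/t[hd]>/gi;
  const rows = [];

  for (const rowMatch of tableHtml.matchAll(rowPattern)) {
    const cells = [];
    for (const cellMatch of rowMatch[1].matchAll(cellPattern)) {
      const text = cellMatch[1]
        .replace(/<[^>]+>/g, '') // drop inline tags such as <em> or <strong>
        .replace(/\s+/g, ' ')    // collapse runs of whitespace
        .trim();
      cells.push(text);
    }
    rows.push(cells);
  }
  return rows;
}
```

Note that this discards inline formatting inside cells, which is part of why tables are the hard case for the direct generation approach.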

bofstein commented 2 weeks ago

For the paid option, the latter one provides an academic license of 12 months for $1900. Depending on how much of your (valuable) time creating our own would take, it's possible this cost could be worth it even if we also slowly generate our own within a year. I don't know our budget or cost-sensitivity situation to say if it makes sense, but it seems worth considering if the others would take many days of work and this could be done quickly.
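For a rough sense of scale, the quoted license price can be set against the volume mentioned earlier in the thread (around 6,000 PDFs a month). This assumes the academic license would cover that volume, which has not been confirmed here:

```latex
\frac{\$1900}{12 \text{ months} \times 6000 \text{ PDFs/month}}
  = \frac{\$1900}{72000 \text{ PDFs}}
  \approx \$0.026 \text{ per PDF}
```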

marisastrong commented 2 weeks ago

Thanks for providing that option. I wish procuring software could be quick at UCOP; unfortunately it could be a matter of months before we could actually purchase it due to the procurement process.

We would need to provide some ROI analysis for the cost justification so cost would not necessarily be a blocker.

briri commented 2 weeks ago

Some initial analysis of using Puppeteer:

[Image: Screenshot 2024-06-14 at 8 28 58 AM]

briri commented 2 weeks ago

The Grover gem can be added directly to the Rails app. It uses Puppeteer under the hood.

I think this will be slightly less work to get set up, but it doesn't port to the DMPTool rebuild project. That's likely not a big deal though. I was able to get Puppeteer up and running locally with around 20 lines of code, so the implementation in the new system should be pretty easy.
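For readers unfamiliar with Puppeteer, a script of roughly that size might look like the sketch below. This is not the code referred to above; it is an illustration that assumes `npm install puppeteer` and a Chromium build that Puppeteer can launch. `pdfOptions` and `htmlToPdf` are hypothetical names, and the page format and margins are placeholder values.

```javascript
// Options passed to page.pdf(); the format and margins are illustrative only.
function pdfOptions() {
  return {
    format: 'Letter',
    printBackground: true, // keep CSS background colours in the output
    margin: { top: '20mm', bottom: '20mm', left: '15mm', right: '15mm' },
  };
}

// Renders an HTML string to PDF bytes.
// The second argument defaults to the real puppeteer module; it is a parameter
// only so that a stand-in can be supplied where Chromium is unavailable.
async function htmlToPdf(html, puppeteer = require('puppeteer')) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait for fonts, images, and stylesheets to finish loading before printing.
    await page.setContent(html, { waitUntil: 'networkidle0' });
    return await page.pdf(pdfOptions());
  } finally {
    await browser.close(); // always release the Chromium process
  }
}
```

Usage would be along the lines of `const pdf = await htmlToPdf(html)` followed by writing `pdf` to a file or an HTTP response. Launching a browser per request is the simplest form; at a few thousand PDFs a month it may be worth reusing one browser instance, though that is a design choice still to be evaluated.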

Either way, the bulk of the work will be in untangling the current PDF generation from the codebase and replacing it with Grover. I think it will only take a few days to get sorted out and tested. Some of the timing will be tied to IAS availability to provision the new Amazon Linux 2023 servers for us.

briri commented 2 weeks ago

:/ Grover will be painful. The code changes in Rails aren't terribly significant, but after an hour of trying to get Chromium installed and configured I am concerned about how we will manage it on the servers.

bofstein commented 1 week ago

@mariapraetzellis to add font requirements