datadryad / dryad-product-roadmap

Repository of issues for Dryad project boards
https://github.com/orgs/datadryad/projects

Investigate pdf generation tools #3279

Closed DragosIorgulescu closed 1 month ago

DragosIorgulescu commented 7 months ago

Based on slack discussion:

Based on my attempts today to install and run wkhtmltopdf on our servers, it appears that it is not supported on Amazon Linux 2 because of its libpng15 dependency, a package that is no longer supported. My investigation got me to the point of attempting to install libpng from source, and the only source I could find was http://www.libpng.org/pub/png/libpng.html, which eventually takes you to a SourceForge URL: https://sourceforge.net/projects/libpng/files/libpng15/1.5.30/

This presents 2 issues:

  • first, SourceForge is not the most trustworthy source IMO, and they don't even give you a direct download URL for a package, so getting the tarball to build from source is a major pain
  • second, we would be installing a library that is known to have security vulnerabilities and has been unsupported since 2017

I can try and bang on this a bit more, but I am more and more convinced that switching to a lambda with wkhtmltopdf set up would be the way to go.
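To make the lambda idea concrete, here is a minimal sketch of what a Ruby Lambda handler wrapping wkhtmltopdf could look like. The binary path, the event shape (`{ "html" => "..." }`), and the response format are all assumptions for illustration, not something we have decided on. The `converter:` argument is only there so the handler can be exercised without the binary installed.

```ruby
require 'json'
require 'open3'

# Hypothetical location; it depends on how the binary is packaged (e.g. in a Lambda layer).
WKHTMLTOPDF_BIN = ENV.fetch('WKHTMLTOPDF_BIN', '/opt/bin/wkhtmltopdf')

# Pipes an HTML string through wkhtmltopdf and returns the PDF bytes.
# Passing "-" as both input and output makes wkhtmltopdf use stdin/stdout.
def wkhtmltopdf(html)
  pdf, err, status = Open3.capture3(
    WKHTMLTOPDF_BIN, '--quiet', '-', '-',
    stdin_data: html, binmode: true
  )
  raise "wkhtmltopdf failed: #{err}" unless status.success?

  pdf
end

# Lambda entry point. The event shape is an assumption: { "html" => "<html>...</html>" }.
def handler(event:, context:, converter: method(:wkhtmltopdf))
  pdf = converter.call(event.fetch('html'))
  # PDF bytes are binary, so return them base64-encoded (no line breaks) inside JSON.
  { statusCode: 200, body: JSON.generate(pdf_base64: [pdf].pack('m0')) }
end
```

The caller (for example one of the rake tasks) would decode `pdf_base64` and write the bytes to disk or S3; that part is left out because it depends on how the reports are delivered.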

Based on a search in the codebase I see 3 rake tasks we need this for:

  • identifiers:deferred_journal_reports here
  • identifiers:tiered_journal_reports here
  • identifiers:tiered_tenant_reports here
  • and of course the new GREI monthly report for which we have a PR

All of these make use of the write_deferred_sponsor_summary and write_tiered_sponsor_summary methods.

I would need to better understand the use cases here, and whether there are other usages I am not aware of, but I am sure we can at least handle the GREI report via lambda. I am not sure about the other 3 above, as it seems their PDFs are generated as part of CSV generation. I would need more context around them.

Other notes

Regarding alternatives, Prawn is a good one, but last I checked it does not work the same way, by converting HTML pages to PDFs. Creating a PDF with it is a bit more involved, since it requires using their DSL (https://dryad.slack.com/archives/D069KA2E814/p1712166223708789)

I personally kind of like being able to generate PDFs from HTML templates. Before exploring alternatives, I would still try to use wkhtmltopdf, just outside the Amazon server.
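For reference, the HTML-template approach boils down to rendering a template to an HTML string and handing that string to whichever converter we end up with. A minimal sketch using only Ruby's standard `erb` library; the template and the `render_report_html` name are made up for illustration and are not the actual report templates:

```ruby
require 'erb'

# Hypothetical template; the real report templates live in the app.
REPORT_TEMPLATE = <<~HTML
  <html>
    <body>
      <h1><%= ERB::Util.html_escape(title) %></h1>
      <ul>
      <% rows.each do |row| %>
        <li><%= ERB::Util.html_escape(row) %></li>
      <% end %>
      </ul>
    </body>
  </html>
HTML

# Renders the template with the given values and returns an HTML string.
# The result can be passed to any HTML-to-PDF tool (wkhtmltopdf, WeasyPrint, ...).
def render_report_html(title:, rows:)
  ERB.new(REPORT_TEMPLATE).result_with_hash(title: title, rows: rows)
end

html = render_report_html(title: 'Sponsor summary', rows: ['Journal A', 'Journal B'])
```

The point is that the template stays the same regardless of the converter, which is what makes swapping the PDF backend cheaper than rewriting the reports in a PDF DSL.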

We may also try to install an older version of the binary gem; I've come across this comment as well

Other alternatives to look into for html to PDF generation: https://wkhtmltopdf.org/status.html#recommendations -- https://doc.courtbouillon.org/weasyprint (seems to be actively maintained, no ruby gem though, but should be usable as an independent lib)
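Since WeasyPrint has no Ruby gem, using it "as an independent lib" would most likely mean shelling out to its command-line tool, which takes an input and an output path. A rough sketch, assuming `weasyprint` is installed and on the PATH; the helper names are hypothetical:

```ruby
require 'open3'

# Builds the argument list for WeasyPrint's CLI: `weasyprint INPUT OUTPUT`.
# Kept separate so the command can be inspected without running it.
def weasyprint_command(html_path, pdf_path, bin: 'weasyprint')
  [bin, html_path, pdf_path]
end

# Converts an HTML file to a PDF file and returns the PDF path.
# Raises if the command exits non-zero (Errno::ENOENT if it is not installed).
def html_file_to_pdf(html_path, pdf_path)
  _out, err, status = Open3.capture3(*weasyprint_command(html_path, pdf_path))
  raise "weasyprint failed: #{err}" unless status.success?

  pdf_path
end
```

Passing the arguments as an array (rather than one interpolated string) avoids going through a shell, so file names with spaces or special characters are not an issue.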

ryscher commented 2 months ago

PDFs were previously generated by the WickedPDF gem, but it no longer works on our OS.

Code that uses Wicked PDF is in lib/tasks/stash_engine_tasks.rake, for example in write_deferred_sponsor_summary