arachnys / athenapdf

Drop-in replacement for wkhtmltopdf built on Go, Electron and Docker
MIT License

Question: Processing extremely large HTML documents #125

Open seeroush opened 7 years ago

seeroush commented 7 years ago

TL;DR - Can AthenaPDF be tuned to handle files 50+ MB in size? If not, is it a planned feature for a future release?

I recently started using AthenaPDF (Weaver) and am loving it! However, my application generates incredibly large HTML documents (50+ MB). Even opening these files in a browser on my machine takes 10-15 minutes. I can only imagine what our poor deployed instance of Weaver has to go through when it works with files this big.

I would adjust tuning parameters for memory/CPU; however, I think it is probably unrealistic for a headless browser to open a file of this size in a single window. With that said, are there any plans for handling larger documents in chunks, and possibly merging the resulting files back into a single PDF document?

If anyone has tips, tricks, or any lessons learned when dealing with large files, I would appreciate the advice. I am open to potentially forking AthenaPDF to build out the feature, but wanted to be sure this wasn't already something coming up in the latest version.

Thanks!

MrSaints commented 7 years ago

Hi @Roosh513,

That's a great question. It depends on the nature of the file, but to be honest, I think most browser-based converters might struggle in this respect.

Chunking is actually a good way to handle a case like this, that is, chunking on the HTML side. The converter itself will not be able to figure out what it should chunk, but as the author of the documents, I am sure you can tune your application to do this (i.e. generate multiple HTML files to convert, and then merge the output together).
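For illustration, a minimal sketch of that chunk-and-merge flow in Python. The endpoint shape (`GET /convert?auth=...&url=...`) and default `arachnys-weaver` key follow the Weaver README, but the host, chunk URLs, and file names are all assumptions, the chunk files are assumed to be served somewhere Weaver can reach, and `pypdf` stands in for whatever merge tool you prefer:

```python
import requests
from pypdf import PdfWriter

# Hypothetical deployment details: adjust to your own setup.
WEAVER = "http://localhost:8080/convert"
AUTH_KEY = "arachnys-weaver"  # Weaver's default auth key
CHUNK_URLS = [f"http://my-app.internal/report/chunk-{i}.html" for i in range(10)]

merged = PdfWriter()
for i, url in enumerate(CHUNK_URLS):
    # Convert one HTML chunk into its own PDF via Weaver.
    resp = requests.get(WEAVER, params={"auth": AUTH_KEY, "url": url}, timeout=300)
    resp.raise_for_status()
    part = f"chunk-{i}.pdf"
    with open(part, "wb") as f:
        f.write(resp.content)
    merged.append(part)  # concatenate into a single document

with open("report.pdf", "wb") as f:
    merged.write(f)
```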

Otherwise, I am more than open to any PRs / implementation discussion.

I was actually considering a batch conversion feature (that is, converting multiple HTML inputs), but I felt it was slightly out of scope. And baking in other PDF editing features like concatenation, resizing, scaling, etc., would stretch the goals of this project. It is an interesting thought though...

Hope that helps.

seeroush commented 7 years ago

That is helpful. I know the paradigm this project tries to follow is "do one thing, and do it well". I realize adding the option to split/merge documents may go beyond the scope of what AthenaPDF should be capable of doing. If you don't mind, I'd like to keep this thread open for a bit as an open conversation. I like the flexibility of building PDFs easily with HTML/CSS, along with the useful bonus of out-of-the-box UTF-8 document support. The tradeoff is speed, which suffers by a large margin for really large documents.

At this point, I'm trying to manage complexity. Splitting a large document into 100 chunks creates 100 new possible points of failure. On one hand, it's nice to be able to scale horizontally and speed up PDF generation with parallel requests (a sketch of this follows below). On the other, it requires much more coordination, state management, and hardware. I would be curious to hear others' experiences dealing with this, and would love to hear stories of how AthenaPDF was used to solve the problem.
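To make the parallel-requests idea concrete, a hedged sketch: fan the chunks out across worker threads, with simple per-chunk retries to contain those extra failure points. The endpoint, key, and URLs are the same hypothetical ones as above:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

WEAVER = "http://localhost:8080/convert"  # hypothetical Weaver deployment
AUTH_KEY = "arachnys-weaver"

def convert(url: str, attempts: int = 3) -> bytes:
    """Convert one chunk, retrying so a transient failure doesn't sink the whole job."""
    last_error = None
    for _ in range(attempts):
        try:
            resp = requests.get(WEAVER, params={"auth": AUTH_KEY, "url": url}, timeout=300)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException as exc:
            last_error = exc
    raise last_error

chunk_urls = [f"http://my-app.internal/report/chunk-{i}.html" for i in range(100)]

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() yields results in input order, so the later merge stays deterministic.
    pdfs = list(pool.map(convert, chunk_urls))
```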

MrSaints commented 7 years ago

@Roosh513 Yes, no worries, I can keep it open.

At Arachnys, we are using HTML/CSS/JS for report creation as it allows us to take advantage of simple templating (e.g. in Django), and it is something most, if not all, developers are familiar with.

There is definitely a tradeoff in terms of generation speed, but it is one we strongly feel is worth making, since it allows us to create and edit the output of PDF documents very easily (compared to tinkering with offsets in PostScript to get a decent layout).

react-pdf is also something we are looking into, but for now, generating PDFs through HTML gives us the most agility in report creation, as you have already highlighted.

I understand that splitting a document into N pages will create N points of failure. But again, beyond the _"do one thing, and do it well"_ philosophy, the service itself will have no understanding of the best way to do this, and the end-user should be the one deciding this.

In terms of coordination, it feels like you could use something like Conductor, or perhaps create your own program backed by a message queue, so that you can retry should any one document fail.

We actually have a rather complex HTML -> PDF pipeline as well, and it is mostly done in a Python Celery task.
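For anyone following along, a minimal sketch of what such a queue-backed pipeline could look like in Celery. The task names, broker URL, and Weaver endpoint are all assumptions, not our actual pipeline; `autoretry_for` handles per-chunk retries, and Celery's `chord` runs the conversions in parallel and only fires the merge callback once every one has succeeded:

```python
import requests
from celery import Celery, chord
from pypdf import PdfWriter

app = Celery("pdfs", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

WEAVER = "http://localhost:8080/convert"  # hypothetical deployment
AUTH_KEY = "arachnys-weaver"

@app.task(autoretry_for=(requests.RequestException,),
          retry_backoff=True, max_retries=5)
def convert_chunk(url, out_path):
    # One chunk per task: a failure is retried in isolation.
    resp = requests.get(WEAVER, params={"auth": AUTH_KEY, "url": url}, timeout=300)
    resp.raise_for_status()
    with open(out_path, "wb") as f:  # assumes workers share a filesystem
        f.write(resp.content)
    return out_path

@app.task
def merge_chunks(paths, out_path):
    # Chord callback: runs only after every conversion task has succeeded.
    writer = PdfWriter()
    for p in paths:
        writer.append(p)
    with open(out_path, "wb") as f:
        writer.write(f)
    return out_path

def build_report(chunk_urls):
    conversions = [convert_chunk.s(url, f"chunk-{i}.pdf")
                   for i, url in enumerate(chunk_urls)]
    return chord(conversions)(merge_chunks.s("report.pdf"))
```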

seeroush commented 7 years ago

Conductor actually looks like a fantastic option and a great idea, as it already supports orchestrating simple HTTP POST requests as tasks. It would just be up to me to define the JOIN task to stitch them back together. I will explore this idea a bit, since it looks like Conductor already does what I want and saves me from having to manage task state myself. It's dockerized to boot!
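As a rough illustration of where I'm headed, a workflow definition along these lines might use Conductor's FORK_JOIN/JOIN task types with its built-in HTTP task, registered via the metadata API. The names, URIs, and the two-chunk fork below are all hypothetical, and in practice the conversion step would need to persist each PDF somewhere the merge worker can reach, since an HTTP task only captures the response in its task output; the Conductor docs have the authoritative schema:

```python
import requests

# Hypothetical workflow definition: a static fork converts two chunks in
# parallel via Conductor's HTTP system task, a JOIN waits on both, and a
# custom "merge_pdfs" worker task stitches the results back together.
workflow = {
    "name": "chunked_pdf_report",
    "version": 1,
    "schemaVersion": 2,
    "tasks": [
        {
            "name": "fork_conversions",
            "taskReferenceName": "fork_conversions",
            "type": "FORK_JOIN",
            "forkTasks": [
                [{
                    "name": "convert_chunk",
                    "taskReferenceName": "convert_0",
                    "type": "HTTP",
                    "inputParameters": {"http_request": {
                        "uri": "http://weaver:8080/convert?auth=arachnys-weaver&url=http://my-app/chunk-0.html",
                        "method": "GET",
                    }},
                }],
                [{
                    "name": "convert_chunk",
                    "taskReferenceName": "convert_1",
                    "type": "HTTP",
                    "inputParameters": {"http_request": {
                        "uri": "http://weaver:8080/convert?auth=arachnys-weaver&url=http://my-app/chunk-1.html",
                        "method": "GET",
                    }},
                }],
            ],
        },
        {
            "name": "join_conversions",
            "taskReferenceName": "join_conversions",
            "type": "JOIN",
            "joinOn": ["convert_0", "convert_1"],
        },
        {
            # A worker task implemented separately to merge the PDFs.
            "name": "merge_pdfs",
            "taskReferenceName": "merge_pdfs",
            "type": "SIMPLE",
        },
    ],
}

# Register the definition with a (hypothetical) Conductor server.
requests.post("http://conductor:8080/api/metadata/workflow", json=workflow)
```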