boazsegev / combine_pdf

A Pure ruby library to merge PDF files, number pages and maybe more...
MIT License

Interruption on to_pdf method after 60 seconds #178

Closed guilhermemaranhao closed 3 years ago

guilhermemaranhao commented 3 years ago

Hi everybody,

We're facing a timeout problem running combine_pdf in production. Our PDF document is built with PDFKit as follows:

1. Some pages have an HTML template populated with data.
2. Other pages have images of PDF documents that are uploaded in our application. We have to insert the images because we must add a watermark to these pages. To do that, we create a blank page with PDFKit so that we can insert the PDF image and the watermark.

After that, we push the pages into the CombinePDF object.

The last step is to invoke the to_pdf method.

combine_pdf.to_pdf

At this point, in production, for huge documents (approximately 3,000 pages, or 50 MB), the execution is interrupted after 60 seconds. However, this does not happen in development mode: there, it takes all the time it needs to build the PDF document and is never interrupted.
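To pin down where the time goes, a small timing wrapper around the call can help (a sketch using Ruby's stdlib Benchmark; the `timed` helper is hypothetical, and `combine_pdf` stands for the object built in the steps above):

```ruby
require 'benchmark'

# Hypothetical helper: runs a block, logs how long it took, and
# returns the block's result unchanged.
def timed(label)
  result = nil
  elapsed = Benchmark.realtime { result = yield }
  warn format('%s took %.2fs', label, elapsed)
  result
end

# Usage against the CombinePDF object described above:
#   pdf_bytes = timed('to_pdf') { combine_pdf.to_pdf }
```

Logging the elapsed time in both environments would show whether production is genuinely slower or simply being cut off at 60 seconds.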

We think it's not related to the application timeout setting, as it is set to more than 5 minutes.

We suspect it has something to do with a timeout setting in the combine_pdf gem. Is that the case?

Can anybody give us a hint?

Thanks,

Guilherme

boazsegev commented 3 years ago

Hi Guilherme and thank you for opening this issue.

I haven't reviewed the full code in 2 years or so, but I'm fairly certain I never coded a timeout in the CombinePDF gem. This would be out of scope for the gem and something easily implemented using an external codebase.

Besides, I envisioned CombinePDF working as a side job (i.e., using iodine or Sidekiq) whose result would be pushed to the client when ready (i.e., using WebSockets, SSE, or long-polling).

Huge documents would always be slower, this is unavoidable. However, if I could view some of your code perhaps I could suggest optimizations since I do recall most of the internal logic and performance factors in the CombinePDF code base... but without more information I'm somewhat blind to whatever's going on.

From your description it seems possible that the development machine has exclusive access to the CPU while the production machine is shared, degrading performance on the production machine. If this is the case, you might want to consider a different container.

Another issue might come up if memory is swapped to disk when switching between shared containers, which would cause a huge drop in performance in production (vs. development).

IMHO, since your code obviously requires a long time to process huge PDF files, I would move all files over 1 MB to a side job and implement a polling / push client path.
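A minimal sketch of that split, using Resque-style job conventions (the class, queue name, and paths are hypothetical; `CombinePDF.load`, `<<`, and `to_pdf` follow the gem's documented API):

```ruby
# Hypothetical background job: the slow merge runs in a worker,
# not in the web request.
class MergePdfJob
  @queue = :pdf_merge

  # paths: PDF files to merge; out_path: where the result is written.
  def self.perform(paths, out_path)
    pdf = CombinePDF.new
    paths.each { |path| pdf << CombinePDF.load(path) }
    File.binwrite(out_path, pdf.to_pdf)
  end
end

# The web request only enqueues and returns immediately:
#   Resque.enqueue(MergePdfJob, uploaded_paths, result_path)
# The client then polls a status endpoint (or listens on a
# WebSocket / SSE channel) until the file is ready.
```

This keeps the HTTP worker free regardless of how long `to_pdf` takes, so no web-server timeout can interrupt the merge.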

Good luck! Bo.

guilhermemaranhao commented 3 years ago

An example of this situation in development mode:

It takes only 40 seconds to process 3,076 pages, but the to_pdf call takes 5:30. In production, it's interrupted after 60 seconds.

Thanks

guilhermemaranhao commented 3 years ago

Great, @boazsegev !

Thanks a lot. I think the point will be to do it in a side-job solution. We already use resque for that.

I'll also share some of our code with you; maybe there is something we can improve in it.

Thank you so much again

Regards!

boazsegev commented 3 years ago

> It takes only 40 seconds to process 3,076 pages, but the to_pdf call takes 5:30. In production, it's interrupted after 60 seconds.

The other methods don't do much except insert small updates into the data structure; to_pdf is where all the actual work is performed. Due to Ruby's design, this method requires a lot of reallocations (as the String object grows). If it were C, I would probably calculate the length of the string before allocating memory, but with Ruby I'm not sure how I could reserve a buffer (nor how I would calculate the required buffer size, though that's solvable).
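The reallocation cost can be illustrated in plain Ruby, independently of the gem (a rough sketch with stand-in fragments, not CombinePDF's internals):

```ruby
require 'benchmark'

parts = Array.new(10_000) { 'x' * 100 }  # stand-in PDF fragments

# Chunk-by-chunk append: the String's internal buffer grows as its
# capacity is exceeded, so Ruby reallocates and copies repeatedly.
append_time = Benchmark.realtime do
  out = +''
  parts.each { |part| out << part }
end

# Array#join can compute the total length up front and allocate once.
join_time = Benchmark.realtime { parts.join }
```

For strings of this size the difference is small, but the single-allocation pattern scales better as the output grows toward tens of megabytes.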

guilhermemaranhao commented 3 years ago

Thank you. That might be the solution.

boazsegev commented 3 years ago

Perfect :)

Let me know if you need something else.

P.S.

I benchmarked a version of to_pdf that uses a preallocated String buffer (requires Ruby >= 2.4), but it didn't improve performance. It seems that this optimization is already performed by the Array#join method under the hood. So if you have Ruby >= 2.4, you might have better performance.
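For reference, the Ruby >= 2.4 preallocation looks like this (a sketch; the fragments are stand-ins, not CombinePDF output):

```ruby
parts = ['%PDF-1.7', 'obj ...', '%%EOF']  # stand-in fragments

# Reserve the full buffer up front (String.new's capacity: keyword,
# Ruby >= 2.4), avoiding growth reallocations during the appends.
total = parts.sum(&:bytesize)
buf = String.new(capacity: total)
parts.each { |fragment| buf << fragment }

# Array#join performs the same single allocation internally, which
# is why the explicit buffer didn't benchmark any faster.
joined = parts.join
```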

guilhermemaranhao commented 3 years ago

Hi @boazsegev ,

Do you have an alternative for this issue? A C or python library? Or another Ruby gem?

Thanks again!

boazsegev commented 3 years ago

Hi @guilhermemaranhao ,

Sorry, I don't have another solution. You're handling huge files, and they take a lot of CPU time and resources. The only thing I can think of is a delayed job. Not necessarily with a remote worker, this can be done in the same process, but the HTTP layer isn't a good fit for long-running tasks. It's much better to poll or use WebSockets to push the data.

You can use WebSockets with something similar to this example.

Good luck!

guilhermemaranhao commented 3 years ago

Thank you, @boazsegev