Studiosity / grover

A Ruby gem to transform HTML into PDFs, PNGs or JPEGs using Google Puppeteer/Chromium
MIT License

CPU/memory usage #60

Closed by vitobotta 8 months ago

vitobotta commented 4 years ago

Hi! Thanks a lot for this gem! It's working beautifully.

I'm using it in a Docker container in Kubernetes. If multiple jobs are processed at the same time (a Sidekiq worker with N threads), does the gem start that many browsers to take the screenshots?

I am concerned about resource usage.

Thanks in advance for any clarification 🙂
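
For context, here is a minimal sketch of the kind of setup described above; `ScreenshotJob` is a hypothetical worker, and each render typically launches its own browser process (as discussed below), so a worker with N threads can mean up to N concurrent browsers.

```ruby
# Hypothetical Sidekiq worker: each perform call runs Grover, and each
# Grover render launches a Chromium process via Puppeteer.
require 'sidekiq'
require 'grover'

class ScreenshotJob
  include Sidekiq::Worker

  def perform(url)
    png = Grover.new(url).to_png
    File.binwrite("/tmp/screenshot-#{jid}.png", png)
  end
end
```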

abrom commented 4 years ago

Hi @vitobotta, the simple answer is likely yes.

The gem simply calls out to the Puppeteer library, which in turn launches whichever browser you've specified (by default it launches Chromium). In terms of browser process management, it comes down to your choice of browser, how that browser is launched, what flags are specified and its default behaviours. Bit of a non-answer I know... but it depends!

I can say that by default v3.x of Puppeteer includes the --disable-backgrounding-occluded-windows and --disable-renderer-backgrounding options (that may also be the case for older versions). AFAIK those prevent any of the renderer processes from hanging around, so everything is cleaned up on exit.
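
As a reference point, extra Chromium flags can generally be passed through Grover's global configuration (the launch_args option) and are forwarded to Puppeteer's launch call; a minimal sketch, with purely illustrative flags:

```ruby
# config/initializers/grover.rb
Grover.configure do |config|
  config.options = {
    launch_args: [
      '--disable-dev-shm-usage', # illustrative: avoids /dev/shm pressure in containers
      '--no-sandbox'             # illustrative: often required inside Docker images
    ]
  }
end
```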

I can see there is an option provided by Puppeteer that allows these default options to be removed; however, Grover doesn't currently provide a mechanism for that option to be passed through. See https://github.com/puppeteer/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions => ignoreDefaultArgs

From here, I'd suggest you try to do some benchmarking with your setup. There are a lot of factors specific to your use case and setup that will affect memory usage, so there's not a lot I can do there. You're welcome to fork the project and add support for the ignoreDefaultArgs parameter. It should be pretty straightforward if you take the executablePath option as a template. I can't say how disabling those options would make Puppeteer/Chromium behave, so you're likely going to have to do some testing! I'd welcome a PR if you find that option helps your situation.
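
For illustration, this is roughly what the existing executable_path pass-through looks like from the Grover side; a pass-through for Puppeteer's ignoreDefaultArgs would presumably take the same shape. The path shown is only an example.

```ruby
# config/initializers/grover.rb
Grover.configure do |config|
  config.options = {
    executable_path: '/usr/bin/chromium-browser' # example path, environment-specific
  }
end
```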

vitobotta commented 4 years ago

Hi @abrom, and thanks for your reply :) I switched to browserless.io because during my testing I got high memory usage alerts a few times, and that was with just me testing. My Kubernetes cluster is 3 nodes with 8GB each and I have several things installed; with just me simulating a few simultaneous users triggering screenshots, I had problems with some pods. The cluster resources are otherwise fine for now. For that reason I decided to outsource this, since I don't want to risk either spending more by adding capacity, or DoSing myself with a few users triggering this functionality :)

phikes commented 4 years ago

We also ran into resource issues with Grover. I'd actually love to work on something that might help with that:

I would like to have Grover leverage puppeteer-cluster. It would need some more involved changes (e.g. Grover would need to start the cluster upon startup). Before I start working on that, I just wanted to check whether it's something you would consider merging, @abrom :)

Cheers and thanks for the gem!

abrom commented 4 years ago

It's difficult to comment without knowing how you're actually using the gem. I usually create a few thousand PDFs a day using a mix of the middleware and calling the library directly, and haven't seen any memory issues. It could depend on the type of content you're loading too.

In any case, Grover is a thin wrapper around Puppeteer, which for the most part is simply a wrapper around whatever browser driver you're using (by default Chromium). If you are getting memory leaks from the browser, then directing your questions there would likely get better results.

In terms of using a pool of workers, it feels like that solves a somewhat niche problem. For example, it would likely not solve the issue raised by @vitobotta, given the calls were coming out of a Sidekiq worker. It would really only work in that sort of scenario if there were some sort of singleton service capable of orchestrating the various async requests. Happy to talk it through, but it would seem a better fit as a fork.

I would also question whether using a worker pool would necessarily solve the problem. Chromium already has worker process management cooked in, so if the problem is that your requests are somehow bloating Chromium, then it would seem logical that the same issue would persist no matter how the requests were being made?

phikes commented 4 years ago

You raise some very good points, @abrom, and you are right, it would be kind of niche in our case.

The problem we are seeing is that we run into the maximum number of threads Heroku allows. The way I see it, we are starting a lot of processes when generating renders. I thought it would be good to use the cluster to restrict and queue up "tasks" for Puppeteer.

I will look into solving this at the web server level.
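
As an aside, and distinct from the puppeteer-cluster approach proposed above: renders can also be restricted and queued on the Ruby side with a process-wide pool. A rough sketch using the connection_pool gem, with an arbitrary limit of two concurrent renders:

```ruby
require 'connection_pool'
require 'grover'

# Cap concurrent renders per Ruby process; the pool entries are just tokens,
# since Grover itself is instantiated fresh for each render.
RENDER_SLOTS = ConnectionPool.new(size: 2, timeout: 30) { Object.new }

def render_pdf(html)
  RENDER_SLOTS.with do |_slot|
    Grover.new(html, format: 'A4').to_pdf
  end
end
```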

abrom commented 8 months ago

I'm going to close this as there hasn't been any movement here for a while. I will suggest that the browser_ws_endpoint option might be of use in this case though. It allows Grover to connect to a remote Chromium instance, and it also closes only the page rather than the browser, which might provide a performance bump.
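
A minimal sketch of that option; the WebSocket endpoint shown is only an example (e.g. a remote browserless/Chromium service):

```ruby
# config/initializers/grover.rb
Grover.configure do |config|
  config.options = {
    # Reuse an already running browser over the DevTools WebSocket
    # instead of launching a new Chromium for every render.
    browser_ws_endpoint: 'ws://chrome.internal:3000' # example endpoint
  }
end
```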