arachnys / athenapdf

Drop-in replacement for wkhtmltopdf built on Go, Electron and Docker
MIT License

Problem with 50 concurrent accesses to the Weaver Queue #106

Open FrankTheDevop opened 7 years ago

FrankTheDevop commented 7 years ago

Purpose

The purpose of this issue is to show you my experiments with athenapdf in a weaver container and to ask for help in understanding what I am doing wrong.

Thanks

Before I start with the issue I am having, I want to thank you for creating such a great tool. It saves a lot of time and effort. Great job, guys and gals!

Issue

First I describe the task at hand, then my setup with a link to a repository so you can reproduce the problem, followed by the local and remote situations including the errors, and finally the alternative I tried.

According to issue #95 ((weaver) Renderer process crashes when trying to convert some web pages), this is a known problem with Electron.

Next I tried starting multiple parallel instances of this service behind haproxy (the stackfile proxy.yml). It made no difference. But when I use the big.yml stackfile it works: it can handle roughly 120% of the number of started instances as parallel sessions. I experimented with up to 8 entries for the pdfservice in the stackfile, set the number of instances per service to 6, and deployed on 6 nodes. This way I could send 58 almost concurrent calls, and one PDF after another got converted and downloaded.

Conclusion

Because of the garbage collection problem with Electron, the weaver queue isn't working for me with a small number of instances. It only works reliably when I start multiple services with multiple definitions on multiple machines and send at most 20% more calls than there are started instances, but that comes at a high cost.

Please help me understand how I can tune my setup to reduce the number of services that need to be started. We need a working queue and a reliable weaver that can accept 50 concurrent conversion requests and return the PDFs one after another.

Kind Regards, Frank

MrSaints commented 7 years ago

With your current settings:

WEAVER_MAX_CONVERSION_QUEUE=50
WEAVER_MAX_WORKERS=1

You'll only be able to convert one document at a time, with a queue size of 50.

In hindsight, the queue size doesn't do much because all web requests are handled concurrently via goroutines. That is, even without a queue, web requests will not block each other. So it's somewhat of an unnecessary feature (at least I can't remember its usefulness anymore, and that's not a good sign). Even if the queue is overflowing, requests towards the tail end of the queue will simply time out, either on the client or at the load balancer level. Thus, the only value that requires tuning is WEAVER_MAX_WORKERS.

That being said, we've had no problem converting ~150 documents in a span of 10s across 3 machines. Our machines aren't that powerful (they're running on t2.small on AWS).

You said you tried increasing the memory_limit on the container. This is good, especially for dealing with Electron crashes, but it doesn't increase the throughput. I suggest raising WEAVER_MAX_WORKERS to about 5-10, based on your specs.

You should probably benchmark how long it takes on average for each document (out of your 50) to be converted. If it takes 5 seconds on average, and you have WEAVER_MAX_WORKERS set to 1, that means you will only make one conversion every 5 seconds. Set that to 2, and you'll have 2 conversions every 5 seconds. Keep bumping it up until you start to notice resource constraints. At that point you have an option of either scaling horizontally (more machines) or vertically (larger machines).
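
To make that concrete, here is a minimal sketch of how this tuning could look in a docker-compose / stackfile service definition. The service name, memory limit and worker count are illustrative assumptions, not recommendations:

weaver:
  image: arachnysdocker/athenapdf-service
  mem_limit: 3g                        # illustrative; match your machine's specs
  environment:
    - WEAVER_MAX_CONVERSION_QUEUE=50
    - WEAVER_MAX_WORKERS=5             # raise gradually until you hit resource limits
  ports:
    - "8080:8080"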

You might also find it useful to mount /dev/shm.

Let me know if that helps.

FrankTheDevop commented 7 years ago

Hey MrSaints,

thank you for your fast response, and sorry for the late reply. This is what I figured out based on your suggestions.

Setup

Result

Sending 50 requests with a delay of 100 ms between each results in 10% of the documents failing. Increasing the --shm-size parameter further to 2560m didn't change this.

Error

As before I receive this error:

Conclusion

With a single instance I am unable to achieve reliable PDF conversion for 50 concurrent requests on a machine with 1 CPU and 3 GB of memory.

Further Ideas

Starting two or three instances together and letting a proxy round-robin between them may solve the problem. Yet I am unable to increase the /dev/shm size while using docker-compose locally or a stackfile on Docker Cloud. How do you do it on your AWS t2 instances? Do you manage them through a service like Docker Cloud, or do you do it manually with docker run?

Kind regards, Frank

MrSaints commented 7 years ago

Those errors can be safely ignored. They don't seem relevant to your problem.

We are managing the instances via AWS EC2 Container Service (ECS). It should be fairly trivial to get a similar setup with Google Cloud (Kubernetes) or even Docker Cloud since the service is entirely stateless.

Given that you're trying to handle 50 conversions (probably under a minute) reliably, I'd suggest at least load balancing the conversions across 2 instances. I believe it is better to have slightly more resources than needed (redundancy), rather than running on the brink.
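
A rough, untested sketch of what that could look like as a Docker Cloud style stackfile, with two weaver services behind a round-robin proxy (service names, worker counts and the dockercloud/haproxy image are placeholders; any HTTP load balancer will do):

pdf1:
  image: arachnysdocker/athenapdf-service
  environment:
    - WEAVER_MAX_WORKERS=5
pdf2:
  image: arachnysdocker/athenapdf-service
  environment:
    - WEAVER_MAX_WORKERS=5
lb:
  image: dockercloud/haproxy           # placeholder round-robin proxy
  links:
    - pdf1
    - pdf2
  ports:
    - "80:80"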

FrankTheDevop commented 7 years ago

Hey MrSaints,

Benchmark

It heavily depended on the number of Workers.

The time is measured from the curl call to weaver until the PDF is downloaded locally. One thing I found interesting is that the time needed to fulfil one job rises with the number of jobs already done.

Mounting /dev/shm

Locally I used your container and set shm-size via a docker run parameter, so it was mounted with up to 2560 MB. Yet I don't know if that is what you wanted to know. Remounting inside the container raises a permission denied error.

10% Errors

The resulting files mostly contained the JSON response from weaver with a 500 Internal Server Error. Two PDFs differed in size from the others; they didn't load the same amount of content from the target website http://www.heise.de.

AWS EC2 Container Service

I am not familiar with it. How do you mount /dev/shm or specify its size there? Is it a parameter for the container? Or do you issue docker run commands there?

2 Instances

I agree that load balancing across 2 instances makes sense. I plan a buffer of 20%, so I am aiming for 60 concurrent conversion requests. Currently I am failing to increase /dev/shm when using docker-compose locally or a stackfile on Docker Cloud.

Kind regards, Frank

MrSaints commented 7 years ago

10 Workers took on average 30-60 seconds

Is that per document / URL? If so, that is relatively high latency.

that the time needed for fulfilling one jobs raises with the number of jobs done.

The reason for this observation is that the later jobs are at the back of the queue. It is not necessarily because one job raises the time needed to process the next job. That being said, if the queue is being occupied by slow conversions, they will raise the time (from request start to request end) for subsequent conversions.

If time is not a constraint, I would suggest raising the timeout (both on Weaver, and on the load balancer / reverse proxy in front of the microservice). If it is a constraint, you will have to increase the number of workers so that the queue can be cleared quickly. But, the latter entails resolving bottlenecks at a resource-level (increase your machine's specs). Alternatively, you can scale horizontally (another way of raising the workers).
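
If you go down the timeout route, here is a sketch of the weaver side of it. WEAVER_WORKER_TIMEOUT also appears in the ECS task definition later in this thread; the value of 90 is only an example, and the proxy in front of the service needs its timeout raised separately:

weaver:
  image: arachnysdocker/athenapdf-service
  environment:
    - WEAVER_MAX_WORKERS=10
    - WEAVER_WORKER_TIMEOUT=90         # example value; keep the haproxy timeout above this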

The resulting files contained mostly the json response from weaver with 500 Internal Server error.

Considering that you are receiving 500 errors, I do not think they are timeout related. It might be worth investigating further as to why you are receiving the 500s. Try enabling GIN_MODE=debug, and perhaps setting the service up with statsd for metrics. If it is too much work, just start with the former, and ignore the latter. Though, I think you should also be tracking metrics for your machine (i.e. CPU, and RAM usage).
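
As a sketch, those could be switched on via environment variables in the compose/stack file; STATSD_ADDRESS is the variable tried later in this thread, and the host/port value here is purely an example:

weaver:
  image: arachnysdocker/athenapdf-service
  environment:
    - GIN_MODE=debug                   # verbose request logging from Gin
    - STATSD_ADDRESS=statsd:8125       # example host:port of your statsd instance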

Currently I fail increasing /dev/shm when using docker-compose locally or a stackfile on Docker Cloud.

If you're using Docker Compose, you can mount /dev/shm locally using:

volumes:
  - /dev/shm:/dev/shm

In Docker Cloud, it is defined in the same way using the Stack YAML file.
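
Alternatively, as a sketch assuming Compose file format v2 or later, the shared memory size can be set directly on the service with the shm_size key instead of bind-mounting the host's /dev/shm:

weaver:
  image: arachnysdocker/athenapdf-service
  shm_size: 2gb                        # equivalent to docker run --shm-size=2gb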

I am not familiar with it. How do you mount /dev/shm or specify it´s size there ? Is it a Parameter for the Container? Or do you issue docker run commands there?

The same can be done with volumes / mount points in AWS ECS:

(Screenshots of the ECS task definition showing the volume and mount point configuration.)

FrankTheDevop commented 7 years ago

Hey MrSaints,

I tried your suggestion.

Setup

2 hosts, 2 GB main memory and 2 CPUs each.

monitoring:
  image: 'hopsoft/graphite-statsd:latest'
  mem_limit: 256m
  ports:

Situation

I received two types of errors, both shown in the Errors section. I tried the STATSD_ADDRESS environment parameter both with and without a port, yet I cannot see anything logged to statsd. The log output of the weaver container gives no indication of whether it connects to statsd or not.

Errors

Error 1

Error 2

Tried alternatives

I tried running only one machine with one instance of the weaver container. After increasing the timeout on haproxy, I still received the second error shown for 10% of the requests.

Kind Regards, Frank

MrSaints commented 7 years ago

Hi @FrankTheDevop,

I appreciate your findings, and the time you have taken to gather them. If you do not mind, can you try using the following tagged image: athenapdf-service:3.0.0-b?

Thanks!

FrankTheDevop commented 7 years ago

Hi MrSaints,

This time I could only do a quick check. After adapting the parameters to WEAVER_ATHENA_CMD=athenapdf --orientation=Landscape, I get new errors:

Without GIN_MODE=debug

2017-07-19T12:56:27.797421373Z [AthenaPDF] converting to PDF: http://www.heise.de
2017-07-19T12:56:42.804653692Z captured errors:
2017-07-19T12:56:42.804944213Z Error #01: exit status 1 : 2017/07/19 12:56:42 Unable to contact debugger at http://localhost:46767/json after 15 seconds, gave up
2017-07-19T12:56:42.804961799Z
2017-07-19T12:56:42.804968000Z
2017-07-19T12:56:42.805604847Z [GIN] 2017/07/19 - 12:56:42 | 500 | 15.156170601s | 124.121.195.211 | GET /convert
2017-07-19T12:56:42.805618455Z Error #01: exit status 1 : 2017/07/19 12:56:42 Unable to contact debugger at http://localhost:46767/json after 15 seconds, gave up

Sorry for taking so long to answer.

Starting with GIN_MODE=debug

2017-07-19T13:58:23.865411879Z [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
2017-07-19T13:58:23.865476543Z - using env: export GIN_MODE=release
2017-07-19T13:58:23.865486074Z - using code: gin.SetMode(gin.ReleaseMode)
2017-07-19T13:58:23.865490263Z
2017-07-19T13:58:23.867509202Z [GIN-debug] GET /convert --> main.convertByURLHandler (8 handlers)
2017-07-19T13:58:23.867614350Z [GIN-debug] POST /convert --> main.convertByFileHandler (8 handlers)
2017-07-19T13:58:23.867675830Z [GIN-debug] GET / --> main.indexHandler (7 handlers)
2017-07-19T13:58:23.867708455Z [GIN-debug] GET /stats --> main.statsHandler (7 handlers)
2017-07-19T13:58:23.867867872Z [GIN-debug] GET /debug/pprof/ --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.IndexHandler.func1 (7 handlers)
2017-07-19T13:58:23.867968026Z [GIN-debug] GET /debug/pprof/heap --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.HeapHandler.func1 (7 handlers)
2017-07-19T13:58:23.868070939Z [GIN-debug] GET /debug/pprof/goroutine --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.GoroutineHandler.func1 (7 handlers)
2017-07-19T13:58:23.868155466Z [GIN-debug] GET /debug/pprof/block --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.BlockHandler.func1 (7 handlers)
2017-07-19T13:58:23.868238069Z [GIN-debug] GET /debug/pprof/threadcreate --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.ThreadCreateHandler.func1 (7 handlers)
2017-07-19T13:58:23.868320746Z [GIN-debug] GET /debug/pprof/cmdline --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.CmdlineHandler.func1 (7 handlers)
2017-07-19T13:58:23.868414851Z [GIN-debug] GET /debug/pprof/profile --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.ProfileHandler.func1 (7 handlers)
2017-07-19T13:58:23.868520710Z [GIN-debug] GET /debug/pprof/symbol --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.SymbolHandler.func1 (7 handlers)
2017-07-19T13:58:23.868599836Z [GIN-debug] POST /debug/pprof/symbol --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.SymbolHandler.func1 (7 handlers)
2017-07-19T13:58:23.868697930Z [GIN-debug] GET /debug/pprof/trace --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.TraceHandler.func1 (7 handlers)
2017-07-19T13:58:23.868785260Z [GIN-debug] GET /debug/pprof/mutex --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.MutexHandler.func1 (7 handlers)
2017-07-19T13:58:23.868845830Z [GIN-debug] Listening and serving HTTP on :8080
2017-07-19T14:01:35.782737870Z [Worker #9] processing conversion job (pending conversions: 0)
2017-07-19T14:01:35.783542488Z [AthenaPDF] converting to PDF: http://www.heise.de
2017-07-19T14:01:50.793221318Z captured errors:
2017-07-19T14:01:50.793763089Z Error #01: exit status 1 : 2017/07/19 14:01:50 Unable to contact debugger at http://localhost:54106/json after 15 seconds, gave up
2017-07-19T14:01:50.793818386Z
2017-07-19T14:01:50.793829501Z
2017-07-19T14:01:50.794655297Z [GIN] 2017/07/19 - 14:01:50 | 500 | 15.157070006s | 85.214.117.208 | GET /convert
2017-07-19T14:01:50.794678728Z Error #01: exit status 1 : 2017/07/19 14:01:50 Unable to contact debugger at http://localhost:54106/json after 15 seconds, gave up
2017-07-19T14:01:50.794688397Z
2017-07-19T14:01:52.201551567Z [GIN] 2017/07/19 - 14:01:52 | 404 | 7.875µs | 85.214.117.208 | GET /favicon.ico

Result

As you can see, my tests failed. As I needed to finish the customer project, I changed the approach and implemented RabbitMQ for queuing and handling our workflow. If someone has similar struggles and is interested in my solution, I wrote about it here.

Kind Regards, Frank

MrSaints commented 7 years ago

@FrankTheDevop That's really awesome! And I'm so thankful for your continuous feedback. As for the error you have provided, I believe it is related to #119. That is, you might have to run it in privileged mode. I've fixed it recently, but it is not pushed out yet.
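
For reference, a sketch of what running the container in privileged mode looks like in a compose/stack file; the task definition in the next comment shows the equivalent ECS setting ("privileged": true):

weaver:
  image: arachnysdocker/athenapdf-service
  privileged: true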

ozonni commented 6 years ago

We have the same issue, which causes athenapdf to terminate:

2018/07/06 08:34:05 worker=2 queue=0 fetcher= converter= uploader=
2018/07/06 11:32:42 worker=5 queue=0 fetcher= converter= uploader=
2018/07/06 11:32:57 Unable to contact debugger at http://localhost:46645/json after 15 seconds, gave up

Our ECS task definition

{
  "containerDefinitions": [
    {
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "essential": true,
      "environment": [
        {
          "name": "WEAVER_AUTH_KEY",
          "value": "key"
        },
        {
          "name": "WEAVER_ATHENA_CMD",
          "value": "athenapdf --no-javascript"
        },
        {
          "name": "WEAVER_MAX_WORKERS",
          "value": "10"
        },
        {
          "name": "WEAVER_MAX_CONVERSION_QUEUE",
          "value": "50"
        },
        {
          "name": "WEAVER_WORKER_TIMEOUT",
          "value": "90"
        },
        {
          "name": "WEAVER_CONVERSION_FALLBACK",
          "value": "false"
        }
      ],
      "privileged": true,
      "mountPoints": [{
        "sourceVolume": "shm",
        "containerPath": "/dev/shm"
      }],
      "memoryReservation": 256,
      "name": "pdf-rendering",
      "image": "arachnysdocker/athenapdf-service:3"
    }
  ],
  "volumes": [{
    "name": "shm",
    "host": {
      "sourcePath": "/dev/shm"
    }
  }]
}