arachnys / athenapdf

Drop-in replacement for wkhtmltopdf built on Go, Electron and Docker
MIT License

Problem with 50 concurrent accesses to the Weaver Queue #106

Open FrankTheDevop opened 7 years ago

FrankTheDevop commented 7 years ago

Purpose

The purpose of this issue is to show you my experiments with athenapdf in a weaver container and to ask for help in understanding what I am doing wrong.

Thanks

Before I start with the issue I am having, I want to thank you for creating such a great tool. It saves a lot of time and effort. Great job, guys and gals!

Issue

First I describe the task at hand, then my setup with a link to a repository so you can reproduce the problem, followed by the local and remote situations including the errors, and finally the alternative I tried.

According to issue #95 ((weaver) Renderer process crashes when trying to convert some web pages), this is a known problem with Electron.

Next I tried starting multiple parallel instances of this service behind haproxy (the stackfile proxy.yml). It made no difference. But when I use the big.yml stackfile it works: it can handle roughly 120% of the number of started instances as parallel sessions. I experimented with up to 8 entries for the pdfservice in the stackfile, set the number of instances per service to 6, and deployed on 6 nodes. This way I could send 58 almost concurrent calls, and one PDF after another got converted and downloaded.

Conclusion

Because of the garbage collection problem with Electron, the weaver queue isn't working for me with a small number of instances. It only works reliably when I start multiple services with multiple definitions on multiple machines and send at most 20% more calls than there are started instances, but that comes at a high cost.

Please help me understand how I can tune my setup to reduce the number of services that need to be started. We need a working queue and a reliable weaver that can accept 50 concurrent conversion requests and return the PDFs one after another.

Kind Regards, Frank

MrSaints commented 7 years ago

With your current settings:

WEAVER_MAX_CONVERSION_QUEUE=50
WEAVER_MAX_WORKERS=1

You'll only be able to convert one document at a time, with a queue size of 50.

In hindsight, the queue size doesn't do much because all web requests are handled concurrently via goroutines. That is, even without a queue, web requests will not block each other. So it's somewhat of an unnecessary feature (at least I can't remember its usefulness anymore, and that's not a good sign). Even if the queue is overflowing, requests towards the tail end of the queue will simply time out, either on the client or at the load balancer level. Thus, the only value that requires tuning is WEAVER_MAX_WORKERS.

That being said, we've had no problem converting ~150 documents in a span of 10s across 3 machines. Our machines aren't that powerful (they're running on t2.small on AWS).

You said you tried increasing the memory_limit on the container. This is good, especially for dealing with Electron crashes, but it doesn't increase the throughput. I suggest raising WEAVER_MAX_WORKERS to about 5-10, based on your specs.

You should probably benchmark how long it takes on average for each document (out of your 50) to be converted. If it takes 5 seconds on average, and you have WEAVER_MAX_WORKERS set to 1, that means you will only make one conversion every 5 seconds. Set that to 2, and you'll have 2 conversions every 5 seconds. Keep bumping it up until you start to notice resource constraints. At that point you have an option of either scaling horizontally (more machines) or vertically (larger machines).
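
To make that concrete, here is a minimal sketch of how this tuning could look in a docker-compose / stackfile service definition. The service name, memory limit and worker count are illustrative assumptions, not recommendations:

weaver:
  image: arachnysdocker/athenapdf-service
  mem_limit: 3g                        # illustrative; match your machine's specs
  environment:
    - WEAVER_MAX_CONVERSION_QUEUE=50
    - WEAVER_MAX_WORKERS=5             # raise gradually until you hit resource limits
  ports:
    - "8080:8080"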

You might also find it useful to mount /dev/shm.

Let me know if that helps.

FrankTheDevop commented 7 years ago

Hey MrSaints,

thank you for your fast response, and sorry for the late reply. This is what I figured out based on your suggestions.

Setup

Result

Sending 50 requests with a delay of 100 ms between each results in 10% of the documents failing. Increasing the --shm-size parameter further to 2560m didn't change this.

Error

As before I receive this error:

Conclusion

With a single instance I am unable to achieve reliable PDF conversion for 50 concurrent requests on a machine with 1 CPU and 3 GB of memory.

Further Ideas

Starting two or three instances together and letting a proxy round-robin between them may solve the problem. Yet I am unable to increase the /dev/shm size while using docker-compose locally or a stackfile on Docker Cloud. How do you do it on your AWS t2 instances? Do you manage them through a service like Docker Cloud, or do you do it manually with docker run?

Kind regards, Frank

MrSaints commented 7 years ago

Those errors can be safely ignored. They don't seem relevant to your problem.

We are managing the instances via AWS EC2 Container Service (ECS). It should be fairly trivial to get a similar setup with Google Cloud (Kubernetes) or even Docker Cloud since the service is entirely stateless.

Given that you're trying to handle 50 conversions (probably under a minute) reliably, I'd suggest at least load balancing the conversions across 2 instances. I believe it is better to have slightly more resources than needed (redundancy), rather than running on the brink.
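
A rough, untested sketch of what that could look like as a Docker Cloud style stackfile, with two weaver services behind a round-robin proxy (service names, worker counts and the dockercloud/haproxy image are placeholders; any HTTP load balancer will do):

pdf1:
  image: arachnysdocker/athenapdf-service
  environment:
    - WEAVER_MAX_WORKERS=5
pdf2:
  image: arachnysdocker/athenapdf-service
  environment:
    - WEAVER_MAX_WORKERS=5
lb:
  image: dockercloud/haproxy           # placeholder round-robin proxy
  links:
    - pdf1
    - pdf2
  ports:
    - "80:80"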

FrankTheDevop commented 7 years ago

Hey MrSaints,

Benchmark

It heavily depended on the number of Workers.

The time is measured from the curl call to weaver until the PDF is downloaded locally. One thing I found interesting is that the time needed to fulfil one job rises with the number of jobs already done.

Mounting /dev/shm

Locally I used your container and set shm-size via a docker run parameter, so it was mounted with up to 2560 MB. Yet I don't know if that is what you wanted to know. Remounting inside the container raises a permission denied error.

10% Errors

The resulting files mostly contained the JSON response from weaver with a 500 Internal Server Error. Two PDFs differed in size from the others; they didn't load the same amount of content from the target website http://www.heise.de.

AWS EC2 Container Service

I am not familiar with it. How do you mount /dev/shm or specify its size there? Is it a parameter for the container? Or do you issue docker run commands there?

2 Instances

I agree that load balancing across 2 instances makes sense. I plan a buffer of 20%, so I am aiming for 60 concurrent conversion requests. Currently I am failing to increase /dev/shm when using docker-compose locally or a stackfile on Docker Cloud.

Kind regards, Frank

MrSaints commented 7 years ago

10 Workers took on average 30-60 seconds

Is that per document / URL? If so, that is relatively high latency.

that the time needed for fulfilling one jobs raises with the number of jobs done.

The reason for this observation is that the later jobs are at the back of the queue. It is not necessarily because one job raises the time needed to process the next job. That being said, if the queue is being occupied by slow conversions, they will raise the time (from request start to request end) for subsequent conversions.

If time is not a constraint, I would suggest raising the timeout (both on Weaver, and on the load balancer / reverse proxy in front of the microservice). If it is a constraint, you will have to increase the number of workers so that the queue can be cleared quickly. But, the latter entails resolving bottlenecks at a resource-level (increase your machine's specs). Alternatively, you can scale horizontally (another way of raising the workers).
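
If you go down the timeout route, here is a sketch of the weaver side of it. WEAVER_WORKER_TIMEOUT also appears in the ECS task definition later in this thread; the value of 90 is only an example, and the proxy in front of the service needs its timeout raised separately:

weaver:
  image: arachnysdocker/athenapdf-service
  environment:
    - WEAVER_MAX_WORKERS=10
    - WEAVER_WORKER_TIMEOUT=90         # example value; keep the haproxy timeout above this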

The resulting files contained mostly the json response from weaver with 500 Internal Server error.

Considering that you are receiving 500 errors, I do not think they are timeout related. It might be worth investigating further as to why you are receiving the 500s. Try enabling GIN_MODE=debug, and perhaps setting the service up with statsd for metrics. If it is too much work, just start with the former, and ignore the latter. Though, I think you should also be tracking metrics for your machine (i.e. CPU, and RAM usage).
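
As a sketch, those could be switched on via environment variables in the compose/stack file; STATSD_ADDRESS is the variable tried later in this thread, and the host/port value here is purely an example:

weaver:
  image: arachnysdocker/athenapdf-service
  environment:
    - GIN_MODE=debug                   # verbose request logging from Gin
    - STATSD_ADDRESS=statsd:8125       # example host:port of your statsd instance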

Currently I fail increasing /dev/shm when using docker-compose locally or a stackfile on Docker Cloud.

If you're using Docker Compose, you can mount /dev/shm locally using:

volumes:
  - /dev/shm:/dev/shm

In Docker Cloud, it is defined in the same way using the Stack YAML file.
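
Alternatively, as a sketch assuming Compose file format v2 or later, the shared memory size can be set directly on the service with the shm_size key instead of bind-mounting the host's /dev/shm:

weaver:
  image: arachnysdocker/athenapdf-service
  shm_size: 2gb                        # equivalent to docker run --shm-size=2gb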

I am not familiar with it. How do you mount /dev/shm or specify it´s size there ? Is it a Parameter for the Container? Or do you issue docker run commands there?

The same can be done with volumes / mount points in AWS ECS:

(Screenshots of the ECS task definition showing the volume and mount point configuration.)

FrankTheDevop commented 7 years ago

Hey MrSaints,

I tried your suggestion.

Setup

2 hosts, 2 GB main memory and 2 CPUs each.

monitoring:
  image: 'hopsoft/graphite-statsd:latest'
  mem_limit: 256m
  ports:

Situation

I received two types of errors, both shown in the Errors section. I tried the STATSD_ADDRESS environment parameter both with and without a port, yet I cannot see anything logged to statsd. The log output of the weaver container gives no indication of whether it connects to statsd or not.

Errors

Error 1

Error 2

Tried alternatives

I tried running only one machine with one instance of the weaver container. After increasing the timeout on haproxy, I still received the second error shown for 10% of the requests.

Kind Regards, Frank

MrSaints commented 7 years ago

Hi @FrankTheDevop,

I appreciate your findings, and the time you have taken to gather them. If you do not mind, can you try using the following tagged image: athenapdf-service:3.0.0-b?

Thanks!

FrankTheDevop commented 7 years ago

Hi MrSaints,

This time I could only do a quick check. After adapting the parameters to WEAVER_ATHENA_CMD=athenapdf --orientation=Landscape, I get new errors:

Without GIN_MODE=debug

2017-07-19T12:56:27.797421373Z [AthenaPDF] converting to PDF: http://www.heise.de
2017-07-19T12:56:42.804653692Z captured errors:
2017-07-19T12:56:42.804944213Z Error #01: exit status 1 : 2017/07/19 12:56:42 Unable to contact debugger at http://localhost:46767/json after 15 seconds, gave up
2017-07-19T12:56:42.804961799Z
2017-07-19T12:56:42.804968000Z
2017-07-19T12:56:42.805604847Z [GIN] 2017/07/19 - 12:56:42 | 500 | 15.156170601s | 124.121.195.211 | GET /convert
2017-07-19T12:56:42.805618455Z Error #01: exit status 1 : 2017/07/19 12:56:42 Unable to contact debugger at http://localhost:46767/json after 15 seconds, gave up

Sorry for taking so long to answer.

Starting with GIN_MODE=debug

2017-07-19T13:58:23.865411879Z [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
2017-07-19T13:58:23.865476543Z - using env: export GIN_MODE=release
2017-07-19T13:58:23.865486074Z - using code: gin.SetMode(gin.ReleaseMode)
2017-07-19T13:58:23.865490263Z
2017-07-19T13:58:23.867509202Z [GIN-debug] GET /convert --> main.convertByURLHandler (8 handlers)
2017-07-19T13:58:23.867614350Z [GIN-debug] POST /convert --> main.convertByFileHandler (8 handlers)
2017-07-19T13:58:23.867675830Z [GIN-debug] GET / --> main.indexHandler (7 handlers)
2017-07-19T13:58:23.867708455Z [GIN-debug] GET /stats --> main.statsHandler (7 handlers)
2017-07-19T13:58:23.867867872Z [GIN-debug] GET /debug/pprof/ --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.IndexHandler.func1 (7 handlers)
2017-07-19T13:58:23.867968026Z [GIN-debug] GET /debug/pprof/heap --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.HeapHandler.func1 (7 handlers)
2017-07-19T13:58:23.868070939Z [GIN-debug] GET /debug/pprof/goroutine --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.GoroutineHandler.func1 (7 handlers)
2017-07-19T13:58:23.868155466Z [GIN-debug] GET /debug/pprof/block --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.BlockHandler.func1 (7 handlers)
2017-07-19T13:58:23.868238069Z [GIN-debug] GET /debug/pprof/threadcreate --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.ThreadCreateHandler.func1 (7 handlers)
2017-07-19T13:58:23.868320746Z [GIN-debug] GET /debug/pprof/cmdline --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.CmdlineHandler.func1 (7 handlers)
2017-07-19T13:58:23.868414851Z [GIN-debug] GET /debug/pprof/profile --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.ProfileHandler.func1 (7 handlers)
2017-07-19T13:58:23.868520710Z [GIN-debug] GET /debug/pprof/symbol --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.SymbolHandler.func1 (7 handlers)
2017-07-19T13:58:23.868599836Z [GIN-debug] POST /debug/pprof/symbol --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.SymbolHandler.func1 (7 handlers)
2017-07-19T13:58:23.868697930Z [GIN-debug] GET /debug/pprof/trace --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.TraceHandler.func1 (7 handlers)
2017-07-19T13:58:23.868785260Z [GIN-debug] GET /debug/pprof/mutex --> github.com/arachnys/athenapdf/vendor/github.com/DeanThompson/ginpprof.MutexHandler.func1 (7 handlers)
2017-07-19T13:58:23.868845830Z [GIN-debug] Listening and serving HTTP on :8080
2017-07-19T14:01:35.782737870Z [Worker #9] processing conversion job (pending conversions: 0)
2017-07-19T14:01:35.783542488Z [AthenaPDF] converting to PDF: http://www.heise.de
2017-07-19T14:01:50.793221318Z captured errors:
2017-07-19T14:01:50.793763089Z Error #01: exit status 1 : 2017/07/19 14:01:50 Unable to contact debugger at http://localhost:54106/json after 15 seconds, gave up
2017-07-19T14:01:50.793818386Z
2017-07-19T14:01:50.793829501Z
2017-07-19T14:01:50.794655297Z [GIN] 2017/07/19 - 14:01:50 | 500 | 15.157070006s | 85.214.117.208 | GET /convert
2017-07-19T14:01:50.794678728Z Error #01: exit status 1 : 2017/07/19 14:01:50 Unable to contact debugger at http://localhost:54106/json after 15 seconds, gave up
2017-07-19T14:01:50.794688397Z
2017-07-19T14:01:52.201551567Z [GIN] 2017/07/19 - 14:01:52 | 404 | 7.875µs | 85.214.117.208 | GET /favicon.ico

Result

As you can see, my tests failed. As I needed to finish the customer project, I changed the approach and implemented RabbitMQ for queuing and handling our workflow. If someone has similar struggles and is interested in my solution, I wrote about it here.

Kind Regards, Frank

MrSaints commented 7 years ago

@FrankTheDevop That's really awesome! And I'm so thankful for your continuous feedback. As for the error you have provided, I believe it is related to #119. That is, you might have to run it in privileged mode. I've fixed it recently, but it is not pushed out yet.
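
For reference, a sketch of what running the container in privileged mode looks like in a compose/stack file; the task definition in the next comment shows the equivalent ECS setting ("privileged": true):

weaver:
  image: arachnysdocker/athenapdf-service
  privileged: true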

ozonni commented 6 years ago

We have the same issue, which causes athenapdf to terminate:

2018/07/06 08:34:05 worker=2 queue=0 fetcher= converter= uploader=
2018/07/06 11:32:42 worker=5 queue=0 fetcher= converter= uploader=
2018/07/06 11:32:57 Unable to contact debugger at http://localhost:46645/json after 15 seconds, gave up

Our ECS task definition

{
  "containerDefinitions": [
    {
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "essential": true,
      "environment": [
        {
          "name": "WEAVER_AUTH_KEY",
          "value": "key"
        },
        {
          "name": "WEAVER_ATHENA_CMD",
          "value": "athenapdf --no-javascript"
        },
        {
          "name": "WEAVER_MAX_WORKERS",
          "value": "10"
        },
        {
          "name": "WEAVER_MAX_CONVERSION_QUEUE",
          "value": "50"
        },
        {
          "name": "WEAVER_WORKER_TIMEOUT",
          "value": "90"
        },
        {
          "name": "WEAVER_CONVERSION_FALLBACK",
          "value": "false"
        }
      ],
      "privileged": true,
      "mountPoints": [{
        "sourceVolume": "shm",
        "containerPath": "/dev/shm"
      }],
      "memoryReservation": 256,
      "name": "pdf-rendering",
      "image": "arachnysdocker/athenapdf-service:3"
    }
  ],
  "volumes": [{
    "name": "shm",
    "host": {
      "sourcePath": "/dev/shm"
    }
  }]
}