NCEAS / metadig-engine

MetaDig Engine: multi-dialect metadata assessment engine

Nginx "502 Bad Gateway" #130

Open gothub opened 6 years ago

gothub commented 6 years ago

When many requests are sent to k8s metadig, Nginx responds with either:

504 Gateway Time-out or 502 Bad Gateway

The former message is received when 1000 requests are sent in less than 60 seconds. The latter message is more common when 1000 requests are sent with a 1 second pause after every 10 requests.
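For reference, the paced variant of the test (a pause after every 10 requests) can be sketched roughly as below. This is a minimal sketch, not the actual test script: `send_paced` and its `send` callable are hypothetical names, and `send` is assumed to POST one metadata document to the suite endpoint and return the HTTP status code.

```python
import time
from collections import Counter
from typing import Callable, Iterable


def send_paced(
    payloads: Iterable[bytes],
    send: Callable[[bytes], int],
    batch_size: int = 10,
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Counter:
    """Send each payload via `send` and tally the status codes it returns.

    `send` is a hypothetical callable (assumption): it submits one document
    and returns the HTTP status code, e.g. 200, 502 or 504.
    Sleeps for `pause_seconds` after every `batch_size` requests.
    """
    statuses = Counter()
    for count, payload in enumerate(payloads, start=1):
        statuses[send(payload)] += 1
        # Pause once a full batch has been sent.
        if count % batch_size == 0:
            sleep(pause_seconds)
    return statuses
```

Setting `pause_seconds=0` approximates the unpaced run, and the returned counter makes it easy to compare how many 502 and 504 responses each pacing produces.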

It appears that Nginx or the connection itself is becoming saturated.

BTW, the full test involves sending 10000 unique metadata documents to be quality scored. This test will be run as soon as these Gateway problems have been resolved.

gothub commented 6 years ago

Note that the "Gateway Time-out" message is being received by the client (a Python script) that is sending requests to k8s. The NGINX instance (the k8s ingress), however, is printing messages like these:

2018/06/12 18:03:18 [error] 1878#1878: *896 upstream timed out (110: Connection timed out) while connecting to upstream, client: 192.168.25.64, server: docker-ucsb-1.test.dataone.org, request: "POST /metadig-webapp/suites/knb.suite.1/run HTTP/1.1", upstream: "http://192.168.158.24:80/metadig-webapp/suites/knb.suite.1/run", host: "docker-ucsb-1.test.dataone.org:30080"

which suggests that the Apache container is the bottleneck.
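To see which upstream and which phase the timeouts hit across a whole log, the error lines can be parsed with something like the following. This is only a sketch: the regular expression is written against the single log line quoted above and is an assumption about the general format, and `parse_upstream_timeout` is a hypothetical helper name.

```python
import re
from typing import Optional

# Pattern modelled on the one error line quoted in this issue (assumption:
# other "upstream timed out" lines follow the same field order).
UPSTREAM_TIMEOUT = re.compile(
    r'upstream timed out \((?P<errno>\d+): (?P<reason>[^)]*)\) '
    r'while (?P<phase>[^,]+), '
    r'client: (?P<client>[^,]+), .*?'
    r'upstream: "(?P<upstream>[^"]+)"'
)


def parse_upstream_timeout(line: str) -> Optional[dict]:
    """Return errno, reason, phase, client and upstream for a timeout line.

    Returns None when the line is not an "upstream timed out" message.
    """
    match = UPSTREAM_TIMEOUT.search(line)
    return match.groupdict() if match else None
```

The `phase` field is the part worth watching: "connecting to upstream" means the timeout occurred while NGINX was still trying to establish the connection, whereas a phase that mentions reading the response would point at a slow reply instead.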

The current configuration is "k8s ingress (NGINX)" -> Apache2 -> Apache Tomcat

The Apache2 container isn't really required here, as the k8s NGINX ingress can provide SSL/TLS termination and routing, so the next thing to try is a direct connection between NGINX and Tomcat.

gothub commented 6 years ago

Update: the Apache2 container has been removed, so NGINX now sends requests directly to port 8080 of the container running Apache Tomcat. The metadig-engine controller is running in this container.

NGINX is still printing the '... upstream timed out' messages; however, all quality document requests are being processed and indexed into Solr.

I suspect that connections between NGINX and Tomcat are staying open, then timing out. It appears that the response from Tomcat isn't being received by NGINX.