GoogleCloudPlatform / bank-of-anthos

Retail banking sample application showcasing Kubernetes and Google Cloud
https://cymbal-bank.fsi.cymbal.dev
Apache License 2.0

All Python services have very slow HTTP requests #356

Closed: djmailhot closed this issue 4 years ago

djmailhot commented 4 years ago

After fiddling with this for a long while, I'm inclined to believe this unexpected latency is common to all our Python services.

I inserted HTTP calls to the /ready endpoint of the contactservice and the userservice, and both calls show large latencies even though the handler simply does return "ok", 200. (Screenshot: frontend trace of the /ready calls.)
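For reference, the /ready handler in question looks roughly like this (a minimal sketch of a Flask readiness route; the exact code in userservice/contactservice may differ):

    from flask import Flask

    app = Flask(__name__)

    @app.route("/ready", methods=["GET"])
    def readiness():
        # No database or network work here, so any latency on this
        # endpoint comes from the request path itself, not the handler.
        return "ok", 200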

Originally posted by @djmailhot in https://github.com/GoogleCloudPlatform/bank-of-anthos/issues/295#issuecomment-695057813

djmailhot commented 4 years ago

So far I've ruled out the following:

SQL latency

Hypothesis: SQL calls from the Python services are slow; calls that don't touch SQL should be fast.

Result: Calls to /ready still average ~500ms.

Threading & Scaling issues

Hypothesis: Flask and Gunicorn aren't configured for maximum performance (see the config sketch after this list).

Result: No effect. The latency remains even with the load generator turned off and only one request made per minute.

Container base image

Hypothesis: The Python Dockerfiles use a heavyweight base image (gcr.io/google-appengine/python) that adds overhead.

Result: No effect.
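For context, the Gunicorn tuning tried for the threading hypothesis was along these lines (a hypothetical gunicorn.conf.py; the worker and thread counts are illustrative, not the repo's actual settings):

    # gunicorn.conf.py (illustrative values only)
    bind = "0.0.0.0:8080"
    worker_class = "gthread"   # threaded workers so requests don't serialize behind one another
    workers = 4                # worker processes
    threads = 8                # threads per worker
    timeout = 30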

askmeegs commented 4 years ago

I spent some time debugging this, and have determined that the Cloud Trace Exporter for OpenTelemetry is very likely causing the latency, which is ironic. :)

All my notes are here.

My steps to find the problem:

  1. I noticed that bank-of-anthos.xyz (running v0.2.0 from May) is much faster than master. I went back into the commit history and tested a bunch of commits, using the time command to ping the frontend, and found that the problems (>3-4s of latency) began with the OT Python tracing commit from early July.
  2. I knew that the Cloud Trace library for OT moved repos / had a newer version, so to make sure we weren't stuck on an old/slow version, I went ahead and upgraded all the OT packages (OT, OT + GCP, Flask instr., SQLAlchemy instr.) to the latest version, 0.13b0. Latency issues persisted after upgrade.
  3. I systematically commented out instrumentation across all the services, e.g. removing the Flask instrumentation, then the SQLAlchemy instrumentation (a sketch of that SQLAlchemy call is at the end of this comment). Eventually I found that commenting out the two statements that enable Cloud Trace export...
        # Set up tracing and export spans to Cloud Trace
        trace.set_tracer_provider(TracerProvider())
        # cloud_trace_exporter = CloudTraceSpanExporter()
        # trace.get_tracer_provider().add_span_processor(
        #     SimpleExportSpanProcessor(cloud_trace_exporter)
        # )
        set_global_textmap(CloudTraceFormatPropagator())
        FlaskInstrumentor().instrument_app(app)

...improved the latency dramatically. For example, with trace export enabled (those statements uncommented), pinging the frontend averaged 5.12 seconds:

➜  ~ hey -n 100 -c 10 http://104.196.36.38:80

Summary:
  Total:    53.4578 secs
  Slowest:  7.4205 secs
  Fastest:  3.5342 secs
  Average:  5.1236 secs
  Requests/sec: 1.8706

When I comment out those lines, effectively turning off trace export, the average drops to ~86ms:

➜  ~ hey -n 100 -c 10 http://104.196.36.38:80

Summary:
  Total:    0.9168 secs
  Slowest:  0.3131 secs
  Fastest:  0.0224 secs
  Average:  0.0865 secs
  Requests/sec: 109.0697

  Total data:   658900 bytes
  Size/request: 6589 bytes
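
For reference, the SQLAlchemy instrumentation toggled in step 3 is wired up roughly like this. A minimal sketch assuming the 0.13b0-era opentelemetry-instrumentation-sqlalchemy package; the engine and connection string here are hypothetical, not the repo's actual values.

    from sqlalchemy import create_engine
    from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

    # Hypothetical engine/URI, for illustration only.
    engine = create_engine("postgresql+psycopg2://user:pass@accounts-db:5432/accounts-db")

    # Wraps queries on this engine in spans; this is one of the calls
    # that was commented out during the bisection in step 3.
    SQLAlchemyInstrumentor().instrument(engine=engine)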
askmeegs commented 4 years ago

Update: thanks to a helpful tip from the OpenTelemetry Python + GCP folks, the culprit is not the exporter itself but the span processor we're using. SimpleExportSpanProcessor exports each span synchronously, blocking the request while it calls out to the Cloud Trace API. Switching to BatchExportSpanProcessor (sketched at the end of this comment) seems to have largely fixed the latency problem. 🎉

➜  ~ hey -n 100 -c 10 http://104.196.36.38

Summary:
  Total:    1.2084 secs
  Slowest:  0.6937 secs
  Fastest:  0.0242 secs
  Average:  0.1018 secs
  Requests/sec: 82.7556

  Total data:   658900 bytes
  Size/request: 6589 bytes

And requests to frontend / (home) are down to ~100ms, versus multiple seconds. :D
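
The fix amounts to swapping the span processor when wiring up the exporter. A minimal sketch, assuming the 0.13b0-era import paths; the actual change in the repo may differ slightly:

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchExportSpanProcessor
    from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

    trace.set_tracer_provider(TracerProvider())
    cloud_trace_exporter = CloudTraceSpanExporter()
    trace.get_tracer_provider().add_span_processor(
        # BatchExportSpanProcessor queues finished spans and exports them on a
        # background thread, instead of blocking each request the way
        # SimpleExportSpanProcessor does.
        BatchExportSpanProcessor(cloud_trace_exporter)
    )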

(Screenshot from 2020-09-21 showing the improved latency.)