Hypothesis: SQL calls from the Python services are slow. Calls not using SQL should be fast.
Test: hit the `/ready` endpoints from the homepage (see branch `frontend-test-ready`).
Result: Calls to `/ready` still have average ~500ms latencies.
Hypothesis: Flask and Gunicorn aren't configured for maximum performance.
Test: set `threaded=True` for Flask and `workers=5 threads=2` for Gunicorn (a sketch of these settings follows the result below).
Result: No effect. This latency remains even when turning off the load generator and making one request a minute.
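For context, a minimal sketch of where those Gunicorn settings could live, assuming a `gunicorn.conf.py` config file (the file name and layout here are illustrative; the same values can be passed as CLI flags instead):

```python
# gunicorn.conf.py -- hypothetical config file, shown for illustration;
# equivalent to the CLI flags `gunicorn --workers 5 --threads 2 app:app`
workers = 5   # number of worker processes
threads = 2   # threads per worker (Gunicorn switches to the gthread worker class)
```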
Hypothesis: Python Dockerfiles use an image with baggage (`gcr.io/google-appengine/python`).
Test: switch the base image to `FROM python:3`.
Result: No effect.
I spent some time debugging this, and have determined that the Cloud Trace Exporter for OpenTelemetry is very likely causing the latency, which is ironic. :)
My steps to find the problem:
- Used the `time` command to ping the frontend, and found that the OT Python Tracing commit (early July) is when the problems (>3-4s of latency) began.
- Upgraded the OpenTelemetry packages to `0.13b0`. Latency issues persisted after upgrade.
- Commented out the span-export lines in the tracing setup:

```python
# Set up tracing and export spans to Cloud Trace
trace.set_tracer_provider(TracerProvider())
# cloud_trace_exporter = CloudTraceSpanExporter()
# trace.get_tracer_provider().add_span_processor(
#     SimpleExportSpanProcessor(cloud_trace_exporter)
# )
set_global_textmap(CloudTraceFormatPropagator())
FlaskInstrumentor().instrument_app(app)
```
This improved the latency dramatically. For example, with those exporter lines uncommented (trace export on), pinging the frontend averaged 5.12 seconds:
```
➜  ~ hey -n 100 -c 10 http://104.196.36.38:80

Summary:
  Total:        53.4578 secs
  Slowest:      7.4205 secs
  Fastest:      3.5342 secs
  Average:      5.1236 secs
  Requests/sec: 1.8706
```
When I comment out those lines, effectively turning off trace export, we see an average of 86ms:
```
➜  ~ hey -n 100 -c 10 http://104.196.36.38:80

Summary:
  Total:        0.9168 secs
  Slowest:      0.3131 secs
  Fastest:      0.0224 secs
  Average:      0.0865 secs
  Requests/sec: 109.0697

  Total data:   658900 bytes
  Size/request: 6589 bytes
```
Update: thanks to a helpful tip from the OpenTelemetry Python + GCP folks, the culprit is not the exporter itself but the span processor we're using. `SimpleExportSpanProcessor` works synchronously, exporting each span inline as it ends. Switching to `BatchExportSpanProcessor` seems to have largely fixed the latency problem. 🎉
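For reference, a minimal sketch of the fixed setup, assuming the 0.13b0-era module paths (exact import locations may differ between releases). The batch processor queues spans and exports them on a background thread instead of blocking each request:

```python
from opentelemetry import trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchExportSpanProcessor

# Set up tracing, exporting spans to Cloud Trace asynchronously in batches
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchExportSpanProcessor(CloudTraceSpanExporter())
)
```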
```
➜  ~ hey -n 100 -c 10 http://104.196.36.38

Summary:
  Total:        1.2084 secs
  Slowest:      0.6937 secs
  Fastest:      0.0242 secs
  Average:      0.1018 secs
  Requests/sec: 82.7556

  Total data:   658900 bytes
  Size/request: 6589 bytes
```
And requests to the frontend `/` (home) are down to ~100ms, versus multiple seconds. :D
After fiddling with this for a long while, I'm inclined to believe this unexpected latency is common to all our Python services.
I inserted HTTP calls to the `/ready` endpoint for the contactservice and the userservice, and both calls have large latencies even though the function is simply `return "ok", 200` (a minimal sketch follows below).
Originally posted by @djmailhot in https://github.com/GoogleCloudPlatform/bank-of-anthos/issues/295#issuecomment-695057813