To enable fine-grained tracing for Alcor, we plan to use OSProfiler, Jaeger, and Rally together to provide end-to-end tracing support. The end goal is to generate one trace per request that spans all involved services, including Nova, Keystone, and Alcor. With OSProfiler, this trace can then be extracted and used to build a tree of calls, which helps isolate cross-service performance issues and locate the performance bottleneck quickly.
OSProfiler is used for tracing in OpenStack services outside of Alcor.
Jaeger is used for tracing Alcor, a microservices-based distributed system.
Rally is used to write complex test scenarios for public-cloud customers.
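The tree of calls mentioned above can be sketched as follows. This is a minimal, standard-library-only illustration and not OSProfiler's or Jaeger's actual data model: the `Span` fields, the helper functions, and the service and operation names in `SPANS` are all hypothetical. It assumes each recorded call carries its own ID, its parent's ID, and start/end timestamps, and that sibling calls run sequentially.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One recorded call inside a single request trace (hypothetical shape)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root call of the request
    service: str
    name: str
    start_ms: float
    end_ms: float
    children: List["Span"] = field(default_factory=list)

    @property
    def latency_ms(self) -> float:
        return self.end_ms - self.start_ms


def build_call_tree(spans: List[Span]) -> Span:
    """Link flat span records into a tree and return the root span."""
    by_id = {s.span_id: s for s in spans}
    for s in spans:
        s.children = []  # reset so the function can be called repeatedly
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        elif s.parent_id in by_id:
            by_id[s.parent_id].children.append(s)
        else:
            raise ValueError(f"span {s.span_id} has unknown parent {s.parent_id}")
    if root is None:
        raise ValueError("trace has no root span")
    for s in spans:
        s.children.sort(key=lambda c: c.start_ms)  # preserve call order
    return root


def render(span: Span, depth: int = 0) -> List[str]:
    """Return one indented line per call: service, operation, latency."""
    lines = [f"{'  ' * depth}{span.service}:{span.name} {span.latency_ms:.1f} ms"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


def self_time_ms(span: Span) -> float:
    """Time spent in the call itself, assuming children run sequentially."""
    return span.latency_ms - sum(c.latency_ms for c in span.children)


def bottleneck(root: Span) -> Span:
    """Return the span with the largest self time in the tree."""
    best, stack = root, [root]
    while stack:
        span = stack.pop()
        if self_time_ms(span) > self_time_ms(best):
            best = span
        stack.extend(span.children)
    return best


# Hypothetical trace of one "boot a VM" request crossing three services.
SPANS = [
    Span("1", None, "nova", "boot_server", 0, 120),
    Span("2", "1", "keystone", "validate_token", 5, 15),
    Span("3", "1", "alcor", "create_port", 20, 110),
    Span("4", "3", "alcor", "allocate_ip", 25, 95),
]

if __name__ == "__main__":
    tree = build_call_tree(SPANS)
    print("\n".join(render(tree)))
    print("bottleneck:", bottleneck(tree).name)
```

For the sample data, the rendered tree lists `nova:boot_server` at the top with `keystone:validate_token` and `alcor:create_port` nested beneath it, and `bottleneck` returns the `allocate_ip` call, since it has the largest self time (70 ms).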
Detailed Requirements
Support major user scenarios for public-cloud customers, including but not limited to booting a VM, attaching a VNIC/port to a VM, associating a secondary private IP with a VNIC, and creating a VPC/network and subnet.
Stress test each scenario until it reaches its performance bottleneck.
Generate one trace per request that spans all involved services, including Nova, Keystone, and Alcor, and shows a tree of calls (with the order of calls, the names of the involved services and/or sub-services, and the latency of each call) on a single HTTP page.
The tools selected for performance profiling should be easy to integrate into OpenStack and Alcor.
They should not require many changes to the code bases of the services they are integrated with.
Tracing should be easy to turn on fully for performance analysis and to turn off in production for performance reasons.
Support a lazy mode in production (e.g. an admin should be able to keep it turned on in lazy mode in production and “trace” on request).
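One way to read the lazy-mode requirement is that tracing stays dormant until a request carries a signed opt-in. The sketch below illustrates that idea with the standard library only. It is an assumption about how such a gate could work, not a description of OSProfiler's or Jaeger's implementation: the header names, the `HMAC_KEY` secret, and both helper functions are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Dict, Mapping, Optional

# Hypothetical values: a shared secret known to admins and services,
# and illustrative header names for the opt-in.
HMAC_KEY = b"SECRET_KEY"
TRACE_INFO_HEADER = "X-Trace-Info"
TRACE_HMAC_HEADER = "X-Trace-HMAC"


def sign(payload: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of the payload."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def make_trace_headers(trace_id: str, key: bytes) -> Dict[str, str]:
    """Client side: build the headers that ask the service to trace this request."""
    info = json.dumps({"trace_id": trace_id})
    return {
        TRACE_INFO_HEADER: info,
        TRACE_HMAC_HEADER: sign(info.encode(), key),
    }


def requested_trace_id(
    headers: Mapping[str, str], key: bytes, enabled: bool = True
) -> Optional[str]:
    """Service side: return the trace ID only for a validly signed opt-in.

    Returns None when tracing is disabled, when the headers are missing,
    or when the signature does not match, so normal requests pay almost
    no tracing cost.
    """
    if not enabled:
        return None
    info = headers.get(TRACE_INFO_HEADER)
    signature = headers.get(TRACE_HMAC_HEADER)
    if info is None or signature is None:
        return None
    expected = sign(info.encode(), key)
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return None
    return json.loads(info)["trace_id"]


if __name__ == "__main__":
    headers = make_trace_headers("req-42", HMAC_KEY)
    print(requested_trace_id(headers, HMAC_KEY))  # signed request: traced
    print(requested_trace_id({}, HMAC_KEY))       # ordinary request: not traced
```

The `enabled` flag models the separate requirement that tracing can be switched off entirely, while the signature check restricts "trace on request" to callers who hold the key.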
Non Goal
Write a new tracing tool or library. Instead, we leverage state-of-the-art tools from the open-source community that fit the needs of the Alcor project.
References
https://github.com/openstack/osprofiler provides a tiny but powerful library that is used by most (soon to be all) OpenStack projects and their Python clients. It generates one trace per request that spans all involved services and builds a tree of calls.
https://github.com/jaegertracing/jaeger is a distributed tracing platform created by Uber Technologies and donated to CNCF. It graduated in October 2019 as the 7th top-level CNCF project. It can be used for monitoring microservices-based distributed systems.
https://github.com/openstack/rally is a tool and framework for OpenStack that is capable of performing specific, complicated, and reproducible test cases on real deployment scenarios.