cc @smarterclayton @jeremyeder
@xiang90 you mentioned per-key metrics were already exposed somewhere?
@jeremyeder Sorry about that. I misunderstood the question.
@spacejam We already have an etcd proxy today. Do you think it is possible to add an analysis feature into the proxy?
That's one place you could certainly add some instrumentation, but it isn't much help for answering questions like "why is my production cluster being taken down by client load?" unless clients are already going through that proxy, since putting them behind it involves a significant amount of on-the-fly reconfiguration. Here's a super naive tool I just whipped up that uses libpcap to pull out the top-K URLs being hit along with their verbs. It can be run against arbitrary production systems with only a small amount of overhead from the kernel taking an extra reference on each sk_buff: https://github.com/mesosphere/etcd-top
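For illustration only, here's a rough sketch of that kind of sniffing loop using the gopacket libpcap bindings. This is not the etcd-top code itself; the interface name and port are placeholders, and it naively assumes a request line fits in a single TCP segment.

```go
// Rough sketch of the libpcap sniffing approach using the gopacket bindings.
// Assumptions: "eth0" and port 2379 are placeholders for your environment,
// and each HTTP request starts at the beginning of a segment (no reassembly).
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"net/http"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcap"
)

func main() {
	handle, err := pcap.OpenLive("eth0", 65535, false, pcap.BlockForever)
	if err != nil {
		panic(err)
	}
	defer handle.Close()
	// Only inspect traffic headed for the etcd client port.
	if err := handle.SetBPFFilter("tcp dst port 2379"); err != nil {
		panic(err)
	}

	counts := map[string]uint64{} // "VERB /path" -> hit count
	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for packet := range src.Packets() {
		app := packet.ApplicationLayer()
		if app == nil {
			continue
		}
		req, err := http.ReadRequest(bufio.NewReader(bytes.NewReader(app.Payload())))
		if err != nil {
			continue // payload isn't the start of an HTTP request
		}
		key := req.Method + " " + req.URL.Path
		counts[key]++
		fmt.Printf("%6d %s\n", counts[key], key)
	}
}
```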
Maybe you could use Misra-Gries on the request stream, though, for a cheap and simple probabilistic ranking of the most popular operations on popular keys? There are a few ways to improve visibility cheaply.
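As a sketch of what that could look like (names and the example operations are illustrative, not part of any existing tool):

```go
// Sketch of a Misra-Gries summary over the request stream. With k-1 counters
// it keeps every item occurring more than n/k times among its candidates,
// undercounting each by at most n/k.
package main

import "fmt"

type misraGries struct {
	k        int
	counters map[string]int
}

func newMisraGries(k int) *misraGries {
	return &misraGries{k: k, counters: make(map[string]int)}
}

func (mg *misraGries) Observe(item string) {
	if _, ok := mg.counters[item]; ok {
		mg.counters[item]++
		return
	}
	if len(mg.counters) < mg.k-1 {
		mg.counters[item] = 1
		return
	}
	// No free counter: decrement everything and drop counters that hit zero.
	for key := range mg.counters {
		mg.counters[key]--
		if mg.counters[key] == 0 {
			delete(mg.counters, key)
		}
	}
}

// Candidates returns the surviving counters; counts are lower bounds on true frequency.
func (mg *misraGries) Candidates() map[string]int { return mg.counters }

func main() {
	mg := newMisraGries(3)
	for _, op := range []string{
		"PUT /v2/keys/foo", "PUT /v2/keys/foo", "GET /v2/members",
		"PUT /v2/keys/foo", "PUT /v2/keys/21", "PUT /v2/keys/foo",
	} {
		mg.Observe(op)
	}
	fmt.Println(mg.Candidates())
}
```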
The current etcd-top tool has output that looks like:
$ go run etcd-top.go --period=1 -topk=3
Top 3 most popular http requests:
Sum Rate Verb Path
6310 243 PUT /v2/keys/foo
422 16 GET /v2/members
8 0 PUT /v2/keys/21
Top 3 slowest individual http requests:
Time Request
60.008692ms PUT /v2/keys/foo
31.642128ms PUT /v2/keys/58
31.327283ms PUT /v2/keys/2
Top 3 total time spent in requests:
Time Request
5.766000298s PUT /v2/keys/foo
207.272109ms GET /v2/members
31.642128ms PUT /v2/keys/58
Top 3 heaviest http requests:
Content-Length Request
59873 PUT /v2/keys/foo
15 PUT /v2/keys/13
15 PUT /v2/keys/30
Overall request size stats:
Total requests sniffed: 14320
Content Length Min: 0
Content Length 50th: 15
Content Length 75th: 59873
Content Length 90th: 59873
Content Length 95th: 59873
Content Length 99th: 59873
Content Length 99.9th: 59873
Content Length 99.99th: 59873
Content Length Max: 59873
I think this is adequate for simple analysis. It's super naive and may have some memory leaks, so it is not sufficient for long-term analysis (I'll be writing a separate Prometheus exporter for the Kubernetes scalability efforts that will be better for that sort of thing). Would you like me to submit a PR to add it to the tools directory, @xiang90?
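Not the exporter mentioned above, but for the record, a hedged sketch of how sniffed request sizes could be exposed via prometheus/client_golang; the metric name, labels, and port are made up for illustration.

```go
// Illustrative sketch only: expose sniffed request sizes as a Prometheus
// histogram. Metric name, labels, and port are assumptions, not an existing API.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var requestSizes = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "etcd_sniffed_request_size_bytes",
		Help:    "Content-Length of sniffed etcd client requests.",
		Buckets: prometheus.ExponentialBuckets(16, 4, 8), // 16 B .. ~256 KB
	},
	[]string{"method", "path"},
)

func init() {
	prometheus.MustRegister(requestSizes)
}

// recordRequest would be called from the packet-sniffing loop.
func recordRequest(method, path string, contentLength float64) {
	requestSizes.WithLabelValues(method, path).Observe(contentLength)
}

func main() {
	// Example observation; real values would come from the sniffer.
	recordRequest("PUT", "/v2/keys/foo", 59873)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9100", nil)
}
```

One caveat with this shape: using raw key paths as a label can explode metric cardinality on a busy cluster, so bounding it (for example with the top-K ranking above) would probably be wise.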
@spacejam Do you want to continue to improve and maintain this tool? That is not a hard requirement; I just want to be clear about it.
Also, we need to plan for the v3 API, which is gRPC based.
@xiang90 I am happy to fix bugs, but the time I can guarantee for improvement is limited. I feel that what is covered currently would be sufficient for the kind of analysis I had in mind when I opened this issue.
I don't think gRPC will be too difficult to add, though: we just need to implement request timers as a channel per socket, naive TCP stream reconstruction, and a small state machine that decodes frames. That sounds like a fun improvement that I'm happy to work on, but I can't commit to a specific timeline.
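As a rough sketch of the frame-decoding piece only (TCP reassembly and the per-socket timers are assumed to exist elsewhere), parsing the fixed 9-byte HTTP/2 frame header that gRPC rides on could look like this:

```go
// Sketch of the frame-decoding step: parse the fixed 9-byte HTTP/2 frame
// header from an already-reassembled TCP stream. Names are illustrative.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

type frameHeader struct {
	Length   uint32 // 24-bit payload length
	Type     uint8  // 0x0 DATA, 0x1 HEADERS, ...
	Flags    uint8
	StreamID uint32 // 31-bit stream identifier
}

func readFrameHeader(r io.Reader) (frameHeader, error) {
	var buf [9]byte
	if _, err := io.ReadFull(r, buf[:]); err != nil {
		return frameHeader{}, err
	}
	return frameHeader{
		Length:   uint32(buf[0])<<16 | uint32(buf[1])<<8 | uint32(buf[2]),
		Type:     buf[3],
		Flags:    buf[4],
		StreamID: binary.BigEndian.Uint32(buf[5:9]) & 0x7fffffff, // clear reserved bit
	}, nil
}

func main() {
	// Example: HEADERS frame, 4-byte payload, END_HEADERS flag set, stream 1.
	raw := []byte{0x00, 0x00, 0x04, 0x01, 0x04, 0x00, 0x00, 0x00, 0x01}
	hdr, err := readFrameHeader(bytes.NewReader(raw))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", hdr) // {Length:4 Type:1 Flags:4 StreamID:1}
}
```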
@spacejam

> I am happy to fix bugs

I would like to merge in this tool if you are willing to fix bugs.
@spacejam @xiang90 I am also looking for this kind of tool; I can help fix bugs if possible.
@spacejam Would you like to push the tool to /tools in the etcd repo?
I think @AdoHe, @raoofm and I would love to maintain this tool with you. Thanks!
We added an initial implementation in https://github.com/coreos/etcd/pull/3790 and will iterate on it.
A type of tool that has gotten me out of several production outages is a realtime workload analyzer, such as redis-faina or memkeys for memcached.
It would be nice to have such an external tool for observing the production usage patterns of an etcd server, to facilitate rapid diagnosis and workload analysis.