buchgr / bazel-remote

A remote cache for Bazel
https://bazel.build
Apache License 2.0

What is the correct way to track hit rate? #478

Open putnap opened 3 years ago

putnap commented 3 years ago

Hey,

I am fixing the remote Bazel cache for our monorepos, and I inherited monitoring dashboards that compute the hit/miss ratio incorrectly: sum(bazel_remote_disk_cache_hits + bazel_remote_http_cache_hits) / sum(bazel_remote_disk_cache_hits + bazel_remote_http_cache_hits + bazel_remote_disk_cache_misses + bazel_remote_http_cache_misses). This gives no data over time and resets whenever the server is restarted.

I am now on the latest release and looking for a correct way to track this metric. We are not using any second-layer solutions. Perhaps something like sum(rate(grpc_server_handled_total{service="cache", grpc_code="OK"}[1h])) / sum(rate(grpc_server_handled_total{service="cache"}[1h])) would work? I have no experience writing Prometheus queries that show the hit/miss percentage over time, so I don't know whether this is correct, and it's not easy to confirm that the metrics are right.

Our goal is a hit rate of at least 75%, with a warning if it drops below that.

https://github.com/buchgr/bazel-remote/pull/472#issuecomment-919758689 contains what I want to see, but from what I read, I assume custom code was added to the Docker image to produce such metrics.

Any help would be appreciated!

EDIT: What I am looking for is a stable metric that shows a real-time value I could trigger alerts from. I assume the rate period should be chosen accordingly. I am also not sure which metric is best to use.
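
To illustrate why a ratio of raw counters resets on restart while a windowed query does not, here is a minimal Python sketch. It only approximates how Prometheus's rate() and increase() treat counter resets (it ignores their extrapolation at the window edges). The function names and sample values are hypothetical; the 75% threshold is the goal stated above.

```python
from typing import Optional, Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total growth of a counter across consecutive samples in one window.

    A drop between two samples is treated as a server restart: the counter
    is assumed to have restarted from zero, so the new value itself is the
    growth for that step.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def windowed_hit_rate(hits: Sequence[float],
                      misses: Sequence[float]) -> Optional[float]:
    """Hit rate over the window, or None if there was no traffic at all."""
    h = counter_increase(hits)
    m = counter_increase(misses)
    if h + m == 0:
        return None
    return h / (h + m)


def below_threshold(rate: Optional[float], threshold: float = 0.75) -> bool:
    """True when a warning should fire; no traffic is not treated as a drop."""
    return rate is not None and rate < threshold


if __name__ == "__main__":
    # Hypothetical scrapes; the server restarts between the 2nd and 3rd sample.
    hit_samples = [100.0, 130.0, 10.0, 40.0]
    miss_samples = [50.0, 60.0, 5.0, 15.0]
    rate = windowed_hit_rate(hit_samples, miss_samples)
    print(rate, below_threshold(rate))
```

Because only the growth inside the window is counted, a restart does not wipe the history, and the window length controls how quickly the value reacts to a change.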

mostynb commented 3 years ago

Hi, unfortunately I think bazel-remote's current metrics aren't good enough for your needs. Hopefully that will change soon (my target is 2-3 weeks) - I'm working on this in #472, hopefully ending up with a setup like the custom bazel-remote metrics in that comment you linked to, but that might require some followup PRs.

I don't think checking for grpc_code="OK" will give correct results, because there are rpcs that are not related to cache hits/misses, and mixing stats for different rpc calls (Get/Put/Contains) and different entry types (ActionResult, CAS) will give fairly useless data.

One thing you can try in the meantime is client-side metrics. At the end of each build, count bazel's cache hits / misses and report those numbers and/or the hit rate somehow.

putnap commented 3 years ago

@mostynb thanks for your answer. What is the current recommended approach to tracking hit/miss ratio?

mostynb commented 3 years ago

One way that I use in .bazelci/system-test.sh is to run bazel with --execution_log_json_file and then extract the number of hits and misses from that after the build finishes. You could try something similar, then save the results somewhere that you can track over time.

I don't recommend using the code from .bazelci/system-test.sh, however: it uses a bit of a hack to find the results without properly parsing the JSON file.
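
For anyone wanting to parse the file properly instead, a rough Python sketch follows. It rests on two assumptions that should be checked against your Bazel version: that the file written by --execution_log_json_file is a stream of concatenated JSON objects (one per executed action), and that each object carries a boolean field named remoteCacheHit. The function names are hypothetical, and this is not the code used in .bazelci/system-test.sh.

```python
import json
import sys
from typing import Iterator, Tuple


def iter_json_objects(text: str) -> Iterator[dict]:
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def count_hits_and_misses(text: str) -> Tuple[int, int]:
    """Count entries with remoteCacheHit == true as hits, all others as misses.

    ASSUMPTION: the field is named "remoteCacheHit" and may be omitted when
    false. Actions that are not remotely cacheable also count as misses here.
    """
    hits = misses = 0
    for entry in iter_json_objects(text):
        if entry.get("remoteCacheHit", False):
            hits += 1
        else:
            misses += 1
    return hits, misses


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_hits.py /path/to/execution_log.json
    with open(sys.argv[1], encoding="utf-8") as f:
        h, m = count_hits_and_misses(f.read())
    print(f"hits={h} misses={m}")
```

The resulting pair can then be pushed to whatever store you already use for build statistics, so the hit rate can be tracked per build over time.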

putnap commented 3 years ago

Thank you for your answers. I was hoping for server-side tracking, so this is sad news for me and might make us look for a different solution.

mostynb commented 3 years ago

bazel-remote has some hit/miss Prometheus metrics, but all the request types (action cache, CAS) are mixed up, so the numbers aren't particularly useful right now. High hit:miss ratios are generally a good sign, maybe that's sufficient for now?

#472 will improve things soon (I hope).

putnap commented 2 years ago

@mostynb I saw that there was a new release that includes changes to the Prometheus tracking. Can you provide some documentation on how we could set up our cache metrics with the new changes?

mostynb commented 2 years ago

I think the most important metric is the action cache hit rate, so for a particular time period you'd want to count {status=="hit" && kind=="ac" && (method=="get" || method=="contains")} and divide by {kind=="ac" && (method=="get" || method=="contains")}. I haven't had time to learn how to configure Prometheus dashboards yet, but hopefully it's easy to hook this up if you already have that knowledge?

Documentation update PRs are very welcome of course (or if you can describe it for me I can add it).
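
As a starting point for such documentation, the action cache hit rate described above could be sketched in Python as follows. The label names (kind, method, status) and their values come from this comment; the metric name is deliberately left as a parameter because the thread does not state it, and the function names and sample numbers are hypothetical. The second function only builds a PromQL string in the usual sum(rate(...)) / sum(rate(...)) shape and has not been verified against a live server.

```python
from typing import Iterable, Mapping, Optional, Tuple

# One entry per label combination: (labels, counter increase over the window).
Sample = Tuple[Mapping[str, str], float]


def ac_hit_rate(samples: Iterable[Sample]) -> Optional[float]:
    """Hits divided by all action-cache get/contains requests in the window."""
    hits = total = 0.0
    for labels, increase in samples:
        if labels.get("kind") != "ac":
            continue  # ignore CAS traffic
        if labels.get("method") not in ("get", "contains"):
            continue  # ignore uploads and other methods
        total += increase
        if labels.get("status") == "hit":
            hits += increase
    return hits / total if total else None


def promql_ac_hit_rate(metric: str, window: str = "1h") -> str:
    """Build the equivalent PromQL ratio for a caller-supplied metric name."""
    base = 'kind="ac",method=~"get|contains"'
    hits = f'sum(rate({metric}{{{base},status="hit"}}[{window}]))'
    total = f'sum(rate({metric}{{{base}}}[{window}]))'
    return f"{hits} / {total}"


if __name__ == "__main__":
    window_samples = [
        ({"kind": "ac", "method": "get", "status": "hit"}, 80.0),
        ({"kind": "ac", "method": "get", "status": "miss"}, 20.0),
        ({"kind": "cas", "method": "get", "status": "hit"}, 1000.0),
        ({"kind": "ac", "method": "put", "status": "ok"}, 50.0),
    ]
    print(ac_hit_rate(window_samples))
    # "your_requests_counter" is a placeholder, not a real metric name.
    print(promql_ac_hit_rate("your_requests_counter", "1h"))
```

Keeping CAS requests and uploads out of both numerator and denominator is what separates this from the mixed-up totals discussed earlier in the thread.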