MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

compute: Add jemalloc metrics #16383

Closed. lluki closed this issue 1 year ago.

lluki commented 1 year ago

Feature request

Jemalloc does not return memory to the OS very eagerly, which leads to some anxiety when looking at a pod's memory consumption, because it does not decrease as quickly as expected. According to the Slack discussion, the consensus is that we want to expose some of the allocator stats to our metrics system.

Ideally we add this to all processes (environmentd/storaged/computed).

Which metrics, exactly?

See the jemalloc manpage for exact definitions. We most likely want the stats that we already capture in the JemallocStats struct.

Jemalloc additionally exposes the following scalars:

stats.metadata_thp (size_t)
    Number of transparent huge pages (THP) used for metadata. See stats.metadata and opt.metadata_thp for details.

stats.mapped (size_t)
    Total number of bytes in active extents mapped by the allocator. This is larger than stats.active. This does not include inactive extents, even those that contain unused dirty pages, which means that there is no strict ordering between this and stats.resident.

stats.zero_reallocs (size_t)
    Number of times that the realloc() was called with a non-NULL pointer argument and a 0 size argument. This is a fundamentally unsafe pattern in portable programs; see opt.zero_realloc for details. 

stats.background_thread.num_threads (size_t)
    Number of background threads running currently.

stats.background_thread.num_runs (uint64_t)
    Total number of runs from all background threads.

stats.background_thread.run_interval (uint64_t)
    Average run interval in nanoseconds of background threads.

There are also detailed statistics per arena and per bin, and for each mutex; these are probably less useful, and reporting every single one of them would add thousands of values.
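
As an aside, any of these scalars can in principle be read programmatically through jemalloc's mallctl interface. Below is a minimal sketch using the raw module of the tikv-jemalloc-ctl crate; it assumes jemalloc is actually the process's global allocator, and the choice of stats.mapped and stats.zero_reallocs is just for illustration:

```rust
use tikv_jemalloc_ctl::{epoch, raw};

fn main() -> Result<(), tikv_jemalloc_ctl::Error> {
    // jemalloc caches its statistics; advance the epoch so fresh values are read.
    epoch::advance()?;

    // Scalars without a typed wrapper can be read through the raw mallctl
    // interface; names must be NUL-terminated byte strings.
    let mapped: usize = unsafe { raw::read(b"stats.mapped\0")? };
    let zero_reallocs: usize = unsafe { raw::read(b"stats.zero_reallocs\0")? };

    println!("stats.mapped = {} bytes", mapped);
    println!("stats.zero_reallocs = {}", zero_reallocs);
    Ok(())
}
```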

Where do we want to see them?

Definitely Grafana; anywhere else?

Which mechanism to use?

The options are:

1. The /metrics endpoint.
2. Our internal HTTP server's endpoint (the stats are accessible via curl -H "Accept: application/json" "http://localhost:6878/prof/?action=dump_stats").
3. An introspection source in mz_internal, fetched with our Prometheus SQL exporter.
4. OpenTelemetry.
5. Any other mechanism I'm not aware of.

My gut feeling is that 1) or 2) is the right way to go. Regarding 2), I don't know how hard it would be to teach Grafana to fetch data from an additional HTTP endpoint.

3) has the benefit that customers can look at it, but it's hard to implement: we need not only per-cluster but also per-process information, and there is no infrastructure yet for exposing that type of information. It would also require different mechanics/tables for computed, environmentd, and storaged.

4) is probably not suitable.

lluki commented 1 year ago

According to https://github.com/MaterializeInc/cloud/issues/4239, @jpepin is leading the dashboard effort. Do you know how hard it would be to tell Grafana to do a curl -H "Accept: application/json" "http://localhost:6878/prof/?action=dump_stats" and then extract some fields from the returned JSON?

jpepin commented 1 year ago

@lluki - Just one point of clarification -- Grafana is a UI for metrics dashboarding that can plug into lots of data stores! The main ones we use right now are Prometheus and Tempo. So we probably want to think about what data stores or plugins we have available to us that can ingest this data.

Does option 1 (/metrics) imply that this information can be surfaced as a Prometheus metric? That would be the easiest and quickest way to get the metrics displayed in Grafana from a cloud point of view. For option 2, if there is a way we could get those metrics in Prometheus format, that would also be relatively simple to scrape with Prometheus.

If neither of those is possible, we could start looking at plugins to either convert the metrics output or do some parsing. I did a quick search and found some conversion and scraping libraries, but those would all be more work than just getting these metrics into Prometheus, so I'd like to start by ruling that out.

benesch commented 1 year ago

Agree with @jpepin! I strongly recommend that we figure out how to expose the jemalloc stats as Prometheus metrics in each process.

I suspect it's as straightforward as adding a utility function to, say, mz_prof that reads the jemalloc stats and registers them as Prometheus metrics.

And then we just need to call this method from each binary, and we're done! Prometheus will automatically start scraping the metrics from /metrics.
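
For illustration, here is a rough sketch of what such a helper could look like, built directly on the tikv-jemalloc-ctl and prometheus crates. The function name, the metric names, and the use of a plain prometheus::Registry instead of Materialize's own metrics registry are assumptions of the sketch, not the actual implementation:

```rust
use prometheus::{IntGauge, Registry};
use tikv_jemalloc_ctl::{epoch, stats};

/// Hypothetical helper: creates jemalloc gauges, registers them with the
/// given Prometheus registry, and returns a closure that refreshes them.
pub fn register_jemalloc_metrics(
    registry: &Registry,
) -> Result<impl Fn() -> Result<(), tikv_jemalloc_ctl::Error>, prometheus::Error> {
    let active = IntGauge::new("mz_jemalloc_active_bytes", "jemalloc stats.active")?;
    let allocated = IntGauge::new("mz_jemalloc_allocated_bytes", "jemalloc stats.allocated")?;
    let resident = IntGauge::new("mz_jemalloc_resident_bytes", "jemalloc stats.resident")?;
    for gauge in [&active, &allocated, &resident] {
        registry.register(Box::new(gauge.clone()))?;
    }
    Ok(move || {
        // jemalloc caches its statistics; advancing the epoch refreshes them.
        epoch::advance()?;
        active.set(stats::active::read()? as i64);
        allocated.set(stats::allocated::read()? as i64);
        resident.set(stats::resident::read()? as i64);
        Ok(())
    })
}
```

Each binary would then call this once at startup and invoke the returned closure periodically, or right before serving /metrics.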

lluki commented 1 year ago

Thanks! I think what @benesch outlined is quite easy to implement; I was just wondering if it would be even easier to use the already existing endpoint. Let's go with option 1).

lluki commented 1 year ago

After discussion with @antiguru, I still investigated option 2):

Compared to option 1), option 2) has the benefit that any stat produced by jemalloc can be accessed without code changes or restarts of any monitored process; we just need to update a config file if we want to see more stats.

We can use json_exporter, which would turn the JSON endpoint into Prometheus format.

From the cloud side, we'd need to do the following:

json_exporter needs to run alongside our other processes. The config can be static (example config below), and only one process is needed per deployment. Docker images are available.

json_exporter config: materialize-config.yml.txt
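
Since the attached file isn't inlined here, the following is only a rough sketch of what such a json_exporter config might look like. The metric names and JSONPath expressions are placeholders, as the real paths depend on the shape of the JSON returned by the /prof/?action=dump_stats endpoint; the layout follows the module-based config format of recent prometheus-community/json_exporter releases:

```yaml
modules:
  jemalloc:
    metrics:
      # Placeholder paths: adjust to the actual structure of the dumped stats.
      - name: jemalloc_active_bytes
        help: jemalloc stats.active
        path: '{ .stats.active }'
      - name: jemalloc_resident_bytes
        help: jemalloc stats.resident
        path: '{ .stats.resident }'
```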

umanwizard commented 1 year ago

I talked to Lukas a bit about this and we identified a few tradeoffs:

(1) If we use the /metrics endpoint, we will either have to export all the stats from jemalloc (which includes a ton of stuff we don't need), or maintain a list of metrics that we can only change by deploying a new version of Materialize (since we don't yet have the ability to make configuration changes between releases). Maybe LaunchDarkly will help with this; discussing here in Slack.

(2) If we use the json-exporter, we need to do the work on the Cloud side to launch that pod for each environment (CC @jpepin: do you have a sense of how much work this would be?). We would also need to figure out whether json-exporter has a way of dynamically discovering services to scrape, because it needs to be able to query new storage/compute pods as they come online.

umanwizard commented 1 year ago

@benesch pointed out something that might make option (1) a lot more attractive: you can actually allow/deny metrics in Prometheus's config itself, see https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config.

So we could publish all metrics from Materialize and then filter them down to a restricted set in Prometheus, and change that configuration without waiting for a DB release.
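For illustration, the filtering could live in metric_relabel_configs of the scrape config; the job name, target, and metric-name pattern below are placeholders:

```yaml
scrape_configs:
  - job_name: environmentd
    static_configs:
      - targets: ["environmentd:6878"]  # placeholder target
    metric_relabel_configs:
      # Drop the (hypothetical) high-cardinality per-arena jemalloc series,
      # keep all other scraped metrics.
      - source_labels: [__name__]
        regex: "jemalloc_arena_.*"
        action: drop
```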

benesch commented 1 year ago

Yeah, exactly! Seems like the best of all worlds. No additional services to run, but also don’t need to wait for a release cycle to reconfigure what metrics are scraped.


jpepin commented 1 year ago

re: (2) for posterity -- not a ton of work just to launch it, but it requires updating the environment-controller. And then it's one more service that the environment-controller would have to keep track of updating, and we'd have to monitor it for health on top of whatever service discovery it would require to do its job, etc etc. Option (1) with exporting all metrics and filtering on the Prometheus side seems much easier all around.

umanwizard commented 1 year ago

I think we can try that, and if it doesn't work (e.g. because there are so many metrics that Prometheus can't cope), then we can re-evaluate. @lluki and @antiguru , what do you think?

lluki commented 1 year ago

> I think we can try that, and if it doesn't work (e.g. because there are so many metrics that Prometheus can't cope), then we can re-evaluate. @lluki and @antiguru , what do you think?

:+1: from my side. I'm still unsure about the utility of having the per-arena stats in the metrics anyway.

lluki commented 1 year ago

The tikv jemalloc crate we are using exposes only active, allocated, metadata, resident and retained from the jemalloc stats (source).

We can:

1. use only the metrics we get from the crate,
2. get the JSON-formatted output and parse it, or
3. modify the jemalloc-ctl crate to include all the stats fields (the handy macros that wrap the unsafe access to the stats struct are unfortunately not pub).

I'm in favor of doing 1) and then seeing how far we can get with this. If we see there's more need, we can still add more later.
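
For reference, a minimal sketch of option 1), reading the five stats the crate does expose (this assumes jemalloc is actually the process's global allocator, e.g. via tikv-jemallocator):

```rust
use tikv_jemalloc_ctl::{epoch, stats};

fn main() -> Result<(), tikv_jemalloc_ctl::Error> {
    // jemalloc caches its statistics; advance the epoch so fresh values are read.
    epoch::advance()?;
    println!("active:    {} bytes", stats::active::read()?);
    println!("allocated: {} bytes", stats::allocated::read()?);
    println!("metadata:  {} bytes", stats::metadata::read()?);
    println!("resident:  {} bytes", stats::resident::read()?);
    println!("retained:  {} bytes", stats::retained::read()?);
    Ok(())
}
```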