cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

server: disk I/O metrics reflect noise by other processes outside of k8s (possibly docker?) container #51506

Open knz opened 4 years ago

knz commented 4 years ago

Reported by @bobvawter : when a node is hosted on k8s and there are other pods on the same host that perform disk I/O, the hardware dashboard in the crdb admin UI reflects the IOPS of these other processes.

This causes users to be anxious when they see excess IOPS even though it's not caused by CockroachDB proper.

This is orthogonal to #48462 .

cc @piyush-singh @ajwerner

Jira issue: CRDB-4037

ajwerner commented 4 years ago

The same can be said about all of the "Hardware" metrics we track today. They are node-level metrics. Perhaps we should be clearer about that as a first step. If the graph excluded load due to other processes, it might hide, say, another process using a bunch of resources that is adversely affecting cockroach.

If we are to address this, I think it'd be by additionally including process-level metrics as opposed to changing this to only be process-level metrics.

I'll admit that the story is more muddled in a containerized environment where there is a certain amount of isolation due to cgroups but not so much isolation that you cannot see the resource usage of other processes. Perhaps we should consider more holistically investing in cgroup-based metric collection which collects usage by reading cgroups and compares it to allocated cgroup limits. The amount of complexity in such an approach at this point is rather unfortunate.
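A minimal sketch of what "reading cgroups and comparing to allocated limits" could look like for memory. It assumes cgroup v2 with the container's own cgroup visible at `/sys/fs/cgroup` (cgroup v1 uses different files), and the function name `parseCgroupValue` is hypothetical, not something in the CockroachDB codebase:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseCgroupValue parses a single-value cgroup v2 file such as
// memory.current or memory.max. The literal "max" means "no limit"
// and is reported as limited=false.
func parseCgroupValue(contents string) (value uint64, limited bool, err error) {
	s := strings.TrimSpace(contents)
	if s == "max" {
		return 0, false, nil
	}
	v, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, false, err
	}
	return v, true, nil
}

func main() {
	// Assumption: cgroup v2 unified hierarchy mounted at this path.
	dir := "/sys/fs/cgroup"
	cur, err := os.ReadFile(filepath.Join(dir, "memory.current"))
	if err != nil {
		fmt.Println("cgroup v2 memory usage unavailable:", err)
		return
	}
	max, err := os.ReadFile(filepath.Join(dir, "memory.max"))
	if err != nil {
		fmt.Println("cgroup v2 memory limit unavailable:", err)
		return
	}
	usage, _, err := parseCgroupValue(string(cur))
	if err != nil {
		fmt.Println("bad memory.current:", err)
		return
	}
	limit, limited, err := parseCgroupValue(string(max))
	if err != nil {
		fmt.Println("bad memory.max:", err)
		return
	}
	if !limited {
		fmt.Printf("usage=%d bytes, no limit set\n", usage)
		return
	}
	fmt.Printf("usage=%d bytes, limit=%d bytes (%.1f%%)\n",
		usage, limit, 100*float64(usage)/float64(limit))
}
```

Even this small case needs special handling for the unlimited `max` value and for hosts where the files are absent, which hints at the complexity mentioned above.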

bobvawter commented 4 years ago

It's perfectly reasonable and useful to show the OS-reported stats, since other processes can adversely affect the performance of a CockroachDB instance. But just calling it "hardware" has created (at least) two wild-goose chases for CRL support, picking apart debug.zip files when it was just some other random process on the underlying VM wasting resources.

Perhaps as an easy first step, we have the hardware page display a disclaimer along the lines of

> These metrics are reported from the operating system and may include the activity from other processes.

knz commented 4 years ago

I think both of you got side-tracked into a separate topic, which incidentally is the one covered by #48462 - "The labels should indicate the metrics reflect OS-level activity, not just IOPS initiated by CockroachDB".

This new issue (#51506) is for a different one: the fact that the OS-level metrics reflect activity by other containers, not just the current one.

I think it's 100% OK for a user to have a "hardware dashboard" that reflects OS-level activity.

However it's also 100% wrong to have metrics that reflect activity not incurred by the current container. Users expect containerization to work, and have metrics reflect what the container does, not what other containers do.

> Perhaps we should consider more holistically investing in cgroup-based metric collection which collects usage by reading cgroups and compares it to allocated cgroup limits.

I'm not sure I understand this sentence. The action item here is to figure out whether cgroups also properly expose IOPS, CPU usage and memory usage stats that reflect usage by the container to the exclusion of every other container. And then once we find that, use it inside CockroachDB for everything.
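As a starting point for that investigation, here is a rough sketch of reading per-container disk I/O counters. It assumes cgroup v2, where the `io.stat` file lists cumulative `rbytes`, `wbytes`, `rios` and `wios` per block device for the current cgroup only. The type `ioCounters` and the function `parseIOStat` are hypothetical names; cgroup v1 exposes similar data through differently formatted `blkio.*` files that this sketch does not handle.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// ioCounters holds cumulative I/O counters for one block device,
// as charged to a single cgroup.
type ioCounters struct {
	ReadBytes, WriteBytes, ReadOps, WriteOps uint64
}

// parseIOStat parses the contents of a cgroup v2 io.stat file, whose
// lines look like "8:0 rbytes=1 wbytes=2 rios=3 wios=4 dbytes=0 dios=0".
// The result is keyed by the "major:minor" device identifier.
func parseIOStat(contents string) (map[string]ioCounters, error) {
	out := make(map[string]ioCounters)
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue
		}
		var c ioCounters
		for _, kv := range fields[1:] {
			key, val, ok := strings.Cut(kv, "=")
			if !ok {
				continue
			}
			var dst *uint64
			switch key {
			case "rbytes":
				dst = &c.ReadBytes
			case "wbytes":
				dst = &c.WriteBytes
			case "rios":
				dst = &c.ReadOps
			case "wios":
				dst = &c.WriteOps
			}
			if dst == nil {
				continue // ignore keys this sketch does not track
			}
			n, err := strconv.ParseUint(val, 10, 64)
			if err != nil {
				return nil, fmt.Errorf("device %s, key %s: %w", fields[0], key, err)
			}
			*dst = n
		}
		out[fields[0]] = c
	}
	return out, sc.Err()
}

func main() {
	// Assumption: cgroup v2 unified hierarchy mounted at this path.
	data, err := os.ReadFile("/sys/fs/cgroup/io.stat")
	if err != nil {
		fmt.Println("cgroup v2 io.stat unavailable:", err)
		return
	}
	stats, err := parseIOStat(string(data))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for dev, c := range stats {
		fmt.Printf("%s: %+v\n", dev, c)
	}
}
```

CPU and memory have analogous cgroup v2 files (`cpu.stat`, `memory.current`), each with its own format, so each would need its own parser.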

ajwerner commented 4 years ago

For the most part, cgroups do offer usage accounting, but the interfaces are completely different from what we currently use for hardware metrics collection. My sense is that we should collect container cgroup resource usage in addition to node-level metrics, if only because of poorly understood subtleties between the different quantities being measured.

Before throwing everything towards this new, as yet unimplemented, cgroup-based resource collection, I'd propose that we start doing the cgroup-based collection under new time-series metrics, then create a new dashboard, Container Resource Usage, and relabel the existing one to Node Resource Usage.

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

jbowens commented 1 year ago

Related to #104114.