dstackai / dstack

dstack is an open-source alternative to Kubernetes, designed to simplify development, training, and deployment of AI across any cloud or on-prem. It supports NVIDIA, AMD, and TPU accelerators.
https://dstack.ai/docs
Mozilla Public License 2.0

[Bug]: incorrect stats #1852

Closed · james-boydell closed this issue 2 weeks ago

james-boydell commented 2 weeks ago

Steps to reproduce

Compare htop to dstack stats: the reported values don't match.

Actual behaviour

No response

Expected behaviour

No response

dstack version

0.18.18

Server logs

No response

Additional information

[screenshots: htop and dstack stats output]
r4victor commented 2 weeks ago

@james-boydell, the CPU usage appears to be reported correctly: 92+100+85+100=377 (the small difference is likely due to measurement timing). The memory usage reported by dstack stats is indeed misleading. It reports the cgroup's memory.usage_in_bytes, but it would be more sensible to report working_set_memory = memory.usage_in_bytes - cache, which is what docker stats and kubectl top report. We already have this metric in the API, so I'm going to fix the CLI output. It should then be close to what top/htop report.
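
For illustration, a minimal sketch of that formula against the cgroup v1 filesystem (this is not dstack's actual code; the cgroup path is whatever directory the container's memory cgroup lives in):

```python
from pathlib import Path

def working_set_bytes(cgroup: Path) -> int:
    """working_set = usage_in_bytes - cache, per the formula above (cgroup v1)."""
    usage = int((cgroup / "memory.usage_in_bytes").read_text())
    # memory.stat is lines of "<counter> <bytes>", e.g. "cache 1234".
    stats = dict(
        line.split() for line in (cgroup / "memory.stat").read_text().splitlines()
    )
    return usage - int(stats["cache"])

# e.g. working_set_bytes(Path("/sys/fs/cgroup/memory/docker/<container-id>"))
```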

Please note that there can still be discrepancies with htop. htop's reporting may be more accurate, but it's when the working_set_memory reported by dstack stats reaches the max that the container gets OOM-killed.

See also:

r4victor commented 2 weeks ago

To sum up, after the fix, dstack stats should report memory usage the same way as docker stats and kubectl top, though that may not be the best approach:

usage_in_bytes For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the method and doesn't show 'exact' value of memory (and swap) usage, it's a fuzz value for efficient access. (Of course, when necessary, it's synchronized.) If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat(see 5.2).

https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

It remains to be seen whether there are any downsides to using RSS+CACHE(+SWAP) instead of memory.usage_in_bytes, besides being different from Kubernetes and Docker.
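
A sketch of that alternative, computing the "exact" usage from memory.stat as the kernel docs suggest (again, illustrative only, not dstack's implementation; counter names follow Documentation/cgroup-v1/memory.txt):

```python
from pathlib import Path

def exact_usage_bytes(cgroup: Path) -> int:
    """'Exact' usage per the kernel docs: RSS+CACHE(+SWAP) from memory.stat."""
    stats = {}
    for line in (cgroup / "memory.stat").read_text().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    # "swap" is only present when swap accounting is enabled, hence the default.
    return stats["rss"] + stats["cache"] + stats.get("swap", 0)
```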

james-boydell commented 2 weeks ago

Hey @r4victor, thanks for looking into this!

For memory stats, I think it's important to report them the same way the OOM killer sees them. Most people will be watching whether memory reaches the limit and the container gets killed. This will be important as you work towards issue #1780 and multiple jobs/runs land on the same node (important for SSH/on-prem fleets).

As for CPU, I don't think reporting the sum of all CPU core percentages makes sense as a single metric. If I see more than 100%, I think something is wrong. I'm unsure how you're pulling CPU metrics, and I'm more familiar with Kubernetes, but reporting usage as a percentage of the CPU limit and/or request would be more useful, or averaging the percentages across all cores.
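
A tiny sketch of that normalization (names are hypothetical): divide the summed per-core percentages by the number of allocated cores, so the screenshot's four busy cores read as roughly 94% rather than 377%.

```python
def cpu_percent_of_allocation(per_core_percents: list[float], allocated_cores: int) -> float:
    """Summed per-core busy percentages, normalized to the CPU allocation."""
    return sum(per_core_percents) / allocated_cores

print(cpu_percent_of_allocation([92, 100, 85, 100], 4))  # 94.25
```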