Closed by james-boydell 2 weeks ago
@james-boydell, so CPU usage seems to be correctly reported: 92+100+85+100=377 (the diff is likely due to measurement timing). The memory usage reported by `dstack stats` is indeed misleading. It reports the cgroup's `memory.usage_in_bytes`, but it would be more sensible to report `working_set_memory = memory.usage_in_bytes - cache`. This is what `docker stats` and `kubectl top` report. We had this metric in the API, so I'm going to fix the CLI output. It should be close to top/htop reporting.
Please note that there can still be discrepancies with htop. htop's reporting may be more accurate, but it's when the `working_set_memory` reported by `dstack stats` reaches the max that the container gets OOM-killed.
See also:
To sum up, after the fix `dstack stats` should report memory usage the same way as `docker stats` and `kubectl top`, but it may not be the best approach:

> usage_in_bytes: For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the method and doesn't show 'exact' value of memory (and swap) usage, it's a fuzz value for efficient access. (Of course, when necessary, it's synchronized.) If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat (see 5.2).

https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

It remains to be seen whether there are any downsides to using RSS+CACHE(+SWAP) instead of `memory.usage_in_bytes`, besides being different from Kubernetes and Docker.
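The cache subtraction can be sketched in Python. This is a hypothetical illustration only (the function names and the assumption that `memory.stat` contents are available as text are mine, not dstack's actual code):

```python
def parse_memory_stat(stat_text: str) -> dict:
    """Parse cgroup v1 memory.stat contents ("key value" per line) into a dict."""
    stats = {}
    for line in stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def working_set_bytes(usage_in_bytes: int, stat_text: str) -> int:
    """Working set the way docker stats / kubectl top report it:
    memory.usage_in_bytes minus the page cache."""
    cache = parse_memory_stat(stat_text).get("cache", 0)
    return max(usage_in_bytes - cache, 0)


def exact_usage_bytes(stat_text: str) -> int:
    """The more exact RSS+CACHE(+SWAP) value the kernel docs suggest."""
    stats = parse_memory_stat(stat_text)
    return stats.get("rss", 0) + stats.get("cache", 0) + stats.get("swap", 0)


# Example: 1000 bytes of usage_in_bytes, 200 of which are page cache.
sample_stat = "cache 200\nrss 700\nswap 0"
print(working_set_bytes(1000, sample_stat))  # 800
print(exact_usage_bytes(sample_stat))        # 900
```

Note that the two numbers can differ because `usage_in_bytes` is the fuzzy fast-path counter, while `memory.stat` holds the synchronized values.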
Hey @r4victor, thanks for looking into this!
For memory stats, I think it's important to report it the same way the OOM killer sees it. I think most people will be watching whether memory reaches the limit and the container gets killed. This will be important as you work towards issue #1780 and multiple jobs/runs are on the same node (important for ssh/on-prem fleets).
As for CPU, I don't think reporting the sum of all CPU core percentages makes sense as a single metric. If I see more than 100%, I think something is wrong. I'm not sure how you're pulling CPU metrics and I'm more familiar with Kubernetes, but reporting the percentage of the CPU limit and/or request would be more useful, or averaging out the percentage across all cores.
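The normalization being suggested could look like this (a hypothetical sketch; the function name is mine, and `cpu_limit_cores` is assumed to come from the job's resource limit or request):

```python
def cpu_percent_of_limit(per_core_percents, cpu_limit_cores):
    """Normalize summed per-core CPU percentages against the core limit,
    so N fully-busy cores against an N-core limit reads as 100%."""
    return sum(per_core_percents) / cpu_limit_cores


# The four cores from the report above: 92+100+85+100 = 377% summed,
# which is 94.25% of a 4-core limit.
print(cpu_percent_of_limit([92, 100, 85, 100], 4))  # 94.25
```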
Steps to reproduce
Compare `htop` to `dstack stats`; the reported values are incorrect.
Actual behaviour
No response
Expected behaviour
No response
dstack version
0.18.18
Server logs
No response
Additional information