ColinIanKing / stress-ng

This is the stress-ng upstream project git repository. stress-ng will stress test a computer system in various selectable ways. It was designed to exercise various physical subsystems of a computer as well as the various operating system kernel interfaces.
https://github.com/ColinIanKing/stress-ng
GNU General Public License v2.0

Memory Stress-ng Test Failing on a System with 881.46GB of RAM #433

Closed mreed8855 closed 1 month ago

mreed8855 commented 1 month ago

I have a system with a large amount of memory that is failing the stress-ng memory test. It passed when the amount of memory used for the test was reduced, but we typically like to stress the system with the maximum amount available. Once the memory was added back in, the same failures occurred. We tried increasing the base timeout, and with that all of the stressors passed except malloc. I am unsure whether this is an actual bug, where the system resources cannot keep up, or whether we are being too aggressive with the test case.

CPU: AMD EPYC 9754 128-Core Processor (Bergamo)
Mem: 881 GB
OS: Ubuntu 22.04.4, 5.15 kernel

Steps to Reproduce:

sudo add-apt-repository ppa:checkbox-dev/stable
sudo apt install canonical-certification-server
sudo /usr/lib/checkbox-provider-base/bin/stress_ng_test.py memory

mreed8855 commented 1 month ago

Initially, the mlock, mremap, shm-sysv, vm-splice, numa and malloc stressors failed:

03 Sep 07:34: Running stress-ng mlock stressor for 300 seconds...
** stress-ng timed out and was forcefully terminated

03 Sep 11:50: Running stress-ng mremap stressor for 300 seconds...
** stress-ng timed out and was forcefully terminated

03 Sep 12:00: Running stress-ng shm-sysv stressor for 300 seconds...
** stress-ng timed out and was forcefully terminated

03 Sep 12:10: Running stress-ng vm-splice stressor for 300 seconds...
** stress-ng timed out and was forcefully terminated

03 Sep 12:20: Running stress-ng numa stressor for 300 seconds...
** stress-ng timed out and was forcefully terminated

03 Sep 12:30: Running stress-ng malloc stressor for 9115 seconds...
** stress-ng timed out and was forcefully terminated

However, after doubling, tripling and quadrupling the 300 second timeout, malloc is the only stressor that still has an issue.

mreed8855 commented 1 month ago

After increasing the timeout:

02 Sep 12:30: Running stress-ng malloc stressor for 9115 seconds...
** stress-ng exited with code 3
stress-ng: info: [964793] setting to a 2 hours, 31 mins, 54 secs run per stressor
stress-ng: info: [964793] dispatching hogs: 512 malloc
stress-ng: info: [965806] malloc: failed to create counter lock. skipping stressor
stress-ng: info: [965809] malloc: failed to create counter lock. skipping stressor
stress-ng: info: [965811] malloc: failed to create counter lock. skipping stressor
stress-ng: info: [965810] malloc: failed to create counter lock. skipping stressor
stress-ng: info: [965812] malloc: failed to create counter lock. skipping stressor
stress-ng: warn: [964793] malloc: [965809] aborted early, out of system resources
stress-ng: warn: [964793] malloc: [965810] aborted early, out of system resources
stress-ng: warn: [964793] malloc: [965811] aborted early, out of system resources
stress-ng: warn: [964793] malloc: [965812] aborted early, out of system resources
stress-ng: info: [964793] skipped: 4: malloc (4)
stress-ng: info: [964793] passed: 507: malloc (507)
stress-ng: info: [964793] failed: 0
stress-ng: info: [964793] metrics untrustworthy: 0
stress-ng: info: [964793] successful run completed in 2 hours, 31 mins, 54.52 secs
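
For anyone triaging logs like the one above, a minimal Python sketch for pulling out the summary counters may help. It only relies on the `skipped:`, `passed:`, `failed:` and `metrics untrustworthy:` lines shown in the log. The names `summarize` and `only_resource_skips` are hypothetical helpers, not part of stress-ng or checkbox, and the reading of exit code 3 as "stressors could not initialise for lack of resources" is taken from the stress-ng manual as best understood here, so treat it as an assumption.

```python
import re

# Matches the summary lines printed at the end of a stress-ng run, e.g.
# "stress-ng: info: [964793] skipped: 4: malloc (4)".
SUMMARY_RE = re.compile(
    r"stress-ng: info:\s+\[\d+\]\s+"
    r"(skipped|passed|failed|metrics untrustworthy):\s+(\d+)"
)


def summarize(log_text):
    """Return a dict of summary counters found in the log text."""
    return {m.group(1): int(m.group(2)) for m in SUMMARY_RE.finditer(log_text)}


def only_resource_skips(exit_code, counts):
    """True if the run exited with code 3 and nothing actually failed.

    Assumption: exit code 3 means some stressors were skipped because a
    resource could not be obtained, rather than a stressor failing.
    """
    return (
        exit_code == 3
        and counts.get("failed", 0) == 0
        and counts.get("skipped", 0) > 0
    )


if __name__ == "__main__":
    sample = (
        "stress-ng: info: [964793] skipped: 4: malloc (4)\n"
        "stress-ng: info: [964793] passed: 507: malloc (507)\n"
        "stress-ng: info: [964793] failed: 0\n"
        "stress-ng: info: [964793] metrics untrustworthy: 0\n"
    )
    counts = summarize(sample)
    print(counts)
    print(only_resource_skips(3, counts))
```

In this run the counters show skipped instances and no failed ones, which is consistent with the "out of system resources" warnings rather than with a malloc stressor failure.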

mreed8855 commented 1 month ago

stress_ng_test.txt: initial stress-ng memory test run

mreed8855 commented 1 month ago

stress_ng_test-4.txt: stress-ng memory test run with increased base timeout

ColinIanKing commented 1 month ago

Which version of stress-ng is being used? Use stress-ng -V to show the version.

mreed8855 commented 1 month ago

Here is the package version from the submission file. I am waiting on the output of that command.

stress-ng 0.18.01-0~202407131132~ubuntu22.04.1

ColinIanKing commented 1 month ago

I'd recommend using the latest version; I've fixed a few bugs with VM size measurement in the last 6 months. More recent versions are available in my PPA: ppa:colin-king/stress-ng

See https://launchpad.net/~colin-king/+archive/ubuntu/stress-ng

ColinIanKing commented 1 month ago

The "malloc: failed to create counter lock. skipping stressor" message appears because there are many instances of this stressor and each one creates a counter lock. Older versions of stress-ng use a page per lock, and that allocation may fail as new stressor instances are created. The latest version of stress-ng creates a shared page for all the locks, so there is an upper limit of 512 active locks (this is probably too low; I probably need to provide at least 4K concurrent locks).

I've pushed a fix to bump the number of concurrent locks to 2 x max number of stressor instances:


commit 95062984882b5fcec84e541e686222da9b6a20a6 (HEAD -> master, origin/master, origin/HEAD)
Author: Colin Ian King <colin.i.king@gmail.com>
Date:   Thu Oct 3 18:50:02 2024 +0100

    core-lock: increase number of concurrent locks to 2 * STRESS_PROCS_MAX
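
To illustrate why a fixed lock limit can lead to skipped instances, here is a toy Python model of a fixed-capacity pool of lock slots. It is not stress-ng's code (which is C and uses a shared page); `LockPool`, `count_skipped` and the `reserved` parameter are hypothetical names, and the figure of 8 slots already held by something else is made up purely for illustration.

```python
class LockPool:
    """Toy model of a fixed-capacity pool of lock slots (not stress-ng code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = 0

    def acquire(self):
        """Return True if a slot was free, False if the pool is exhausted."""
        if self.in_use >= self.capacity:
            return False
        self.in_use += 1
        return True

    def release(self):
        """Give one slot back to the pool."""
        if self.in_use > 0:
            self.in_use -= 1


def count_skipped(instances, capacity, reserved=0):
    """Count instances that cannot get a lock slot.

    `reserved` models slots already held elsewhere (an assumption made
    only so the pool can run out in this sketch).
    """
    pool = LockPool(capacity)
    for _ in range(reserved):
        pool.acquire()
    return sum(0 if pool.acquire() else 1 for _ in range(instances))


if __name__ == "__main__":
    # 512 instances, 512 slots, 8 slots already taken: 8 instances are skipped.
    print(count_skipped(instances=512, capacity=512, reserved=8))
    # Doubling the capacity leaves headroom, so nothing is skipped.
    print(count_skipped(instances=512, capacity=1024, reserved=8))
```

The point of the sketch is only that a pool sized exactly to the number of instances has no slack, so any additional lock user pushes some instances over the limit, while a pool twice that size absorbs it.
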
mreed8855 commented 1 month ago

Thanks for the feedback, I will have them try the latest version.

ColinIanKing commented 1 month ago

Did using a newer version resolve the issue?

mreed8855 commented 1 month ago

With the new version this issue is still being seen.

apt-cache policy stress-ng
stress-ng:
  Installed: 0.18.05-1~j0

25 Oct 05:45: Running stress-ng malloc stressor for 7222 seconds...
** stress-ng exited with code 3
stress-ng: info: [2351326] setting to a 2 hours, 22 secs run per stressor
stress-ng: info: [2351326] dispatching hogs: 512 malloc
stress-ng: info: [2352338] malloc: failed to create counter lock. skipping stressor
stress-ng: info: [2352340] malloc: failed to create counter lock. skipping stressor
stress-ng: info: [2352342] malloc: failed to create counter lock. skipping stressor
stress-ng: info: [2352344] malloc: failed to create counter lock. skipping stressor
stress-ng: info: [2352345] malloc: failed to create counter lock. skipping stressor
stress-ng: warn: [2351326] malloc: [2352340] aborted early, out of system resources
stress-ng: warn: [2351326] malloc: [2352342] aborted early, out of system resources
stress-ng: warn: [2351326] malloc: [2352344] aborted early, out of system resources
stress-ng: warn: [2351326] malloc: [2352345] aborted early, out of system resources
stress-ng: info: [2351326] skipped: 4: malloc (4)
stress-ng: info: [2351326] passed: 507: malloc (507)
stress-ng: info: [2351326] failed: 0
stress-ng: info: [2351326] metrics untrustworthy: 0
stress-ng: info: [2351326] successful run completed in 2 hours, 22.25 secs

ColinIanKing commented 1 month ago

I'll be releasing V0.18.06 next week; it will incorporate the following fix, which will fully address this issue:

commit 95062984882b5fcec84e541e686222da9b6a20a6
Author: Colin Ian King <colin.i.king@gmail.com>
Date:   Thu Oct 3 18:50:02 2024 +0100

    core-lock: increase number of concurrent locks to 2 * STRESS_PROCS_MAX
ColinIanKing commented 1 month ago

This has been fixed in stress-ng release V0.18.06
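
To check whether an installed package already carries this fix, a small Python sketch can compare the upstream part of the package version against 0.18.06. It assumes the version string starts with three dot-separated numbers, as in the package versions quoted earlier in this thread; `upstream_version` and `has_lock_fix` are hypothetical helper names.

```python
import re

# First release stated in this thread to contain the core-lock fix.
FIXED_IN = (0, 18, 6)


def upstream_version(package_version):
    """Extract (major, minor, patch) from a string such as '0.18.05-1~j0'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", package_version)
    if m is None:
        raise ValueError(f"unrecognised version string: {package_version!r}")
    return tuple(int(part) for part in m.groups())


def has_lock_fix(package_version):
    """True if the upstream version is at least 0.18.06."""
    return upstream_version(package_version) >= FIXED_IN


if __name__ == "__main__":
    for version in (
        "0.18.01-0~202407131132~ubuntu22.04.1",
        "0.18.05-1~j0",
        "0.18.06",
    ):
        print(version, has_lock_fix(version))
```

Both package versions reported earlier in the thread (0.18.01 and 0.18.05) compare as older than 0.18.06, so neither would include the fix by this check.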