canonical / checkbox

Checkbox
https://checkbox.readthedocs.io
GNU General Public License v3.0

Memory Stress-ng Test Failing on a System with 881.46GB of RAM #1508

Open mreed8855 opened 3 hours ago

mreed8855 commented 3 hours ago

Bug Description

I have a system with a large amount of memory that fails the stress-ng memory test. It did pass when the amount of memory used by the test was reduced. I am unsure whether this is an actual bug or whether the test case is too aggressive.

To Reproduce

sudo add-apt-repository ppa:checkbox-dev/stable
sudo apt install canonical-certification-server
sudo /usr/lib/checkbox-provider-base/bin/stress_ng_test.py memory

Environment

- Noble
- CPU: AMD EPYC 9754 128-Core Processor
- Mem: 881 GB

Relevant log output

No response

Additional context

Cisco Systems - C245 M8 https://certification.canonical.com/certificates/2407-15946/

I have the same issue open against stress-ng: https://bugs.launchpad.net/stress-ng/+bug/2082743

No response

syncronize-issues-to-jira[bot] commented 3 hours ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/CHECKBOX-1591.

This message was autogenerated

mreed8855 commented 3 hours ago

Initially, the mlock, mremap, shm-sysv, vm-splice, numa, and malloc stressors failed:

03 Sep 07:34: Running stress-ng mlock stressor for 300 seconds...
** stress-ng timed out and was forcefully terminated

03 Sep 11:50: Running stress-ng mremap stressor for 300 seconds...
** stress-ng timed out and was forcefully terminated

03 Sep 12:00: Running stress-ng shm-sysv stressor for 300 seconds...
** stress-ng timed out and was forcefully terminated

03 Sep 12:10: Running stress-ng vm-splice stressor for 300 seconds...
** stress-ng timed out and was forcefully terminated

03 Sep 12:20: Running stress-ng numa stressor for 300 seconds...
** stress-ng timed out and was forcefully terminated

03 Sep 12:30: Running stress-ng malloc stressor for 9115 seconds...
** stress-ng timed out and was forcefully terminated

However, after doubling, tripling, and quadrupling the 300-second timeout, malloc and mlock are the only stressors that still have issues.
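The "timed out and was forcefully terminated" messages suggest an outer time limit wrapped around each stress-ng run. The following is a minimal sketch of that general pattern, not the actual code in stress_ng_test.py; the function name `run_with_timeout` and its return values are hypothetical and chosen only for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd with an outer time limit (hypothetical helper).

    Returns a (status, returncode) tuple where status is
    "ok", "failed", or "timeout".
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if the command does not finish within timeout_s seconds.
        proc = subprocess.run(
            cmd, timeout=timeout_s, capture_output=True, text=True
        )
    except subprocess.TimeoutExpired:
        return ("timeout", None)
    return ("ok" if proc.returncode == 0 else "failed", proc.returncode)


if __name__ == "__main__":
    # Stand-in child process; a real run would pass a stress-ng command line.
    sleeper = [sys.executable, "-c", "import time; time.sleep(1)"]
    print(run_with_timeout(sleeper, 0.2))
```

With a wrapper of this shape, raising the limit only helps if the stressor eventually exits on its own, which would be consistent with most stressors passing once the timeout was increased.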

mreed8855 commented 3 hours ago

After increasing the timeout

02 Sep 12:30: Running stress-ng malloc stressor for 9115 seconds...
** stress-ng exited with code 3
stress-ng: info: [964793] setting to a 2 hours, 31 mins, 54 secs run per stressor
stress-ng: info: [964793] dispatching hogs: 512 malloc
stress-ng: info: [965806] malloc: failed to create counter lock. skipping stressor
stress-ng: info: [965809] malloc: failed to create counter lock. skipping stressor
stress-ng: info: [965811] malloc: failed to create counter lock. skipping stressor
stress-ng: info: [965810] malloc: failed to create counter lock. skipping stressor
stress-ng: info: [965812] malloc: failed to create counter lock. skipping stressor
stress-ng: warn: [964793] malloc: [965809] aborted early, out of system resources
stress-ng: warn: [964793] malloc: [965810] aborted early, out of system resources
stress-ng: warn: [964793] malloc: [965811] aborted early, out of system resources
stress-ng: warn: [964793] malloc: [965812] aborted early, out of system resources
stress-ng: info: [964793] skipped: 4: malloc (4)
stress-ng: info: [964793] passed: 507: malloc (507)
stress-ng: info: [964793] failed: 0
stress-ng: info: [964793] metrics untrustworthy: 0
stress-ng: info: [964793] successful run completed in 2 hours, 31 mins, 54.52 secs
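When comparing runs like the one above, it can help to pull the skipped/passed/failed totals out of the log instead of reading them by eye. This is a small sketch that assumes the summary lines keep the format shown in this log; the names `SUMMARY_RE` and `summarize` are hypothetical, and this is not part of Checkbox or stress-ng.

```python
import re

# Matches summary lines such as:
#   stress-ng: info: [964793] passed: 507: malloc (507)
# Per-stressor messages ("malloc: failed to create counter lock") do not
# match, because the keyword must directly follow the bracketed PID.
SUMMARY_RE = re.compile(
    r"stress-ng: info:\s+\[\d+\]\s+(skipped|passed|failed):\s+(\d+)"
)


def summarize(log_text):
    """Return a dict of the skipped/passed/failed totals found in log_text."""
    return {name: int(value) for name, value in SUMMARY_RE.findall(log_text)}


if __name__ == "__main__":
    sample = (
        "stress-ng: info: [965806] malloc: failed to create counter lock. skipping stressor\n"
        "stress-ng: info: [964793] skipped: 4: malloc (4)\n"
        "stress-ng: info: [964793] passed: 507: malloc (507)\n"
        "stress-ng: info: [964793] failed: 0\n"
    )
    print(summarize(sample))
```

For the malloc run above, this reports 4 skipped, 507 passed, and 0 failed, even though the wrapper recorded a non-zero exit code.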

mreed8855 commented 3 hours ago

After increasing the timeout

19 Sep 17:45: Running stress-ng mmap stressor for 30527 seconds...
stress-ng: info: [1565947] setting to a 8 hours, 28 mins, 46 secs run per stressor
stress-ng: info: [1565947] dispatching hogs: 192 mmap
stress-ng: warn: [1565948] cannot terminate process 1565956, gave up after 120 seconds
stress-ng: warn: [1565953] cannot terminate process 1565965, gave up after 120 seconds
...
stress-ng: warn: [1566323] cannot terminate process 1566328, gave up after 120 seconds
stress-ng: warn: [1566326] cannot terminate process 1566331, gave up after 120 seconds
stress-ng: info: [1565947] skipped: 0
stress-ng: info: [1565947] passed: 191: mmap (191)
stress-ng: info: [1565947] failed: 0
stress-ng: info: [1565947] metrics untrustworthy: 0
stress-ng: info: [1565947] successful run completed in 8 hours, 28 mins, 52.11 secs

mreed8855 commented 3 hours ago

I have seen this test pass on a system with 1 TB of RAM: https://certification.canonical.com/hardware/202202-29934/submission/396782/