cloud-bulldozer / benchmark-wrapper

Python Library to run benchmarks
https://benchmark-wrapper.readthedocs.io
Apache License 2.0

Uperf test fails but does not exit with error #399

Closed dry923 closed 2 years ago

dry923 commented 2 years ago

When uperf runs and fails, the execution does not return an error code. Take the output below as an example: the test failed with CRITICAL errors, yet the main snafu process returned a 0 return code. As a result, the benchmark-operator continues on and any automation reads the run as a success.

...
2021-12-03T15:04:15Z - INFO     - MainProcess - _benchmark: Running setup tasks.
2021-12-03T15:04:15Z - INFO     - MainProcess - _benchmark: Collecting results from benchmark.
2021-12-03T15:04:15Z - INFO     - MainProcess - uperf: Collecting 3 samples of Uperf
2021-12-03T15:04:15Z - INFO     - MainProcess - process: Collecting 3 samples of command ['uperf', '-v', '-a', '-R', '-i', '1', '-m', '/tmp/uperf-test/uperf-stream-tcp-64-64-1']
2021-12-03T15:06:26Z - WARNING  - MainProcess - process: Got bad return code from command: ['uperf', '-v', '-a', '-R', '-i', '1', '-m', '/tmp/uperf-test/uperf-stream-tcp-64-64-1'].
2021-12-03T15:08:37Z - WARNING  - MainProcess - process: Got bad return code from command: ['uperf', '-v', '-a', '-R', '-i', '1', '-m', '/tmp/uperf-test/uperf-stream-tcp-64-64-1'].
2021-12-03T15:10:48Z - WARNING  - MainProcess - process: Got bad return code from command: ['uperf', '-v', '-a', '-R', '-i', '1', '-m', '/tmp/uperf-test/uperf-stream-tcp-64-64-1'].
2021-12-03T15:10:48Z - CRITICAL - MainProcess - process: After 3 attempts, unable to run command: ['uperf', '-v', '-a', '-R', '-i', '1', '-m', '/tmp/uperf-test/uperf-stream-tcp-64-64-1']
2021-12-03T15:10:48Z - WARNING  - MainProcess - process: Sample 1 has failed state for command ['uperf', '-v', '-a', '-R', '-i', '1', '-m', '/tmp/uperf-test/uperf-stream-tcp-64-64-1']
2021-12-03T15:10:48Z - CRITICAL - MainProcess - uperf: Uperf failed to run! Got results: ProcessSample(expected_rc=0, success=False, attempts=3, timeout=None, failed=[ProcessRun(rc=1, stdout='Error getting SSL CTX:1\nAllocating shared memory of size 156624 bytes\nError connecting to 10.0.128.2\n\n** TCP: Cannot connect to 10.0.128.2:20000 Connection timed out\n', stderr='', time_seconds=130.935195, hit_timeout=False), ProcessRun(rc=1, stdout='Error getting SSL CTX:1\nAllocating shared memory of size 156624 bytes\nError connecting to 10.0.128.2\n\n** TCP: Cannot connect to 10.0.128.2:20000 Connection timed out\n', stderr='', time_seconds=131.074382, hit_timeout=False), ProcessRun(rc=1, stdout='Error getting SSL CTX:1\nAllocating shared memory of size 156624 bytes\nError connecting to 10.0.128.2\n\n** TCP: Cannot connect to 10.0.128.2:20000 Connection timed out\n', stderr='', time_seconds=131.068473, hit_timeout=False)], successful=None)
2021-12-03T15:10:48Z - INFO     - MainProcess - _benchmark: Cleaning up
2021-12-03T15:10:48Z - INFO     - MainProcess - run_snafu: Indexed results - 0 success, 0 duplicates, 0 failures, with 0 retries.
2021-12-03T15:10:48Z - INFO     - MainProcess - run_snafu: Duration of execution - 0:06:33, with total size of 0 bytes
1

# oc get pods
NAME                                                           READY   STATUS    RESTARTS   AGE
benchmark-controller-manager-7c9ff9796b-d4pjn                  2/2     Running   0          9m33s
uperf-client-10.0.128.2-1338d118--1-hscdj                      1/1     Running   0          7m12s
uperf-server-dry-gcp2-6c86n-worker-a-rnh-0-1338d118--1-qtl9g   1/1     Running   0          7m43s

...

# oc get pods
NAME                                            READY   STATUS      RESTARTS   AGE
benchmark-controller-manager-7c9ff9796b-d4pjn   2/2     Running     0          16m
uperf-client-10.0.128.2-1338d118--1-hscdj       0/1     Completed   0          14m

We capture the failure here (https://github.com/cloud-bulldozer/benchmark-wrapper/blob/b9ddc342647ab994a87353aa7721147101f99428/snafu/benchmarks/uperf/uperf.py#L331) but never do anything with the bad result except log it.
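One possible direction, sketched below, is to remember that a sample failed and turn that into a non-zero exit status once collection finishes. This is a minimal, self-contained illustration and not the wrapper's actual code: `Sample`, `BenchmarkFailed`, `collect`, and `main` are hypothetical names invented here, and `Sample` only mimics the `success` and `attempts` fields visible in the `ProcessSample` log line above.

```python
import logging
from dataclasses import dataclass
from typing import Iterable, Iterator

logger = logging.getLogger("snafu-sketch")


@dataclass
class Sample:
    """Hypothetical stand-in for a process sample (not the wrapper's real class)."""

    success: bool
    attempts: int = 1


class BenchmarkFailed(RuntimeError):
    """Hypothetical exception raised when at least one sample failed."""


def collect(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Yield successful samples; raise after iteration if any sample failed."""
    failed = 0
    for number, sample in enumerate(samples, start=1):
        if not sample.success:
            # Log the failure, but also remember it instead of dropping it.
            logger.critical(
                "Sample %d failed after %d attempts", number, sample.attempts
            )
            failed += 1
            continue
        yield sample
    if failed:
        raise BenchmarkFailed(f"{failed} sample(s) failed")


def main(samples: Iterable[Sample]) -> int:
    """Return a process exit status: 0 on success, 1 if any sample failed."""
    try:
        results = list(collect(samples))
    except BenchmarkFailed as exc:
        logger.critical("Benchmark failed: %s", exc)
        return 1
    logger.info("Collected %d results", len(results))
    return 0
```

An entry point would then call something like `sys.exit(main(samples))`, so the client pod ends in an error state rather than `Completed` and the operator or any automation can see the failure. Whether to abort on the first failed sample or after all samples is a design choice; this sketch finishes the loop first so every failure gets logged.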

dry923 commented 2 years ago

cc @morenod: this is the reason our hostnetwork tests pass even though they should fail.

vishnuchalla commented 2 years ago

Related PRs