autowarefoundation / autoware.universe

https://autowarefoundation.github.io/autoware.universe/
Apache License 2.0
882 stars 570 forks source link

flaky test: test_fault_injection_node.test.py #7772

Open xmfcx opened 1 week ago

xmfcx commented 1 week ago

Checklist

Description

https://github.com/autowarefoundation/autoware.universe/actions/runs/9743179438/job/26886066688#step:16:11812

It has been passing successfully for a long while but it failed randomly here.

FAIL: test_receive_multiple_message_simultaneously (fault_injection.TestFaultInjectionLink)

2: Traceback (most recent call last):
2:   File "/__w/autoware.universe/autoware.universe/simulator/fault_injection/test/test_fault_injection_node.test.py", line 175, in test_receive_multiple_message_simultaneously
2:     self.assertEqual(stat.level, DiagnosticStatus.ERROR)
2: AssertionError: b'\x00' != b'\x02'

Failure line: https://github.com/autowarefoundation/autoware.universe/blob/c1dbd5bf938fbdd9e91505cc5344020fbf4ea752/simulator/fault_injection/test/test_fault_injection_node.test.py#L175

History: image

cc. @KeisukeShima @TomohitoAndo

Expected behavior

Test should have passed.

Actual behavior

But it failed.

Steps to reproduce

colcon test --event-handlers console_cohesion+ --mixin coverage-pytest --packages-select fault_injection

This command passed on my machine and also passed in CI many times.

Update: See this comment: https://github.com/autowarefoundation/autoware.universe/issues/7772#issuecomment-2200020894

But it failed in this one instance here.

Versions

No response

Possible causes

Maybe we can increase evaluation_time from 0.5s to 2s to compensate for slow test environment.

https://github.com/autowarefoundation/autoware.universe/blob/c1dbd5bf938fbdd9e91505cc5344020fbf4ea752/simulator/fault_injection/test/test_fault_injection_node.test.py#L70

Additional context

No response

xmfcx commented 1 week ago

Ok I can reproduce it too:

sudo apt install stress

Terminal 1

stress -c 24 -m 200

(I have 12c24t CPU and 64GB RAM, adjust to your system)

Terminal 2

colcon test --event-handlers console_cohesion+ --mixin coverage-pytest --packages-select fault_injection

🖱️Click here to expand🔛 ```console mfc@mfc-leo:~/projects/autoware$ colcon test --event-handlers console_cohesion+ --mixin coverage-pytest --packages-select fault_injection Starting >>> fault_injection [37.076s] WARNING:colcon.colcon_ros.prefix_path.ament:The path '/home/mfc/projects/autoware/install/lidar_centerpoint_tvm' in the environment variable AMENT_PREFIX_PATH doesn't exist [37.155s] WARNING:colcon.colcon_ros.prefix_path.catkin:The path '/home/mfc/projects/autoware/install/lidar_centerpoint_tvm' in the environment variable CMAKE_PREFIX_PATH doesn't exist [Processing: fault_injection] --- output: fault_injection UpdateCTestConfiguration from :/home/mfc/projects/autoware/build/fault_injection/CTestConfiguration.ini Parse Config file:/home/mfc/projects/autoware/build/fault_injection/CTestConfiguration.ini Site: mfc-leo Build name: (empty) Add coverage exclude regular expressions. Create new tag: 20240701-1226 - Experimental UpdateCTestConfiguration from :/home/mfc/projects/autoware/build/fault_injection/CTestConfiguration.ini Parse Config file:/home/mfc/projects/autoware/build/fault_injection/CTestConfiguration.ini Test project /home/mfc/projects/autoware/build/fault_injection Constructing a list of tests Done constructing a list of tests Updating test list for fixtures Added 0 tests to meet fixture requirements Checking test dependency graph... Checking test dependency graph end test 1 Start 1: test_fault_injection_node_component 1: Test command: /usr/bin/python3.10 "-u" "/opt/ros/humble/share/ament_cmake_ros/cmake/run_test_isolated.py" "/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/test_fault_injection_node_component.gtest.xml" "--package-name" "fault_injection" "--output-file" "/home/mfc/projects/autoware/build/fault_injection/ament_cmake_gtest/test_fault_injection_node_component.txt" "--command" "/home/mfc/projects/autoware/build/fault_injection/test_fault_injection_node_component" "--gtest_output=xml:/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/test_fault_injection_node_component.gtest.xml" 1: Test timeout computed to be: 60 1: Running with ROS_DOMAIN_ID 1 1: -- run_test.py: invoking following command in '/home/mfc/projects/autoware/build/fault_injection': 1: - /home/mfc/projects/autoware/build/fault_injection/test_fault_injection_node_component --gtest_output=xml:/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/test_fault_injection_node_component.gtest.xml 1: [==========] Running 2 tests from 1 test suite. 1: [----------] Global test environment set-up. 1: [----------] 2 tests from TestDiagnosticStorage 1: [ RUN ] TestDiagnosticStorage.call_update_with_correct_key 1: [ OK ] TestDiagnosticStorage.call_update_with_correct_key (0 ms) 1: [ RUN ] TestDiagnosticStorage.raise_exception_with_wrong_key 1: [ OK ] TestDiagnosticStorage.raise_exception_with_wrong_key (0 ms) 1: [----------] 2 tests from TestDiagnosticStorage (0 ms total) 1: 1: [----------] Global test environment tear-down 1: [==========] 2 tests from 1 test suite ran. (1 ms total) 1: [ PASSED ] 2 tests. 1: -- run_test.py: return code 0 1: -- run_test.py: inject classname prefix into gtest result file '/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/test_fault_injection_node_component.gtest.xml' 1: -- run_test.py: verify result file '/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/test_fault_injection_node_component.gtest.xml' 1/6 Test #1: test_fault_injection_node_component ... Passed 5.34 sec test 2 Start 2: test_fault_injection_launch_test 2: Test command: /usr/bin/python3.10 "-u" "/home/mfc/projects/autoware/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/test_fault_injection_launch_test.xunit.xml" "--package-name" "fault_injection" "--output-file" "/home/mfc/projects/autoware/build/fault_injection/launch_test/test_fault_injection_launch_test.txt" "--command" "/usr/bin/python3" "-m" "launch_testing.launch_test" "/home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection/test/test_fault_injection_node.test.py" "--junit-xml=/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/test_fault_injection_launch_test.xunit.xml" "--package-name=fault_injection" 2: Test timeout computed to be: 60 2: -- run_test.py: invoking following command in '/home/mfc/projects/autoware/build/fault_injection': 2: - /usr/bin/python3 -m launch_testing.launch_test /home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection/test/test_fault_injection_node.test.py --junit-xml=/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/test_fault_injection_launch_test.xunit.xml --package-name=fault_injection 2: [INFO] [launch]: All log files can be found below /home/mfc/.ros/log/2024-07-01-15-26-50-491627-mfc-leo-3958620 2: [INFO] [launch]: Default logging verbosity is set to INFO 2: test_node_link (fault_injection.TestFaultInjectionLink) 2: Test node linkage. ... [INFO] [fault_injection_node-1]: process started with pid [3959103] 2: [fault_injection_node-1] [DEBUG 1719836817.466051133] [fault_injection]: Parameter: cpu_load_average, value: : CPU Load Average (read_event_diag_list() at /home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection/src/fault_injection_node/fault_injection_node.cpp:99) 2: [fault_injection_node-1] [DEBUG 1719836817.468677231] [fault_injection]: Parameter: cpu_temperature, value: : CPU Temperature (read_event_diag_list() at /home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection/src/fault_injection_node/fault_injection_node.cpp:99) 2: [fault_injection_node-1] [DEBUG 1719836817.469807464] [fault_injection]: Parameter: cpu_usage, value: : CPU Usage (read_event_diag_list() at /home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection/src/fault_injection_node/fault_injection_node.cpp:99) 2: FAIL 2: test_receive_multiple_message_simultaneously (fault_injection.TestFaultInjectionLink) 2: Test for https://github.com/autowarefoundation/autoware.universe/pull/4042. ... [fault_injection_node-1] [DEBUG 1719836818.216154680] [fault_injection]: Received data: stamp: 2: [fault_injection_node-1] sec: 0 2: [fault_injection_node-1] nanosec: 0 2: [fault_injection_node-1] fault_injection_events: 2: [fault_injection_node-1] - 2: [fault_injection_node-1] level: 2 2: [fault_injection_node-1] name: "cpu_temperature" 2: [fault_injection_node-1] (on_simulation_events() at /home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection/src/fault_injection_node/fault_injection_node.cpp:67) 2: [fault_injection_node-1] [DEBUG 1719836818.216765752] [fault_injection]: Received data: stamp: 2: [fault_injection_node-1] sec: 0 2: [fault_injection_node-1] nanosec: 0 2: [fault_injection_node-1] fault_injection_events: 2: [fault_injection_node-1] - 2: [fault_injection_node-1] level: 2 2: [fault_injection_node-1] name: "cpu_usage" 2: [fault_injection_node-1] (on_simulation_events() at /home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection/src/fault_injection_node/fault_injection_node.cpp:67) 2: ok 2: 2: ====================================================================== 2: FAIL: test_node_link (fault_injection.TestFaultInjectionLink) 2: Test node linkage. 2: ---------------------------------------------------------------------- 2: Traceback (most recent call last): 2: File "/home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection/test/test_fault_injection_node.test.py", line 128, in test_node_link 2: self.assertGreaterEqual(self.get_num_valid_data(msg_buffer, DiagnosticStatus.ERROR), 1) 2: AssertionError: 0 not greater than or equal to 1 2: 2: ---------------------------------------------------------------------- 2: Ran 2 tests in 2.715s 2: 2: FAILED (failures=1) 2: [INFO] [fault_injection_node-1]: sending signal 'SIGINT' to process[fault_injection_node-1] 2: [fault_injection_node-1] [INFO 1719836818.978081363] [rclcpp]: signal_handler(signum=2) (signal_handler() at ./src/rclcpp/signal_handler.cpp:71) 2: [INFO] [fault_injection_node-1]: process has finished cleanly [pid 3959103] 2: test_exit_code (fault_injection.TestProcessOutput) 2: Test process exit code. ... ok 2: 2: ---------------------------------------------------------------------- 2: Ran 1 test in 0.001s 2: 2: OK 2: -- run_test.py: return code 1 2: -- run_test.py: verify result file '/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/test_fault_injection_launch_test.xunit.xml' 2/6 Test #2: test_fault_injection_launch_test ......***Failed 14.91 sec test 3 Start 3: copyright 3: Test command: /usr/bin/python3.10 "-u" "/home/mfc/projects/autoware/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/copyright.xunit.xml" "--package-name" "fault_injection" "--output-file" "/home/mfc/projects/autoware/build/fault_injection/ament_copyright/copyright.txt" "--command" "/opt/ros/humble/bin/ament_copyright" "--xunit-file" "/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/copyright.xunit.xml" 3: Test timeout computed to be: 200 3: -- run_test.py: invoking following command in '/home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection': 3: - /opt/ros/humble/bin/ament_copyright --xunit-file /home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/copyright.xunit.xml 3: No problems found, checked 7 files 3: -- run_test.py: return code 0 3: -- run_test.py: verify result file '/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/copyright.xunit.xml' 3/6 Test #3: copyright ............................. Passed 10.76 sec test 4 Start 4: cppcheck 4: Test command: /usr/bin/python3.10 "-u" "/home/mfc/projects/autoware/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/cppcheck.xunit.xml" "--package-name" "fault_injection" "--output-file" "/home/mfc/projects/autoware/build/fault_injection/ament_cppcheck/cppcheck.txt" "--command" "/opt/ros/humble/bin/ament_cppcheck" "--xunit-file" "/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/cppcheck.xunit.xml" "--include_dirs" "/home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection/include" 4: Test timeout computed to be: 300 4: -- run_test.py: invoking following command in '/home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection': 4: - /opt/ros/humble/bin/ament_cppcheck --xunit-file /home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/cppcheck.xunit.xml --include_dirs /home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection/include 4: cppcheck 2.7 has known performance issues and therefore will not be used, set the AMENT_CPPCHECK_ALLOW_SLOW_VERSIONS environment variable to override this. 4: -- run_test.py: return code 0 4: -- run_test.py: verify result file '/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/cppcheck.xunit.xml' 4/6 Test #4: cppcheck .............................. Passed 5.31 sec test 5 Start 5: lint_cmake 5: Test command: /usr/bin/python3.10 "-u" "/home/mfc/projects/autoware/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/lint_cmake.xunit.xml" "--package-name" "fault_injection" "--output-file" "/home/mfc/projects/autoware/build/fault_injection/ament_lint_cmake/lint_cmake.txt" "--command" "/opt/ros/humble/bin/ament_lint_cmake" "--xunit-file" "/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/lint_cmake.xunit.xml" 5: Test timeout computed to be: 60 5: -- run_test.py: invoking following command in '/home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection': 5: - /opt/ros/humble/bin/ament_lint_cmake --xunit-file /home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/lint_cmake.xunit.xml 5: 5: No problems found 5: -- run_test.py: return code 0 5: -- run_test.py: verify result file '/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/lint_cmake.xunit.xml' 5/6 Test #5: lint_cmake ............................ Passed 4.44 sec test 6 Start 6: xmllint 6: Test command: /usr/bin/python3.10 "-u" "/home/mfc/projects/autoware/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/xmllint.xunit.xml" "--package-name" "fault_injection" "--output-file" "/home/mfc/projects/autoware/build/fault_injection/ament_xmllint/xmllint.txt" "--command" "/opt/ros/humble/bin/ament_xmllint" "--xunit-file" "/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/xmllint.xunit.xml" 6: Test timeout computed to be: 60 6: -- run_test.py: invoking following command in '/home/mfc/projects/autoware/src/universe/autoware.universe/simulator/fault_injection': 6: - /opt/ros/humble/bin/ament_xmllint --xunit-file /home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/xmllint.xunit.xml 6: File 'package.xml' is valid 6: 6: File 'launch/fault_injection.launch.xml' is valid 6: 6: File 'test/launch/test_fault_injection.launch.xml' is valid 6: 6: No problems found 6: -- run_test.py: return code 0 6: -- run_test.py: verify result file '/home/mfc/projects/autoware/build/fault_injection/test_results/fault_injection/xmllint.xunit.xml' 6/6 Test #6: xmllint ............................... Passed 8.39 sec 83% tests passed, 1 tests failed out of 6 Label Time Summary: copyright = 10.76 sec*proc (1 test) cppcheck = 5.31 sec*proc (1 test) gtest = 5.34 sec*proc (1 test) launch_test = 14.91 sec*proc (1 test) lint_cmake = 4.44 sec*proc (1 test) linter = 28.90 sec*proc (4 tests) xmllint = 8.39 sec*proc (1 test) Total Test time (real) = 49.23 sec The following tests FAILED: 2 - test_fault_injection_launch_test (Failed) Errors while running CTest Output from these tests are in: /home/mfc/projects/autoware/build/fault_injection/Testing/Temporary/LastTest.log Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely. --- --- stderr: fault_injection Errors while running CTest Output from these tests are in: /home/mfc/projects/autoware/build/fault_injection/Testing/Temporary/LastTest.log Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely. --- Finished <<< fault_injection [57.6s] [ with test failures ] Summary: 1 package finished [1min 32s] 1 package had stderr output: fault_injection 1 package had test failures: fault_injection ```
KeisukeShima commented 1 week ago

@xmfcx Thanks for your cooperate :+1: In the short term, extending evaluation time is a good idea. However, the root cause of the problem is the lack of resources in the testing environment. I have always struggled with that. Whether it is Flaky test or Flaky testing environment, our work is never done until we solve it.

xmfcx commented 1 week ago

@KeisukeShima the build-and-test jobs (not differential) are running on:

Either on:

On this failure instance, it was running on leo-copper machine

And if we look into the colcon test command, we can see it is being called with sequential flag: https://github.com/autowarefoundation/autoware-github-actions/blob/a1960b03b1d3f5c4320ec1d4e7916ffac437f4f1/colcon-test/action.yaml#L89-L93

colcon test --event-handlers console_cohesion+ \
  --mixin coverage-pytest \
  --packages-above ${{ inputs.target-packages }} \
  --executor sequential \
  --return-code-on-test-failure

However, the root cause of the problem is the lack of resources in the testing environment.

What would be your suggestion for improvements? Do you think for example:

We already run these sequentially and I think machines are strong enough. And by design of github-actions-runner, they can only perform one job at a time.

KeisukeShima commented 4 days ago

@xmfcx Thanks for the explanation. I would like to investigate as the situation seems to have changed from before.