Test of `lidar_apollo_segmentation_tvm` failing in CI

kminoda commented 7 months ago

Checklist

[X] I've read the contribution guidelines.
[X] I've searched other issues and no duplicate issues were found.
[X] I'm convinced that this is not my fault but a bug.

Description

The test of lidar_apollo_segmentation_tvm is failing in CI.

The log says:

1:   Check failed: (lib_handle_ != nullptr) is false: Failed to load dynamic shared library /github/home/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn/./deploy_lib.so /github/home/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn/./deploy_lib.so: cannot open shared object file: No such file or directory

https://github.com/autowarefoundation/autoware.universe/actions/runs/7080300141/job/19267993325

    Start 1: lidar_apollo_segmentation_tvm_gtest

1: Test command: /usr/bin/python3.10 "-u" "/opt/ros/humble/share/ament_cmake_test/cmake/run_test.py" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/lidar_apollo_segmentation_tvm_gtest.gtest.xml" "--package-name" "lidar_apollo_segmentation_tvm" "--output-file" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/ament_cmake_gtest/lidar_apollo_segmentation_tvm_gtest.txt" "--command" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/lidar_apollo_segmentation_tvm_gtest" "--gtest_output=xml:/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/lidar_apollo_segmentation_tvm_gtest.gtest.xml"
1: Test timeout computed to be: 120
1: -- run_test.py: invoking following command in '/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm':
1:  - /__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/lidar_apollo_segmentation_tvm_gtest --gtest_output=xml:/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/lidar_apollo_segmentation_tvm_gtest.gtest.xml
1: Running main() from /opt/ros/humble/src/gtest_vendor/src/gtest_main.cc
1: [==========] Running 1 test from 1 test suite.
1: [----------] Global test environment set-up.
1: [----------] 1 test from lidar_apollo_segmentation_tvm
1: [ RUN      ] lidar_apollo_segmentation_tvm.others
1: unknown file: Failure
1: C++ exception with description "[02:50:53] ./.obj-aarch64-linux-gnu/tvm-build-prefix/src/tvm-build/src/runtime/dso_library.cc:119: 
1: ---------------------------------------------------------------
1: An error occurred during the execution of TVM.
1: For more information, please see: https://tvm.apache.org/docs/errors.html
1: ---------------------------------------------------------------
1:   Check failed: (lib_handle_ != nullptr) is false: Failed to load dynamic shared library /github/home/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn/./deploy_lib.so /github/home/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn/./deploy_lib.so: cannot open shared object file: No such file or directory
1: Stack trace:
1:   0: 0x0000ffffb3761d9b
1:   1: 0x0000ffffb387be67
1:   2: tvm::runtime::DSOLibrary::Load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
1:   3: tvm::runtime::CreateDSOLibraryObject(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
1:   4: 0x0000ffffb379679b
1:   5: tvm::runtime::Module::LoadFromFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
1:   6: tvm_utility::pipeline::InferenceEngineTVM::InferenceEngineTVM(tvm_utility::pipeline::InferenceEngineTVMConfig const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
1:   7: autoware::perception::lidar_apollo_segmentation_tvm::ApolloLidarSegmentation::ApolloLidarSegmentation(int, float, bool, bool, float, float, float, float, int, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
1:   8: test_segmentation(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool, bool)
1:   9: lidar_apollo_segmentation_tvm_others_Test::TestBody()
1:   10: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)
1:   11: testing::Test::Run()
1:   12: testing::TestInfo::Run()
1:   13: testing::TestSuite::Run()
1:   14: testing::internal::UnitTestImpl::RunAllTests()
1:   15: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)
1:   16: testing::UnitTest::Run()
1:   17: main
1:   18: 0x0000ffffb3c973fb
1:   19: __libc_start_main
1:   20: _start
1:   21: 0xffffffffffffffff
1: 
1: " thrown in the test body.
1: [  FAILED  ] lidar_apollo_segmentation_tvm.others (42 ms)
1: [----------] 1 test from lidar_apollo_segmentation_tvm (42 ms total)
1: 
1: [----------] Global test environment tear-down
1: [==========] 1 test from 1 test suite ran. (42 ms total)
1: [  PASSED  ] 0 tests.
1: [  FAILED  ] 1 test, listed below:
1: [  FAILED  ] lidar_apollo_segmentation_tvm.others
1: 
1:  1 FAILED TEST
1: -- run_test.py: return code 1
1: -- run_test.py: verify result file '/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/lidar_apollo_segmentation_tvm_gtest.gtest.xml'
1/5 Test #1: lidar_apollo_segmentation_tvm_gtest ...***Failed    0.15 sec
test 2
    Start 2: copyright

2: Test command: /usr/bin/python3.10 "-u" "/opt/ros/humble/share/ament_cmake_test/cmake/run_test.py" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/copyright.xunit.xml" "--package-name" "lidar_apollo_segmentation_tvm" "--output-file" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/ament_copyright/copyright.txt" "--command" "/opt/ros/humble/bin/ament_copyright" "--xunit-file" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/copyright.xunit.xml"
2: Test timeout computed to be: 200
2: -- run_test.py: invoking following command in '/__w/autoware.universe/autoware.universe/perception/lidar_apollo_segmentation_tvm':
2:  - /opt/ros/humble/bin/ament_copyright --xunit-file /__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/copyright.xunit.xml
2: No problems found, checked 15 files
2: -- run_test.py: return code 0
2: -- run_test.py: verify result file '/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/copyright.xunit.xml'
2/5 Test #2: copyright .............................   Passed    0.80 sec
test 3
    Start 3: cppcheck

3: Test command: /usr/bin/python3.10 "-u" "/opt/ros/humble/share/ament_cmake_test/cmake/run_test.py" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/cppcheck.xunit.xml" "--package-name" "lidar_apollo_segmentation_tvm" "--output-file" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/ament_cppcheck/cppcheck.txt" "--command" "/opt/ros/humble/bin/ament_cppcheck" "--xunit-file" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/cppcheck.xunit.xml" "--include_dirs" "/__w/autoware.universe/autoware.universe/perception/lidar_apollo_segmentation_tvm/include" "/__w/autoware.universe/autoware.universe/perception/lidar_apollo_segmentation_tvm/data/models"
3: Test timeout computed to be: 300
3: -- run_test.py: invoking following command in '/__w/autoware.universe/autoware.universe/perception/lidar_apollo_segmentation_tvm':
3:  - /opt/ros/humble/bin/ament_cppcheck --xunit-file /__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/cppcheck.xunit.xml --include_dirs /__w/autoware.universe/autoware.universe/perception/lidar_apollo_segmentation_tvm/include /__w/autoware.universe/autoware.universe/perception/lidar_apollo_segmentation_tvm/data/models
3: cppcheck 2.7 has known performance issues and therefore will not be used, set the AMENT_CPPCHECK_ALLOW_SLOW_VERSIONS environment variable to override this.
3: -- run_test.py: return code 0
3: -- run_test.py: verify result file '/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/cppcheck.xunit.xml'
3/5 Test #3: cppcheck ..............................   Passed    0.21 sec
test 4
    Start 4: lint_cmake

4: Test command: /usr/bin/python3.10 "-u" "/opt/ros/humble/share/ament_cmake_test/cmake/run_test.py" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/lint_cmake.xunit.xml" "--package-name" "lidar_apollo_segmentation_tvm" "--output-file" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/ament_lint_cmake/lint_cmake.txt" "--command" "/opt/ros/humble/bin/ament_lint_cmake" "--xunit-file" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/lint_cmake.xunit.xml"
4: Test timeout computed to be: 60
4: -- run_test.py: invoking following command in '/__w/autoware.universe/autoware.universe/perception/lidar_apollo_segmentation_tvm':
4:  - /opt/ros/humble/bin/ament_lint_cmake --xunit-file /__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/lint_cmake.xunit.xml
4: 
4: No problems found
4: -- run_test.py: return code 0
4: -- run_test.py: verify result file '/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/lint_cmake.xunit.xml'
4/5 Test #4: lint_cmake ............................   Passed    0.21 sec
test 5
    Start 5: xmllint

5: Test command: /usr/bin/python3.10 "-u" "/opt/ros/humble/share/ament_cmake_test/cmake/run_test.py" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/xmllint.xunit.xml" "--package-name" "lidar_apollo_segmentation_tvm" "--output-file" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/ament_xmllint/xmllint.txt" "--command" "/opt/ros/humble/bin/ament_xmllint" "--xunit-file" "/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/xmllint.xunit.xml"
5: Test timeout computed to be: 60
5: -- run_test.py: invoking following command in '/__w/autoware.universe/autoware.universe/perception/lidar_apollo_segmentation_tvm':
5:  - /opt/ros/humble/bin/ament_xmllint --xunit-file /__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/xmllint.xunit.xml
5: File 'package.xml' is valid
5: 
5: No problems found
5: -- run_test.py: return code 0
5: -- run_test.py: verify result file '/__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/xmllint.xunit.xml'
5/5 Test #5: xmllint ...............................   Passed    0.21 sec

80% tests passed, 1 tests failed out of 5

Label Time Summary:
copyright     =   0.80 sec*proc (1 test)
cppcheck      =   0.21 sec*proc (1 test)
gtest         =   0.15 sec*proc (1 test)
lint_cmake    =   0.21 sec*proc (1 test)
linter        =   1.44 sec*proc (4 tests)
xmllint       =   0.21 sec*proc (1 test)

Total Test time (real) =   1.59 sec

The following tests FAILED:
      1 - lidar_apollo_segmentation_tvm_gtest (Failed)
Errors while running CTest
Output from these tests are in: /__w/autoware.universe/autoware.universe/build/lidar_apollo_segmentation_tvm/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
---
Finished <<< lidar_apollo_segmentation_tvm [1.71s]  [ with test failures ]

Expected behavior

Should pass

Actual behavior

Does not pass

Steps to reproduce

See the CI result

Versions

No response

Possible causes

No response

Additional context

No response

kminoda commented 7 months ago

@lexavtanke @ambroise-arm @angry-crab Hi, the build-and-test CIs have recently been failing, partly because of this package's test failure. I am mentioning you since this PR might be relevant: https://github.com/autowarefoundation/autoware.universe/pull/5431. Any ideas on how to fix this issue?

ambroise-arm commented 7 months ago

Note: from looking quickly at the github actions, it seems to only happen on the self-hosted aarch64 pipelines, and it seems it has only been happening for a few days. Unless the failure has some random component to it, as I didn't check all pipelines.

kminoda commented 7 months ago

@ambroise-arm Thank you for checking. I suspect this has been occurring for some days, but was hidden by a build failure that was fixed a couple of days ago by this PR: https://github.com/autowarefoundation/autoware.universe/pull/5749

ambroise-arm commented 7 months ago

I couldn't reproduce locally. It is not obvious to me what is wrong.

kminoda commented 7 months ago

@ambroise-arm As stated in the description, I think it's because the CI cannot access deploy_lib.so

ambroise-arm commented 7 months ago

Indeed. But it is expected that this deploy_lib.so file (and a couple of other files) will not be present in CI. Some code was added to the tests to skip them if the files are not present: https://github.com/autowarefoundation/autoware.universe/blob/60b4030abd332308aa1e902fcf02de1ffc940792/perception/lidar_apollo_segmentation_tvm/test/main.cpp#L109-L115

I don't know why it works as expected on the x86 CI runners (as well as locally for me on an arm64 machine), but fails on the arm64 CI runners.

kminoda commented 7 months ago

Now that makes sense. But I have no idea why the CI fails :thinking:

stale[bot] commented 4 months ago

This pull request has been automatically marked as stale because it has not had recent activity.

JunWu-ARM commented 1 week ago

@kminoda, I can try to help with this. First, please confirm whether the problem still exists. I built and ran the tests on our ARM64 test target and have not been able to reproduce the problem so far.

dev@ava-0:/workspace$ colcon test-result --all

build/lidar_apollo_segmentation_tvm/Testing/20240627-0829/Test.xml: 5 tests, 0 errors, 0 failures, 0 skipped
build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/copyright.xunit.xml: 15 tests, 0 errors, 0 failures, 0 skipped
build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/cppcheck.xunit.xml: 15 tests, 0 errors, 0 failures, 15 skipped
build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/lidar_apollo_segmentation_tvm_gtest.gtest.xml: 1 test, 0 errors, 0 failures, 0 skipped
build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/lint_cmake.xunit.xml: 2 tests, 0 errors, 0 failures, 0 skipped
build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/xmllint.xunit.xml: 1 test, 0 errors, 0 failures, 0 skipped

Summary: 39 tests, 0 errors, 0 failures, 15 skipped
dev@ava-0:/workspace$
dev@ava-0:/workspace$ uname -a
Linux ava-0 5.4.0-176-generic #196-Ubuntu SMP Fri Mar 22 16:46:20 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

kminoda commented 1 week ago

@JunWu-ARM Thank you for checking. Actually it seems that we no longer have this issue in build-and-test-daily-arm64, so I think we can close this issue.

ambroise-arm commented 1 week ago

@kminoda I see that there are passing pipelines in https://github.com/autowarefoundation/autoware.universe/actions/workflows/build-and-test-daily-arm64.yaml, but inside of those I only see lidar_apollo_segmentation_tvm built, but not tested. It seems the issue still exists.

EDIT: It probably got removed following the remark from Fatih in https://github.com/orgs/autowarefoundation/discussions/4794#discussioncomment-9605616

JunWu-ARM commented 6 days ago

@kminoda I did some investigation and found a possible cause: the pre-installed /root/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn/deploy_lib.so in the docker image is actually an x86_64 build of the library, which may be why loading it fails.

My steps for the test:

# I am using an internal arm64 testing target to run the test
dev@ava-0:junwu01$ uname -a
Linux ava-0 5.4.0-176-generic #196-Ubuntu SMP Fri Mar 22 16:46:20 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

# Install Autoware with docker
# Pull the latest devel image for arm64 https://github.com/autowarefoundation/autoware/pkgs/container/autoware/231944498?tag=latest-devel
docker pull ghcr.io/autowarefoundation/autoware:refs-tags-openadkit-vtest.0.0-devel@sha256:547e6e11a5d64574dd9aa92365a492725b932796d7203d5ed37d88feede3fb2e

git clone https://github.com/autowarefoundation/autoware.git
cd autoware

# run the docker image
./docker/run.sh --devel --no-nvidia  --headless

# from now you are inside a docker
cd /workspace

# Get all src repos
mkdir src
vcs import src < autoware.repos

# Update dependent ROS packages.
sudo apt update
rosdep update
rosdep install -y --from-paths src --ignore-src --rosdistro $ROS_DISTRO

# Build the workspace
colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release

# Build a package
colcon build --packages-up-to lidar_apollo_segmentation_tvm

# Unit test a package
colcon test --packages-select lidar_apollo_segmentation_tvm

# Show test results
colcon test-result --all

What I found is that, inside the docker container, the pre-installed library is built for the x86_64 architecture:

root@ava-0:~/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn# readelf deploy_lib.so -h
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x2060
  Start of program headers:          64 (bytes into file)
  Start of section headers:          266120 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         9
  Size of section headers:           64 (bytes)
  Number of section headers:         37
  Section header string table index: 36
root@ava-0:~/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn# uname -a
Linux ava-0 5.4.0-176-generic #196-Ubuntu SMP Fri Mar 22 16:46:20 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
root@ava-0:~/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn# ls ../..
baidu_cnn-x86_64-llvm-3.0.0-20221221.tar.gz  models

# Several other libraries are also built for x86_64; the impact still needs to be confirmed
root@ava-0:~/autoware_data# ls -R | grep x86
baidu_cnn-x86_64-llvm-3.0.0-20221221.tar.gz
centerpoint_backbone-x86_64-llvm-3.0.0-20221221.tar.gz
centerpoint_encoder-x86_64-llvm-3.0.0-20221221.tar.gz
yolo_v2_tiny-x86_64-llvm-3.0.0-20221221.tar.gz

We can also see that the artifacts role (https://github.com/autowarefoundation/autoware/blob/main/ansible/roles/artifacts/tasks/main.yaml) does use the x86 prebuilt binaries. It still needs to be confirmed whether that is the root cause.
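The `readelf -h` check above can also be scripted. The following is a minimal sketch, not part of Autoware: the helper names `elf_machine` and `matches_host` are hypothetical, and only the two architectures discussed in this thread are mapped. It reads the `e_machine` field of the ELF header and compares it with the host architecture as reported by `platform.machine()` (which returns `aarch64` or `x86_64` on Linux).

```python
import platform
import struct

# e_machine values defined by the ELF specification
EM_X86_64 = 62
EM_AARCH64 = 183

MACHINE_NAMES = {EM_X86_64: "x86_64", EM_AARCH64: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Return the target architecture named in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 (EI_DATA): 1 = little endian, 2 = big endian
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return MACHINE_NAMES.get(machine, f"unknown ({machine})")


def matches_host(path: str) -> bool:
    """True if the library at `path` targets the architecture of this host (Linux naming)."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20)) == platform.machine()
```

Under these assumptions, `matches_host` would return `False` for the `deploy_lib.so` shown above when called on the aarch64 target, since its header reports x86-64.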

JunWu-ARM commented 2 days ago

@oguzkaganozt, @youtalk, could you please review my investigation?

Summary:

Whether the test will report an error depends on who runs the test:

A non-root user reports success: it cannot open /root/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn/./deploy_lib.so (no permission), so the test application assumes the .so does not exist and skips the test, and no error is reported.
The root user reports the error: it can see the .so and tries to load it, but loading fails because deploy_lib.so is an x86_64 build.
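The difference between the two cases can be illustrated with a small sketch. It assumes, without having checked the test source, that the test's existence check behaves like a `stat` call; the helper name `probe` is hypothetical. A plain existence check such as `os.path.exists` returns `False` both when the file is missing and when a parent directory (here `/root`, typically mode 0700) cannot be searched by the current user, so the two situations look identical to the caller. Catching the exceptions separately keeps them apart:

```python
import os


def probe(path: str) -> str:
    """Classify a path as 'present', 'missing', or 'denied' for the current user."""
    try:
        os.stat(path)
    except (FileNotFoundError, NotADirectoryError):
        # The file, or one of its parent directories, does not exist
        return "missing"
    except PermissionError:
        # A parent directory is not searchable by this user
        return "denied"
    return "present"
```

With a check like this, a test could report "denied" explicitly instead of silently skipping, which would have made the root versus non-root difference visible in the logs.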

In our development environment, we log in as the host user by running ./docker/run.sh, so the user inside the container is non-root and the test reports success (the detailed log shows that the test was skipped).

In the autoware GitHub pipeline, the docker image runs as root, so the error is reported, e.g. build-and-test-daily-arm64 · autowarefoundation/autoware.universe@c2f9579 (github.com). At "Initialize Containers", the following commands create and start the container:

/usr/bin/docker create --name 8ffb922c9e784952807e50a2957540e9_ghcrioautowarefoundationautowarelatestprebuilt_96069e --label 591d3a --workdir /__w/autoware.universe/autoware.universe --network github_network_098334ac17b34c44924c79f4b34be4a4 -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/ubuntu/actions-runner/_work":"/__w" -v "/home/ubuntu/actions-runner/externals":"/__e":ro -v "/home/ubuntu/actions-runner/_work/_temp":"/__w/_temp" -v "/home/ubuntu/actions-runner/_work/_actions":"/__w/_actions" -v "/home/ubuntu/actions-runner/_work/_tool":"/__w/_tool" -v "/home/ubuntu/actions-runner/_work/_temp/_github_home":"/github/home" -v "/home/ubuntu/actions-runner/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" ghcr.io/autowarefoundation/autoware:latest-prebuilt "-f" "/dev/null"
d666295bb0a9aa46129819b0559ff5ab6260ea7a0a128bc1b1962ccbd2b79b80
/usr/bin/docker start d666295bb0a9aa46129819b0559ff5ab6260ea7a0a128bc1b1962ccbd2b79b80

Please note that the user inside the Docker container depends on how the Docker image was built. If no user is set when the image is built, the default user inside the container is root (uid 0), unless a user ID and group ID are set when running the container.
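To compare the two behaviours on one machine, the same image can be started once as root and once as the host user. The sketch below only builds the argument list for `docker run` with its `--user` option; the helper name `docker_run_argv` is hypothetical, and the image tag is the one from the CI command above.

```python
import os
from typing import List, Optional


def docker_run_argv(
    image: str,
    command: List[str],
    uid: Optional[int] = None,
    gid: Optional[int] = None,
) -> List[str]:
    """Build a `docker run` argument list, optionally overriding the container user."""
    argv = ["docker", "run", "--rm"]
    if uid is not None:
        # --user accepts either "uid" or "uid:gid"
        user = str(uid) if gid is None else f"{uid}:{gid}"
        argv += ["--user", user]
    return argv + [image] + command


IMAGE = "ghcr.io/autowarefoundation/autoware:latest-prebuilt"

# Default user of the image (root if the image sets no user)
as_default = docker_run_argv(IMAGE, ["id", "-u"])

# Same image, but running as the invoking host user
as_host_user = docker_run_argv(IMAGE, ["id", "-u"], uid=os.getuid(), gid=os.getgid())
```

Either list can be passed to `subprocess.run` to print the effective uid inside the container.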
