Open kminoda opened 7 months ago
@lexavtanke @ambroise-arm @angry-crab Hi, recently the build-and-test CIs are failing partially due to this package's test failure. Mentioning you since this PR might be relevant: https://github.com/autowarefoundation/autoware.universe/pull/5431. Any ideas how to fix this issue?
Note: from looking quickly at the github actions, it seems to only happen on the self-hosted aarch64 pipelines, and it seems it has only been happening for a few days. Unless the failure has some random component to it, as I didn't check all pipelines.
@ambroise-arm Thank you for checking. I suspect this has been occurring for days; it was just hidden by the build failure that was fixed a couple of days ago by this PR: https://github.com/autowarefoundation/autoware.universe/pull/5749
I couldn't reproduce locally. It is not obvious to me what is wrong.
@ambroise-arm As stated in the description, I think it's because the CI cannot access deploy_lib.so
Indeed. But it is expected that this deploy_lib.so file (and a couple of other files) will not be present in CI. Some code was added to the tests to skip them if the files are not present: https://github.com/autowarefoundation/autoware.universe/blob/60b4030abd332308aa1e902fcf02de1ffc940792/perception/lidar_apollo_segmentation_tvm/test/main.cpp#L109-L115
I don't know why it works as expected on the x86 CI runners (as well as locally for me on an arm64 machine), but fails on the arm64 CI runners.
Now it makes sense. But I have no idea why the CI fails :thinking:
This pull request has been automatically marked as stale because it has not had recent activity.
@kminoda, I can try to help with this. First, please confirm whether the problem still exists. I built and ran the tests on our ARM64 test target and have not been able to reproduce it so far.
dev@ava-0:/workspace$ colcon test-result --all
build/lidar_apollo_segmentation_tvm/Testing/20240627-0829/Test.xml: 5 tests, 0 errors, 0 failures, 0 skipped
build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/copyright.xunit.xml: 15 tests, 0 errors, 0 failures, 0 skipped
build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/cppcheck.xunit.xml: 15 tests, 0 errors, 0 failures, 15 skipped
build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/lidar_apollo_segmentation_tvm_gtest.gtest.xml: 1 test, 0 errors, 0 failures, 0 skipped
build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/lint_cmake.xunit.xml: 2 tests, 0 errors, 0 failures, 0 skipped
build/lidar_apollo_segmentation_tvm/test_results/lidar_apollo_segmentation_tvm/xmllint.xunit.xml: 1 test, 0 errors, 0 failures, 0 skipped
Summary: 39 tests, 0 errors, 0 failures, 15 skipped
dev@ava-0:/workspace$
dev@ava-0:/workspace$ uname -a
Linux ava-0 5.4.0-176-generic #196-Ubuntu SMP Fri Mar 22 16:46:20 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
@JunWu-ARM Thank you for checking. Actually it seems that we no longer have this issue in build-and-test-daily-arm64, so I think we can close this issue.
@kminoda I see that there are passing pipelines in https://github.com/autowarefoundation/autoware.universe/actions/workflows/build-and-test-daily-arm64.yaml, but inside of those I only see lidar_apollo_segmentation_tvm built, but not tested.
It seems the issue still exists.
EDIT: It probably got removed following the remark from Fatih in https://github.com/orgs/autowarefoundation/discussions/4794#discussioncomment-9605616
@kminoda
I did some investigation and found that the cause could be the pre-installed /root/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn/deploy_lib.so in the docker image: it is actually an x86_64 build of the library, so loading it on arm64 may fail.
My steps for the test:
# I am using an internal arm64 testing target to run the test
dev@ava-0:junwu01$ uname -a
Linux ava-0 5.4.0-176-generic #196-Ubuntu SMP Fri Mar 22 16:46:20 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
# Install Autoware with docker
# Pull the latest devel image for arm64 https://github.com/autowarefoundation/autoware/pkgs/container/autoware/231944498?tag=latest-devel
docker pull ghcr.io/autowarefoundation/autoware:refs-tags-openadkit-vtest.0.0-devel@sha256:547e6e11a5d64574dd9aa92365a492725b932796d7203d5ed37d88feede3fb2e
git clone https://github.com/autowarefoundation/autoware.git
cd autoware
# run the docker image
./docker/run.sh --devel --no-nvidia --headless
# from now you are inside a docker
cd /workspace
# Get all src repos
mkdir src
vcs import src < autoware.repos
# Update dependent ROS packages.
sudo apt update
rosdep update
rosdep install -y --from-paths src --ignore-src --rosdistro $ROS_DISTRO
# Build the workspace
colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release
# Build a package
colcon build --packages-up-to lidar_apollo_segmentation_tvm
# Unit test a package
colcon test --packages-select lidar_apollo_segmentation_tvm
# Show test results
colcon test-result --all
What I found is that, inside the docker container, the pre-installed library is built for the x86_64 architecture:
root@ava-0:~/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn# readelf deploy_lib.so -h
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Shared object file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x2060
Start of program headers: 64 (bytes into file)
Start of section headers: 266120 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 9
Size of section headers: 64 (bytes)
Number of section headers: 37
Section header string table index: 36
root@ava-0:~/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn# uname -a
Linux ava-0 5.4.0-176-generic #196-Ubuntu SMP Fri Mar 22 16:46:20 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
root@ava-0:~/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn# ls ../..
baidu_cnn-x86_64-llvm-3.0.0-20221221.tar.gz models
# Several other libraries are also x86_64 builds; the impact needs to be confirmed
root@ava-0:~/autoware_data# ls -R | grep x86
baidu_cnn-x86_64-llvm-3.0.0-20221221.tar.gz
centerpoint_backbone-x86_64-llvm-3.0.0-20221221.tar.gz
centerpoint_encoder-x86_64-llvm-3.0.0-20221221.tar.gz
yolo_v2_tiny-x86_64-llvm-3.0.0-20221221.tar.gz
And we can see that https://github.com/autowarefoundation/autoware/blob/main/ansible/roles/artifacts/tasks/main.yaml does use the x86 prebuilt binaries. We need to confirm whether that is the root cause.
@oguzkaganozt, @youtalk, could you please review my investigation?
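The `readelf` check above can also be scripted. The following is only a minimal Python sketch, not part of Autoware: the helper names `elf_machine` and `matches_host` are hypothetical, the lookup table covers just the two architectures discussed here, and it assumes a Linux host where `platform.machine()` returns `x86_64` or `aarch64`.

```python
import platform
import struct

# Subset of e_machine values from the ELF specification.
EM_NAMES = {0x3E: "x86_64", 0xB7: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Return the architecture name stored in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA (byte 5): 1 means little endian, 2 means big endian.
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18 in both ELF32 and ELF64.
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return EM_NAMES.get(machine, f"unknown(0x{machine:x})")


def matches_host(path: str) -> bool:
    """Hypothetical helper: True if the library at `path` targets the host architecture."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20)) == platform.machine()
```

Calling `matches_host` on the deploy_lib.so path from the `readelf` session above would be expected to return False on an aarch64 host if the file really is the x86_64 build.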
Whether the test reports an error depends on which user runs it:
A non-root user sees success: the user cannot open /root/autoware_data/lidar_apollo_segmentation_tvm/models/baidu_cnn/./deploy_lib.so (no permission), so the test application concludes that the .so does not exist and skips the test, and no error is reported.
The root user sees the error: root can see the .so and tries to load it, but loading fails because deploy_lib.so is an x86_64 build.
In our development environment, we log in as the host user by running ./docker/run.sh, so the user inside the container is non-root and the run reports success (the detailed log shows the test as skipped).
In the autoware GitHub pipeline, the docker image runs as root, so the error is reported, e.g. build-and-test-daily-arm64 · autowarefoundation/autoware.universe@c2f9579 (github.com). At "Initialize Containers", the following commands create and start the container:
/usr/bin/docker create --name 8ffb922c9e784952807e50a2957540e9_ghcrioautowarefoundationautowarelatestprebuilt_96069e --label 591d3a --workdir /__w/autoware.universe/autoware.universe --network github_network_098334ac17b34c44924c79f4b34be4a4 -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/ubuntu/actions-runner/_work":"/__w" -v "/home/ubuntu/actions-runner/externals":"/__e":ro -v "/home/ubuntu/actions-runner/_work/_temp":"/__w/_temp" -v "/home/ubuntu/actions-runner/_work/_actions":"/__w/_actions" -v "/home/ubuntu/actions-runner/_work/_tool":"/__w/_tool" -v "/home/ubuntu/actions-runner/_work/_temp/_github_home":"/github/home" -v "/home/ubuntu/actions-runner/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" ghcr.io/autowarefoundation/autoware:latest-prebuilt "-f" "/dev/null"
d666295bb0a9aa46129819b0559ff5ab6260ea7a0a128bc1b1962ccbd2b79b80
/usr/bin/docker start d666295bb0a9aa46129819b0559ff5ab6260ea7a0a128bc1b1962ccbd2b79b80
Please note that the user inside a Docker container depends on how the image was built. If no user is set when building the image, the default user inside the container is root (uid 0), unless a user ID and group ID are set when running the container.
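To illustrate why the two users see different results, here is a minimal Python sketch of an open-based existence check. It is not the actual test code (that is C++ in test/main.cpp); `probe_model` and its `opener` parameter are hypothetical names used only for this illustration. The point is that a check which treats every failed open as "file absent" cannot tell a missing file from one it has no permission to read.

```python
def probe_model(path, opener=open):
    """Classify a model file the way an open()-based check would see it."""
    try:
        with opener(path, "rb"):
            return "readable"
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        # A non-root user probing a file under /root would end up here.
        return "permission denied"


def would_skip(path, opener=open):
    """Assumed skip rule: any file that cannot be opened counts as absent."""
    return probe_model(path, opener) != "readable"
```

Under this assumed rule a non-root user skips the test, while root proceeds to load the library and hits the architecture mismatch.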
Description
Test of lidar_apollo_segmentation_tvm failing says:
https://github.com/autowarefoundation/autoware.universe/actions/runs/7080300141/job/19267993325
Expected behavior
Should pass
Actual behavior
Does not pass
Steps to reproduce
See the CI result