eclipse-ankaios / ankaios

Eclipse Ankaios provides workload and container orchestration for automotive High Performance Computing (HPC) software.
https://eclipse-ankaios.github.io/ankaios/
Apache License 2.0

Unstable system tests #363

Open inf17101 opened 2 months ago

inf17101 commented 2 months ago

Currently, a lot of system tests fail locally.

After deleting the dev container, running prune commands to clean up Docker completely, and building a new dev container, fewer system tests fail.

Current Behavior

System tests are failing for different reasons.

Expected Behavior

All system tests shall pass.

Steps to Reproduce

tools/run_robot_tests.sh tests/

Context (Environment)

Logs

Additional Information

Final result

To be filled by the one closing the issue.

inf17101 commented 2 months ago

I started investigating a bit, since three contributors have the same issues, all of them using WSL2.

First, some tests fail because rootlessport ports are still kept open after the containers of a test have been deleted.

Port 8081/tcp6 is still open after the system test, and some system tests fail and report "port already in use":

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name    
tcp        0      0 0.0.0.0:8000            0.0.0.0:*               LISTEN      1000       4413747    122830/python3      
tcp        0      0 127.0.0.1:40251         0.0.0.0:*               LISTEN      1000       3385049    321/node            
tcp6       0      0 :::8081                 :::*                    LISTEN      1000       3450044    17172/rootlessport  
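To spot leftover listeners like the one above automatically, the table can be parsed after each test. The following is only a minimal sketch: it assumes output with the same column layout as shown here, and the names `Listener` and `parse_listeners` are hypothetical helpers, not part of Ankaios or its test tooling.

```python
from typing import List, NamedTuple, Optional


class Listener(NamedTuple):
    proto: str
    port: int
    pid: Optional[int]
    program: str


def parse_listeners(netstat_output: str) -> List[Listener]:
    """Parse LISTEN rows from a netstat-style table (layout assumed as in this issue)."""
    listeners = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: Proto Recv-Q Send-Q Local Foreign State User Inode PID/Program
        if len(fields) < 9 or fields[5] != "LISTEN":
            continue  # skips the banner, the header row, and non-listening sockets
        port = int(fields[3].rsplit(":", 1)[1])  # also handles IPv6 forms like ":::8081"
        pid, _, program = fields[8].partition("/")
        listeners.append(
            Listener(fields[0], port, int(pid) if pid.isdigit() else None, program)
        )
    return listeners


SAMPLE = """\
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name
tcp        0      0 0.0.0.0:8000            0.0.0.0:*               LISTEN      1000       4413747    122830/python3
tcp        0      0 127.0.0.1:40251         0.0.0.0:*               LISTEN      1000       3385049    321/node
tcp6       0      0 :::8081                 :::*                    LISTEN      1000       3450044    17172/rootlessport
"""

# Keep only the listeners that belong to rootlessport.
leftover = [l for l in parse_listeners(SAMPLE) if l.program == "rootlessport"]
```

A test teardown could fail early, with a clear message, if `leftover` is not empty.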

Log output of some randomly picked failing system test:

[2024-08-30T11:55:10Z INFO  ank_agent::workload::workload_control_loop] Retry '3' out of '20': Failed to create workload: 'nginx_from_manifest1.2ba2d59f84d66954e06fc5f5a59f64ad053dc2963bc404ccdf932d9fd15231ba.agent_A': 'Error: rootlessport listen tcp 0.0.0.0:8081: bind: address already in use. Execution of 'podman "run" "--detach" "--name" "nginx_from_manifest1.2ba2d59f84d66954e06fc5f5a59f64ad053dc2963bc404ccdf932d9fd15231ba.agent_A" "-p" "8081:80" "--mount=type=bind,source=/tmp/ankaios/agent_A_io/nginx_from_manifest1.2ba2d59f84d66954e06fc5f5a59f64ad053dc2963bc404ccdf932d9fd15231ba,destination=/run/ankaios/control_interface" "--label=name=nginx_from_manifest1.2ba2d59f84d66954e06fc5f5a59f64ad053dc2963bc404ccdf932d9fd15231ba.agent_A" "--label=agent=agent_A" "ghcr.io/eclipse-ankaios/tests/nginx:alpine-slim"''
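One possible mitigation on the test side would be to wait until the port is actually free before starting the next test. This is a sketch under assumptions, not something the Ankaios test suite is known to do; `port_is_free` and `wait_for_port_free` are hypothetical helper names, and the timeout values are arbitrary.

```python
import socket
import time


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use", as in the log line above.
            return False
    return True


def wait_for_port_free(
    port: int, host: str = "0.0.0.0", timeout: float = 10.0, interval: float = 0.2
) -> bool:
    """Poll until the port can be bound or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if port_is_free(port, host):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, a teardown step could call `wait_for_port_free(8081)` and report a leftover listener instead of letting the next test run into the bind error.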

The next issue: the test "Test Ankaios Podman create a container with custom name" also fails, because of a mixture of "container storage is already in use" and the previously reported issue "rootlessport listen tcp 0.0.0.0:8081: bind: address already in use":

Agent log line:

[2024-08-30T11:57:15Z DEBUG ank_agent::runtime_connectors::podman::podman_runtime] Creating container failed, cleaning up. Error: 'Error: creating container storage: the container name "test_workload1" is already in use by 0cbb0cc7c1723692e70db68e629e5bd8b2727c5c05f4ece95a726b6e3b6d39c7. You have to remove that container to be able to reuse that name: that name is already in use, or use --replace to instruct Podman to do so.. Execution of 'podman "run" "--detach" "--name" "nginx.53ff05e05f7469d11069f2bca8b1e6328e1d63b22743032472de3b8831126479.agent_A" "-p" "8081:80" "--name" "test_workload1" "--mount=type=bind,source=/tmp/ankaios/agent_A_io/nginx.53ff05e05f7469d11069f2bca8b1e6328e1d63b22743032472de3b8831126479,destination=/run/ankaios/control_interface" "--label=name=nginx.53ff05e05f7469d11069f2bca8b1e6328e1d63b22743032472de3b8831126479.agent_A" "--label=agent=agent_A" "ghcr.io/eclipse-ankaios/tests/nginx:alpine-slim"''
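To find out which leftover container blocks the name, the conflict can be extracted from the agent log. This is a small sketch that assumes the error wording stays exactly as in the line above; `find_name_conflict` is a hypothetical helper, not an Ankaios function.

```python
import re
from typing import Optional, Tuple

# Matches the Podman error text quoted in the agent log above.
NAME_IN_USE = re.compile(
    r'the container name "(?P<name>[^"]+)" is already in use by (?P<id>[0-9a-f]+)'
)


def find_name_conflict(log_line: str) -> Optional[Tuple[str, str]]:
    """Return (container_name, blocking_container_id), or None if there is no conflict."""
    match = NAME_IN_USE.search(log_line)
    return (match["name"], match["id"]) if match else None


line = (
    'Error: creating container storage: the container name "test_workload1" is '
    "already in use by "
    "0cbb0cc7c1723692e70db68e629e5bd8b2727c5c05f4ece95a726b6e3b6d39c7. "
    "You have to remove that container to be able to reuse that name"
)
conflict = find_name_conflict(line)
```

The returned ID identifies the container that a previous test left behind and that would have to be removed before the name can be reused.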

Sometimes all authorization system tests fail. The logs say that the tmp directory mounted into the control_interface_tester workload does not exist. I checked with the ls command, and indeed it is sometimes not created properly. It could also be a permission issue.

[2024-08-30T10:12:16Z DEBUG ank_agent::runtime_connectors::podman::podman_runtime] Creating container failed, cleaning up. Error: 'Error: statfs /tmp/tmp6w8q3hdi: no such file or directory. Execution of 'podman "run" "--detach" "--name" "controller.98435576aaff3af8f7d1673537cc80b0068b7edc5fe30c609aea021a684f2205.agent_A" "--mount=type=bind,source=/tmp/tmp6w8q3hdi,destination=/data/" "--restart=no" "--label=name=controller.98435576aaff3af8f7d1673537cc80b0068b7edc5fe30c609aea021a684f2205.agent_A" "--label=agent=agent_A" "ghcr.io/eclipse-ankaios/control_interface_tester:manual-build-1" "/data/commands.yaml" "/data/output.yaml"''
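To narrow down whether the directory is missing before Podman is even invoked, the bind-mount sources can be checked from the argument list. This is a sketch only: it handles just the `--mount=type=bind,source=...` form seen in the log above, and `missing_bind_sources` is a hypothetical helper name.

```python
import os
from typing import List


def missing_bind_sources(podman_args: List[str]) -> List[str]:
    """Return bind-mount source paths from the arguments that do not exist on disk."""
    missing = []
    for arg in podman_args:
        if not arg.startswith("--mount="):
            continue
        # Split "type=bind,source=/a,destination=/b" into a key/value mapping.
        options = dict(
            opt.split("=", 1)
            for opt in arg[len("--mount="):].split(",")
            if "=" in opt
        )
        source = options.get("source")
        if options.get("type") == "bind" and source and not os.path.exists(source):
            missing.append(source)
    return missing


args = [
    "run",
    "--detach",
    "--mount=type=bind,source=/tmp/tmp6w8q3hdi,destination=/data/",
    "--restart=no",
]
# Lists "/tmp/tmp6w8q3hdi" unless that directory happens to exist on this machine.
not_found = missing_bind_sources(args)
```

If the path is reported as missing in the environment that runs Podman but exists where the test created it, the two may not share the same `/tmp`, which would point away from a permission problem.
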
windsource commented 2 months ago

I think the problem is somehow related to VS Code. When I use VS Code on a Manjaro VM and execute the system tests in a dev container, a lot of the tests fail. But when I use the devcontainer CLI to start the dev container and then execute the system tests, all of them pass.

inf17101 commented 2 months ago

I think the problem is somehow related to VS Code. When I use VS Code on a Manjaro VM and execute the system tests in a dev container, a lot of the tests fail. But when I use the devcontainer CLI to start the dev container and then execute the system tests, all of them pass.

I think that could be the issue, because inside the CI/CD pipeline they are all green all the time. But what can go wrong such that, for example, the tmp directories for the authorization tests are sometimes not created? :-D That is strange.

windsource commented 2 months ago

With VSCode Insiders the tests also pass.

windsource commented 1 month ago

With #368 and #369 we have two fixes for this problem. Currently all the tests pass, both locally and in CI.

krucod3 commented 1 week ago

We also need a proper fix for #404

krucod3 commented 4 days ago

@windsource mentioned that `podman system migrate` also needed to be run a couple of times. Maybe we can add this before the tests run.
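A rough sketch of how this could look, assuming "system migrate" refers to the `podman system migrate` command and that the tests are started via `tools/run_robot_tests.sh` as in the reproduction steps. The wrapper `run_system_tests` and its injectable `run` parameter are hypothetical and not part of the repository.

```python
import subprocess
from typing import Callable, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def run_system_tests(
    test_path: str = "tests/", run: Callable[[Sequence[str]], int] = _run
) -> int:
    """Run the migrate step first (assumed to be `podman system migrate`), then the tests."""
    migrate_rc = run(["podman", "system", "migrate"])
    if migrate_rc != 0:
        # Do not start the tests on top of a failed migration.
        return migrate_rc
    return run(["tools/run_robot_tests.sh", test_path])
```

Passing the runner as a parameter keeps the sketch checkable without Podman installed; in practice the same two steps could equally be placed at the top of the test script itself.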