Open inf17101 opened 2 months ago
I started investigating a little bit since three contributors have the same issues, all of them using WSL2.
First, some tests are failing because rootless ports are still kept open after the containers of a test have been deleted.
Port 8081/tcp6 is still open after the system test, and some system tests fail and report "port already in use":
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name
tcp 0 0 0.0.0.0:8000 0.0.0.0:* LISTEN 1000 4413747 122830/python3
tcp 0 0 127.0.0.1:40251 0.0.0.0:* LISTEN 1000 3385049 321/node
tcp6 0 0 :::8081 :::* LISTEN 1000 3450044 17172/rootlessport
Log output of some randomly picked failing system test:
[2024-08-30T11:55:10Z INFO ank_agent::workload::workload_control_loop] Retry '3' out of '20': Failed to create workload: 'nginx_from_manifest1.2ba2d59f84d66954e06fc5f5a59f64ad053dc2963bc404ccdf932d9fd15231ba.agent_A': 'Error: rootlessport listen tcp 0.0.0.0:8081: bind: address already in use. Execution of 'podman "run" "--detach" "--name" "nginx_from_manifest1.2ba2d59f84d66954e06fc5f5a59f64ad053dc2963bc404ccdf932d9fd15231ba.agent_A" "-p" "8081:80" "--mount=type=bind,source=/tmp/ankaios/agent_A_io/nginx_from_manifest1.2ba2d59f84d66954e06fc5f5a59f64ad053dc2963bc404ccdf932d9fd15231ba,destination=/run/ankaios/control_interface" "--label=name=nginx_from_manifest1.2ba2d59f84d66954e06fc5f5a59f64ad053dc2963bc404ccdf932d9fd15231ba.agent_A" "--label=agent=agent_A" "ghcr.io/eclipse-ankaios/tests/nginx:alpine-slim"''
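A minimal sketch for checking between tests whether the port has really been released, assuming Python is available in the dev container. The helper names `port_is_free` and `wait_until_free` are hypothetical and not part of Ankaios or the test suite; the port 8081 is taken from the log above.

```python
import socket
import time


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # SO_REUSEADDR ignores sockets lingering in TIME_WAIT, so only an
        # active listener (e.g. a leftover rootlessport) makes bind fail.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def wait_until_free(port: int, timeout: float = 10.0, interval: float = 0.2) -> bool:
    """Poll until the port is free or the timeout (in seconds) expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if port_is_free(port):
            return True
        time.sleep(interval)
    return port_is_free(port)


if __name__ == "__main__":
    # 8081 is the host port used by the failing nginx workloads.
    print("8081 free:", wait_until_free(8081, timeout=5.0))
```

Calling something like this in a test teardown would at least show whether the port is released late or never released at all.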
Next issue: the test "Test Ankaios Podman create a container with custom name" is also in the list of failing tests. It fails because of a mixture of "container storage is already in use" and the previously reported issue "rootlessport listen tcp 0.0.0.0:8081: bind: address already in use":
Agent log line:
[2024-08-30T11:57:15Z DEBUG ank_agent::runtime_connectors::podman::podman_runtime] Creating container failed, cleaning up. Error: 'Error: creating container storage: the container name "test_workload1" is already in use by 0cbb0cc7c1723692e70db68e629e5bd8b2727c5c05f4ece95a726b6e3b6d39c7. You have to remove that container to be able to reuse that name: that name is already in use, or use --replace to instruct Podman to do so.. Execution of 'podman "run" "--detach" "--name" "nginx.53ff05e05f7469d11069f2bca8b1e6328e1d63b22743032472de3b8831126479.agent_A" "-p" "8081:80" "--name" "test_workload1" "--mount=type=bind,source=/tmp/ankaios/agent_A_io/nginx.53ff05e05f7469d11069f2bca8b1e6328e1d63b22743032472de3b8831126479,destination=/run/ankaios/control_interface" "--label=name=nginx.53ff05e05f7469d11069f2bca8b1e6328e1d63b22743032472de3b8831126479.agent_A" "--label=agent=agent_A" "ghcr.io/eclipse-ankaios/tests/nginx:alpine-slim"''
Sometimes all authorization system tests are failing. The logs mention that the tmp directory which is mounted inside the control_interface_tester workload does not exist. I checked with the ls command, and indeed it is sometimes not created properly. It could also be a permission issue.
[2024-08-30T10:12:16Z DEBUG ank_agent::runtime_connectors::podman::podman_runtime] Creating container failed, cleaning up. Error: 'Error: statfs /tmp/tmp6w8q3hdi: no such file or directory. Execution of 'podman "run" "--detach" "--name" "controller.98435576aaff3af8f7d1673537cc80b0068b7edc5fe30c609aea021a684f2205.agent_A" "--mount=type=bind,source=/tmp/tmp6w8q3hdi,destination=/data/" "--restart=no" "--label=name=controller.98435576aaff3af8f7d1673537cc80b0068b7edc5fe30c609aea021a684f2205.agent_A" "--label=agent=agent_A" "ghcr.io/eclipse-ankaios/control_interface_tester:manual-build-1" "/data/commands.yaml" "/data/output.yaml"''
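To narrow down whether the directory is missing before podman is even started, one could check the bind-mount sources of the command up front. The following is only a sketch under that assumption; `missing_bind_sources` is a hypothetical helper, not something that exists in the agent or the test suite, and the argument list is copied from the log line above.

```python
import os


def missing_bind_sources(podman_args):
    """Return the bind-mount source paths in podman_args that do not exist."""
    missing = []
    for arg in podman_args:
        if not arg.startswith("--mount="):
            continue
        # "--mount=type=bind,source=/a,destination=/b" -> {"type": "bind", ...}
        options = dict(
            opt.split("=", 1)
            for opt in arg[len("--mount="):].split(",")
            if "=" in opt
        )
        if options.get("type") != "bind":
            continue
        # podman accepts both "source" and the short form "src"
        source = options.get("source") or options.get("src")
        if source and not os.path.exists(source):
            missing.append(source)
    return missing


if __name__ == "__main__":
    args = [
        "run", "--detach",
        "--mount=type=bind,source=/tmp/tmp6w8q3hdi,destination=/data/",
        "--restart=no",
    ]
    print("missing bind sources:", missing_bind_sources(args))
```

If the path is reported as missing here as well, the problem lies in how the test creates the tmp directory and not in podman itself.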
I think the problem is somehow related to VS Code. When I use VS Code on a Manjaro VM and execute the system tests in a dev container, a lot of the tests fail. But when I use the devcontainer CLI to start the dev container and then execute the system tests, all of them pass.
I think that could be the issue, because inside the CI/CD pipeline they are all green all the time. But what can go wrong such that, for example, the tmp directories for the authorization tests are sometimes not created? :-D This is strange then.
With VSCode Insiders the tests also pass.
With #368 and #369 we have two fixes for this problem. Currently all the tests pass, both locally and in CI.
We also need a proper fix for #404
@windsource mentioned that system migrate also had to be run a couple of times. Maybe we can add this before the tests run.
Currently, a lot of system tests fail locally.
After deleting the dev container, executing prune commands to clean up Docker completely, and building a new dev container, fewer system tests fail.
Current Behavior
System tests are failing due to different reasons.
Expected Behavior
All system tests shall pass.
Steps to Reproduce
tools/run_robot_tests.sh tests/
Context (Environment)
Logs
Additional Information
Final result
To be filled by the one closing the issue.