eclipse-bluechi / bluechi

Eclipse BlueChi is a systemd service controller intended for multi-node environments with a predefined number of nodes and with a focus on highly regulated ecosystems such as those requiring functional safety.
https://bluechi.readthedocs.io/en/latest/
GNU Lesser General Public License v2.1

test: tmt run - 14 tests failed #455

Closed: dougsland closed this issue 1 year ago

dougsland commented 1 year ago

Describe the bug

Looks like the test scripts need to be smart enough to clean up the environment before running nodes with the same name, or to generate random names.

# cat /etc/fedora-release
Fedora release 38 (Thirty Eight)

#  systemctl --user status podman.socket
● podman.socket - Podman API Socket
     Loaded: loaded (/usr/lib/systemd/user/podman.socket; enabled; p>
     Active: active (listening) since Sat 2023-08-12 01:29:44 EDT; 9>
   Triggers: ● podman.service
       Docs: man:podman-system-service(1)
     Listen: /run/user/0/podman/podman.sock (Stream)
     CGroup: /user.slice/user-0.slice/user@0.service/app.slice/podma>
# tmt clean && tmt run
/var/tmp/tmt/run-001

/plans/tier0
    discover
        how: fmf
        directory: /root/tests-tmt-bluechi/bluechi/tests
        filter: tier:0
        summary: 15 tests selected
    provision
        how: local
        multihost name: default-0
        arch: x86_64
        distro: Fedora Linux 38 (Workstation Edition)
        summary: 1 guest provisioned
    prepare
        queued push task #1: push on default-0

        push task #1: push on default-0

        queued prepare task #1: Set containers setup on default-0

        prepare task #1: Set containers setup on default-0
        how: shell
        name: Set containers setup
        overview: 1 script found

        queued pull task #1: pull on default-0

        pull task #1: pull on default-0

        summary: 1 preparation applied
    execute
        queued execute task #1: default-0 on default-0

        execute task #1: default-0 on default-0
        how: tmt
        progress:
        summary: 15 tests executed
    report
        how: junit
        output: /var/tmp/tmt/run-001/plans/tier0/report/default-0/junit.xml
        summary: 1 test passed and 14 tests failed
    finish

        summary: 0 tasks completed

total: 1 test passed and 14 tests failed

Looking for "error" in the logs:

# grep error -rni /var/tmp/tmt/run-001/

Failed to setup hirte container: 500 Server Error: Internal Server Error (creating container storage: the container name "hirte-controller" is already in use by aae832d80353b59e41cb90b8ecfbaa1716f1fa8877084a22447693eedc9dd8de. You have to remove that container to be able to reuse that name: that name is already in use)
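
A minimal sketch of the "clean the env first" idea, assuming the test scripts use the podman Python bindings (podman-py); the socket path and the name list below are only illustrative:

from podman import PodmanClient

# Names the local tests tend to leave behind; purely illustrative,
# the real list lives in the test scripts.
STALE_NAMES = ["hirte-controller", "node-foo", "node-bar"]

def clean_environment(uri="unix:///run/user/0/podman/podman.sock"):
    client = PodmanClient(base_url=uri)
    for name in STALE_NAMES:
        # the "name" filter matches substrings, so re-check the exact name
        for container in client.containers.list(all=True, filters={"name": name}):
            if container.name == name:
                container.remove(force=True)  # removes stopped or stuck containers too

Running something like this at the start of a local tmt run would avoid the "name is already in use" failure above.
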
dougsland commented 1 year ago

tmt-logs

dougsland commented 1 year ago

I will investigate this one.

dougsland commented 1 year ago

Moving to @mwperina as he started looking into this before me.

engelmi commented 1 year ago

First, this only affects locally run integration tests. In the GH CI this works fine - I don't know why, unfortunately.

As the error clearly states, the cause of the failure is that a container with a specific name (the controller in this case) already exists when we try to start another one. This can have multiple causes. For example, it happened to me that the first hirte-controller container started, but a subsequent command on that container failed because "the container is not ready". All tests after it then fail as well, since the leftover hirte-controller blocks them. The "container not ready" error should have been solved by waiting for condition=running, for example, but it wasn't - I am not sure whether this is a bug in or a misuse of the podman Python API.
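
For reference, the waiting I mean looks roughly like this (a sketch against the podman Python bindings; the exact wait() keywords can differ between versions, and the polling fallback is only an idea, not what the tests do today):

import time

def wait_until_running(container, timeout=30):
    try:
        # ask the libpod wait endpoint for the target condition
        container.wait(condition="running")
        return True
    except Exception:
        # fall back to polling the inspect data ourselves
        pass
    deadline = time.time() + timeout
    while time.time() < deadline:
        container.reload()  # refresh container.attrs from the API
        if container.attrs.get("State", {}).get("Status") == "running":
            return True
        time.sleep(1)
    return False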

One possible approach to avoid those cascading failures could be to not set a container name at all - which is fine since we only use the Python references to the containers anyway. We would then need to properly clean up those leftover containers. I don't know how to tackle the "container not ready" error, since it should not happen when waiting for the running condition - well, it should not. Do you have any ideas? @dougsland @mwperina
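
Roughly what I have in mind, again only a sketch on top of the podman Python bindings (image name and helper names are made up):

from podman import PodmanClient

def start_controller(client: PodmanClient, image="hirte-controller-image"):
    # No explicit name: podman generates a unique one, so a leftover
    # container from an earlier run can never block this one.
    return client.containers.run(image, detach=True)

def teardown(containers):
    # Cleanup via the stored Python references only; force-remove so even
    # a container stuck between states does not survive into the next run.
    for container in containers:
        try:
            container.remove(force=True)
        except Exception:
            pass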

dougsland commented 1 year ago
podman stop control
podman stop node-1
podman stop node-foo
podman stop node-bar
podman container prune

did the trick for now; if it appears again, I will investigate whether we can work around it.