canonical / checkbox

Checkbox is a testing framework used to validate device compatibility with Ubuntu Linux. It’s the testing tool developed for the purposes of the Ubuntu Certification program.
https://checkbox.readthedocs.io
GNU General Public License v3.0

Session does not continue if reconnecting to an agent after the controller stopped/crashed #888

Open pieqq opened 11 months ago

pieqq commented 11 months ago

Bug Description

While testing #859, I came across the following issue, likely in the Checkbox controller.

If the controller reconnects to an agent after the current job has finished, the session does not continue to the next job and instead stays stuck on the output of the current job.
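The difference between the observed and the expected behaviour can be contrasted with a toy model. Everything in it (`ToyAgent`, `rejoin_and_wait_only`, `rejoin_and_check_state`) is hypothetical and does not come from the Checkbox code base; it only illustrates the symptom under the assumption that the job ends while no controller is attached, and is not a diagnosis of where the actual bug lives.

```python
from dataclasses import dataclass


@dataclass
class ToyAgent:
    """Hypothetical agent: runs jobs in order and remembers the current job's state."""
    plan: list
    index: int = 0
    job_state: str = "running"  # "running" or "finished"

    def finish_current_job(self) -> None:
        self.job_state = "finished"

    def start_next_job(self) -> bool:
        """Advance to the next job; return False when the plan is exhausted."""
        if self.index + 1 >= len(self.plan):
            return False
        self.index += 1
        self.job_state = "running"
        return True


def rejoin_and_wait_only(agent: ToyAgent) -> str:
    """Rejoin that only shows the current job and never asks the agent to advance."""
    return agent.plan[agent.index]


def rejoin_and_check_state(agent: ToyAgent) -> str:
    """Rejoin that inspects the job state first and advances if the job already ended."""
    if agent.job_state == "finished":
        agent.start_next_job()
    return agent.plan[agent.index]


plan = ["pieq/test", "pieq/wrapup"]

stuck = ToyAgent(plan)
stuck.finish_current_job()              # job ends while the controller is away
print(rejoin_and_wait_only(stuck))      # pieq/test (observed: session stays here)

resumed = ToyAgent(plan)
resumed.finish_current_job()
print(rejoin_and_check_state(resumed))  # pieq/wrapup (expected: session moves on)
```

In this sketch the only difference between the two rejoin paths is whether the controller looks at the state of the job it rejoined before waiting for more output.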

To Reproduce

Setup

If needed, here are the steps I followed to set up my device to easily reproduce this issue:

Steps to set up the Checkbox controller and agent as well as some sample jobs and test plan

## Checkbox controller

On my laptop, I already have a virtual environment set up for Checkbox. I just point to your branch:

```
(venv) $ git switch solve-resume-on-remote
```

I use this venv for the Checkbox controller.

## Checkbox agent

For the Checkbox agent, I create an LXC container running 22.04:

```
$ lxc launch images:ubuntu/22.04 jammy
$ lxc shell jammy
```

The rest of the commands are run in the container:

```
# apt install python3.10-venv python3-virtualenv git
# git clone https://github.com/canonical/checkbox.git
# cd checkbox/
# git switch solve-resume-on-remote
```

I follow the [Contrib guide](https://github.com/canonical/checkbox/blob/main/CONTRIBUTING.md#testing) to get Checkbox installed in a venv. In the end, checkbox-cli lives in `/root/checkbox/checkbox-ng/venv/bin/checkbox-cli` and the providers are described in `/root/checkbox/checkbox-ng/venv/share/plainbox-providers-1`.

I put the following in `/etc/systemd/system/checkbox-ng.service`:

```
[Unit]
Description=Checkbox Remote Service
Wants=network.target

[Service]
ExecStart=/root/checkbox/checkbox-ng/venv/bin/checkbox-cli run-agent
SyslogIdentifier=checkbox-ng.service
Environment="XDG_CACHE_HOME=/var/cache/"
Environment="PROVIDERPATH=/root/checkbox/checkbox-ng/venv/share/plainbox-providers-1"
Restart=always
RestartSec=1
TimeoutStopSec=30
Type=simple

[Install]
WantedBy=multi-user.target
```

and I reload systemd and enable the checkbox-ng service:

```
# systemctl daemon-reload
# systemctl enable checkbox-ng.service
```

Now, everything is in place. I can start a remote session from the controller by running:

```
(venv) $ checkbox-cli control
```

## Sample jobs and test plan

In the 22.04 container, I create a new `pieq.pxu` file in `/root/checkbox/providers/base/units/` and put the following in it:

```
unit: job
id: pieq/test
command:
    for i in $(seq 1 30); do
        echo "Iteration $i/30..."
        sleep 1
    done
flags: simple noreturn

unit: job
id: pieq/wrapup
command: echo "Wrapping up..."
flags: simple

unit: test plan
id: pieq
_name: pieq
include:
    pieq/test
    pieq/wrapup
```

The `pieq/test` job runs for 30 seconds and prints its current iteration, so it's handy to see what's going on. It has the `noreturn` flag, but of course you can remove this flag if you want to test other use cases.

I need to restart the systemd service, otherwise this test plan will not be visible to Checkbox:

```
# systemctl restart checkbox-ng.service
```

## Launcher

In order to simulate a non-interactive test run, I create the following launcher file (`pieq.launcher`):

```
[launcher]
launcher_version = 1
app_id = com.canonical.certification:PR859
stock_reports = text

[test plan]
unit = com.canonical.certification::pieq
forced = yes

[test selection]
forced = yes

[ui]
type = silent

[transport:outfile]
type = stream
stream = stdout

[exporter:text]
unit = com.canonical.plainbox::text

[report:screen]
transport = outfile
exporter = text
```

I run it from the controller side with:

```
(venv) $ checkbox-cli control pieq.launcher
```
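Before starting a run, it can help to confirm that the launcher selects the intended test plan and is fully non-interactive. The sketch below parses an excerpt of the launcher with Python's standard `configparser`. That is an assumption made for illustration only: Checkbox has its own launcher parser, which may treat some syntax differently, and `check_launcher` is a hypothetical helper, not part of Checkbox.

```python
import configparser

# Excerpt of pieq.launcher; only the sections checked below are included.
LAUNCHER_TEXT = """\
[launcher]
launcher_version = 1
app_id = com.canonical.certification:PR859
stock_reports = text

[test plan]
unit = com.canonical.certification::pieq
forced = yes

[test selection]
forced = yes

[ui]
type = silent
"""


def check_launcher(text: str, expected_plan: str) -> list:
    """Return a list of problems that would change the plan or make the run interactive."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    problems = []
    if parser.get("test plan", "unit", fallback=None) != expected_plan:
        problems.append("test plan unit does not match")
    for section in ("test plan", "test selection"):
        if not parser.getboolean(section, "forced", fallback=False):
            problems.append(f"[{section}] is not forced")
    if parser.get("ui", "type", fallback=None) != "silent":
        problems.append("[ui] type is not silent")
    return problems


print(check_launcher(LAUNCHER_TEXT, "com.canonical.certification::pieq"))  # []
```

An empty list means the three conditions this reproduction relies on (plan unit, forced selection, silent UI) are present in the file.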

Test

Reconnecting to agent after the controller stopped/crashed :x:

One of the issues this should fix is #22, which mentions:

While testing is ongoing, restart your host computer.

So:

  1. Run Checkbox remote using the launcher, which starts pieq/test (which runs for 30 seconds):
(venv) $ checkbox-cli control <IP of my lxc container> pieq.launcher

→ The test starts running

  2. Close the terminal where the controller is running. Wait for 30 seconds, then try reconnecting to the agent:
(venv) $ checkbox-cli control 10.146.223.75
$PROVIDERPATH is defined, so following provider sources are ignored ['/usr/local/share/plainbox-providers-1', '/usr/share/plainbox-providers-1', '/home/pieq/.local/share/plainbox-providers-1', '/var/tmp/checkbox-providers-develop'] 
Connecting to 10.146.223.75:18871. Timeout: 600s
Rejoined session.
In progress: com.canonical.certification::pieq/test (1/2)
Iteration 17/30...
Iteration 18/30...
Iteration 19/30...
Iteration 20/30...
Iteration 21/30...
Iteration 22/30...
Iteration 23/30...
Iteration 24/30...
Iteration 25/30...
Iteration 26/30...
Iteration 27/30...
Iteration 28/30...
Iteration 29/30...
Iteration 30/30...

aaaaaaaaand nothing happens. The session never goes on to the next job (pieq/wrapup), and never finishes. This is because the job has finished running by the time we reconnect to the agent.
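The manual steps above could be scripted roughly as follows. This is a sketch under several assumptions: `checkbox-cli` is on the `PATH` of the controller's venv, killing the process is an acceptable stand-in for closing the terminal, the controller works without a terminal attached, and the output of `pieq/wrapup` ("Wrapping up...") shows up in the controller's output once the session moves on. `session_progressed` and `reproduce` are hypothetical helper names.

```python
import subprocess
import time

JOB_DURATION_S = 30  # pieq/test loops 30 times with a 1 s sleep


def session_progressed(controller_output: str) -> bool:
    """True if the output of the second job (pieq/wrapup) is visible."""
    return "Wrapping up..." in controller_output


def reproduce(agent_ip: str, launcher: str = "pieq.launcher") -> bool:
    """Start a session, kill the controller, then reconnect once the job has ended."""
    first = subprocess.Popen(["checkbox-cli", "control", agent_ip, launcher])
    time.sleep(5)               # give pieq/test time to start
    first.kill()                # stand-in for the controller stopping or crashing
    first.wait()
    time.sleep(JOB_DURATION_S)  # let pieq/test finish on the agent
    try:
        second = subprocess.run(
            ["checkbox-cli", "control", agent_ip],
            capture_output=True, text=True, timeout=60,
        )
        output = second.stdout
    except subprocess.TimeoutExpired as exc:
        # A hang ends up here; the partial output may be bytes or str.
        raw = exc.stdout or b""
        output = raw.decode(errors="replace") if isinstance(raw, bytes) else raw
    return session_progressed(output)


# Example (needs a reachable agent):
# print(reproduce("10.146.223.75"))
```

With the behaviour described in this issue, `reproduce` would be expected to hit the timeout and return `False`.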

Environment

Relevant log output

No response

Additional context

No response

syncronize-issues-to-jira[bot] commented 11 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/CHECKBOX-1078.

This message was autogenerated