Currently there are no open upstream issues matching this error: https://github.com/QubesOS/qubes-issues/issues?q=is%3Aissue+%22failed+to+return+clean+data%22 Given how frequently I see this message myself, we should report it upstream, so let's aim to collect more data here first. I'll post the next time I see the error; let's also look for opportunities to make our logging more informative (#521).
Since we have encountered this during a production install, @conorsch has committed to at least an initial investigation, time-boxed to 8 hours, during the 4/2-4/15 sprint.
Observed this today during an update of the sd-app-buster-template VM (running the current prod release, with the changes in https://github.com/freedomofpress/securedrop-workstation/tree/478-more-automatic-updates applied locally, which don't alter the update invocation). Given how frequently updates run, this further increases the potential severity of this issue. From /var/log/qubes/mgmt-sd-app-buster-template.log:
2020-04-08 12:03:48,928 calling 'state.sls update.qubes-vm'...
2020-04-08 12:04:35,283 output: sd-app-buster-template:
2020-04-08 12:04:35,284 output: ----------
2020-04-08 12:04:35,284 output: _error:
2020-04-08 12:04:35,284 output: Failed to return clean data
2020-04-08 12:04:35,284 output: retcode:
2020-04-08 12:04:35,284 output: None
2020-04-08 12:04:35,284 output: stderr:
2020-04-08 12:04:35,284 output: stdout:
2020-04-08 12:04:35,284 output: deploy
2020-04-08 12:04:35,284 exit code: 20
@conorsch has agreed to spend up to ~4 additional hours investigating this issue during the 4/22-5/6 sprint.
I did a timeboxed investigation. Though I've seen this error many times before, I did not see it today.
This error occurs when the Salt management stack cannot execute commands on, or read results back from, the target VM (the remote/minion machine). In plain Salt, salt-ssh would be the place to look; in our case it is the corresponding qrexec calls, which may not be returning data (perhaps a timeout), or some other error inside the VM. It is also possible that the qrexec service is not ready at all; a quick check for that is sketched below.
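As a quick sanity check (a sketch on my part, not taken from the logs above; the VM name is only an example), one can confirm from dom0 whether the qrexec agent in a given VM responds at all before digging into the Salt layer:
# dom0: exits quickly if the qrexec agent in the VM is up and accepting calls,
# and fails or hangs if the agent never connected -- the same path the Salt
# management calls depend on.
qvm-run --pass-io sd-app-buster-template true && echo "qrexec OK" || echo "qrexec not responding"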
The Qubes Salt logic occasionally fails with the message "Failed to return clean data". We've observed two categories of this error to date: one with "Request refused" on stderr (retcode 126), and one with "deploy" on stdout (retcode None). The notes below cover the "Request refused" category in the most detail; the timebox expired before the "stdout: deploy" category could be deeply investigated.
Example of the error, from the Salt logs /var/log/qubes/mgmt-sd-log-buster-template.log:
2020-04-29 08:48:42,287 calling 'state.sls update.qubes-vm'...
2020-04-29 08:52:05,744 output: sd-log-buster-template:
2020-04-29 08:52:05,746 output: ----------
2020-04-29 08:52:05,746 output: _error:
2020-04-29 08:52:05,746 output: Failed to return clean data
2020-04-29 08:52:05,747 output: retcode:
2020-04-29 08:52:05,747 output: 126
2020-04-29 08:52:05,747 output: stderr:
2020-04-29 08:52:05,747 output: Request refused
2020-04-29 08:52:05,747 output: stdout:
2020-04-29 08:52:05,747 exit code: 20
Inspecting the syslog via journalctl -a -b | grep sd-log-buster-template, one sees:
Apr 29 08:50:30 dom0 qubesd[2468]: Starting sd-log-buster-template
Apr 29 08:50:56 dom0 qubesd[2468]: Activating the sd-log-buster-template VM
Apr 29 08:51:56 dom0 qubesd[2468]: Start failed: Cannot connect to qrexec agent for 60 seconds, see /var/log/xen/console/guest-sd-log-buster-template.log for details
Then, inside the console logs /var/log/xen/console/guest-sd-log-buster-template.log:
Nothing of particular note. The "failed to start kernel modules" line is intriguing at first glance, but it occurs on every boot of that VM, which you can prove by comparing counts of that line to "reached target multi-user system":
[root@dom0 ~]# grep -ci 'failed to start.*load kernel modules' /var/log/xen/console/guest-sd-log-buster-template.log
6
[root@dom0 ~]# grep -ci 'reached target.*multi-user system' /var/log/xen/console/guest-sd-log-buster-template.log
6
The fact that the qrexec connection never succeeded means that Qubes deemed the VM boot a failure, even though we can see from the logs that the machine did start successfully. "Failed to return clean data" shows few hits on the Qubes issue tracker, but we see more for "cannot connect to qrexec agent".
We've previously noted that HVMs take significantly longer to boot than PVH VMs. To illustrate (see test script; a sketch of it follows the timings below):
[user@dom0 scripts]$ ./test-hvm-start-time.sh
Run time for test-hvm-1: 40s
Run time for test-pvh-1: 17s
Run time for test-hvm-2: 40s
Run time for test-pvh-2: 17s
Run time for test-hvm-3: 43s
Run time for test-pvh-3: 19s
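For reference, a minimal sketch of what such a timing script could look like (the actual test-hvm-start-time.sh isn't reproduced here, so the sequential ordering and the qvm-start/qvm-shutdown sequence are assumptions):
#!/bin/bash
# Sketch: measure cold-start time for each test VM from dom0.
# Assumes test-hvm-1..3 (virt_mode=hvm) and test-pvh-1..3 (virt_mode=pvh) already exist.
for vm_name in test-hvm-1 test-pvh-1 test-hvm-2 test-pvh-2 test-hvm-3 test-pvh-3; do
    qvm-shutdown --wait "$vm_name" 2>/dev/null || true   # ensure a cold start
    start=$(date +%s)
    qvm-start "$vm_name"                                  # returns once the VM is reported as started
    end=$(date +%s)
    echo "Run time for ${vm_name}: $((end - start))s"
    qvm-shutdown --wait "$vm_name"
done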
The HVMs clearly take ~20s longer than the PVH VMs to boot. That suggests we may want to increase the qrexec_timeout value on the SDW VMs, since they mostly use HVM for virt_mode (see the qvm-prefs sketch further below). The qrexec delays get worse when memory is constrained; for example, restricting maxmem=400 (to simulate a qubes.qmemman service failure, as sketched after these timings) produces much longer run times on both HVM and PVH:
[user@dom0 scripts]$ ./test-hvm-start-time.sh
Run time for test-hvm-1: 59s
Run time for test-pvh-3: 74s
Run time for test-hvm-2: 79s
Run time for test-pvh-1: 79s
Run time for test-hvm-3: 86s
Run time for test-pvh-2: 87s
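For reference, the memory restriction itself is just a dom0 preference change (a sketch; test-hvm-1 shown as an example, with the same applied to the other test VMs):
# dom0: cap the VM's memory ceiling to simulate memory pressure / a qmemman failure
qvm-prefs test-hvm-1 maxmem 400
qvm-prefs test-hvm-1 maxmem        # confirm the new value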
Since the default qrexec_timeout is 60s, delays longer than that will cause a VM to be marked as failed and then killed. In fact, updating the test script above to use sudo qubesctl --show-output --skip-dom0 --target $vm_name state.sls update.qubes-vm reproduces the error message in question, with the matching error in the mgmt logs:
2020-05-04 18:52:12,530 calling 'state.sls update.qubes-vm'...
2020-05-04 18:55:52,519 output: test-hvm-2:
2020-05-04 18:55:52,521 output: ----------
2020-05-04 18:55:52,521 output: _error:
2020-05-04 18:55:52,521 output: Failed to return clean data
2020-05-04 18:55:52,521 output: retcode:
2020-05-04 18:55:52,521 output: 126
2020-05-04 18:55:52,522 output: stderr:
2020-05-04 18:55:52,522 output: Request refused
2020-05-04 18:55:52,522 output: stdout:
2020-05-04 18:55:52,522 exit code: 20
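One possible mitigation, sketched here only (not something we've agreed on yet; assumes the standard qvm-prefs property interface in dom0), would be to raise qrexec_timeout on the HVM-based SDW VMs so that slow boots aren't treated as start failures:
# dom0: inspect and raise the per-VM qrexec connection timeout (default 60s)
qvm-prefs sd-log-buster-template qrexec_timeout        # show the current value
qvm-prefs sd-log-buster-template qrexec_timeout 120    # allow up to 120s for the agent to connect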
Example failure (the "stdout: deploy" category):
2020-04-15 11:58:47,546 calling 'state.highstate'...
2020-04-15 11:59:32,975 output: sd-log-buster-template:
2020-04-15 11:59:32,976 output: ----------
2020-04-15 11:59:32,976 output: _error:
2020-04-15 11:59:32,976 output: Failed to return clean data
2020-04-15 11:59:32,976 output: retcode:
2020-04-15 11:59:32,976 output: None
2020-04-15 11:59:32,976 output: stderr:
2020-04-15 11:59:32,977 output: stdout:
2020-04-15 11:59:32,977 output: deploy
2020-04-15 11:59:32,977 exit code: 20
This failure mode has been quite difficult to reproduce reliably. To track down occurrences locally, run in dom0:
sudo grep -rPI -B10 'output:\s+deploy' /var/log
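A broader variant (my own suggestion, not part of the original notes) that also catches the "Request refused" category:
sudo grep -rI -B10 -e 'Failed to return clean data' -e 'Request refused' /var/log/qubes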
Further investigation is required to establish a reliable reproduction of the issue.
This is very useful, @conorsch, thanks for the super detailed recap. For the "Request refused" errors, should we wait for the qmemman fix to land and then re-test to see if the issue still occurs? Or do you think there's a case for adjusting timeout values regardless?
Per the above, we've agreed to defer additional investigation on this for now, at least until the qmemman fixes land.
Seeing this today during an update of fedora-31 (with stdout deploy); here's the full output from the management VM log:
https://gist.github.com/eloquence/58a937fb4c59607c88f2e17bbbc1972b
(This is with 0.3.0-rc2, staging; without qmemman fix manually applied)
No reports of this issue since the qmemman fixes landed upstream; proposing to close.
Closing per above.
Seen previously during QA in the securedrop-admin --apply stage, and now also once during a prod install: the install sometimes fails during the sd-log stage with a "Failed to return clean data" error: