freedomofpress / securedrop-workstation

Qubes-based SecureDrop Journalist Workstation environment for submission handling
GNU Affero General Public License v3.0

"Failed to return clean data" in sd-log stage #519

Closed · eloquence closed this issue 3 years ago

eloquence commented 4 years ago

This was seen previously during QA in the securedrop-admin --apply stage, and has now also occurred once during a prod install: the install sometimes fails during the sd-log stage with a "Failed to return clean data" error:

sd-log:
      ----------
      _error:
          Failed to return clean data
      retcode:
          None
      stderr:
      stdout:
          deploy
sd-gpg:
conorsch commented 4 years ago

Currently there are no open upstream issues that match: https://github.com/QubesOS/qubes-issues/issues?q=is%3Aissue+%22failed+to+return+clean+data%22

Given how frequently I see this message myself, we're obligated to report upstream, so let's aim to collect more data here. I'll make sure to post the next time I see the error; let's also look for opportunities to make our logging more informative (#521).

eloquence commented 4 years ago

Since we have encountered this during a production install, @conorsch has committed to at least an initial investigation, time-boxed to 8 hours, during the 4/2-4/15 sprint.

eloquence commented 4 years ago

Observed this today during an update of the sd-app-buster-template VM (running the current prod release, with the changes from https://github.com/freedomofpress/securedrop-workstation/tree/478-more-automatic-updates applied locally, which don't alter the update invocation). Given the frequency of updates, this further increases the potential severity of the issue. From /var/log/qubes/mgmt-sd-app-buster-template.log:

2020-04-08 12:03:48,928 calling 'state.sls update.qubes-vm'...
2020-04-08 12:04:35,283 output: sd-app-buster-template:
2020-04-08 12:04:35,284 output:     ----------
2020-04-08 12:04:35,284 output:     _error:
2020-04-08 12:04:35,284 output:         Failed to return clean data
2020-04-08 12:04:35,284 output:     retcode:
2020-04-08 12:04:35,284 output:         None
2020-04-08 12:04:35,284 output:     stderr:
2020-04-08 12:04:35,284 output:     stdout:
2020-04-08 12:04:35,284 output:         deploy
2020-04-08 12:04:35,284 exit code: 20
eloquence commented 4 years ago

@conorsch has agreed to continue to spend up to ~4 hours on further investigating this issue during the 4/22-5/6 sprint.

kushaldas commented 4 years ago

I did a timeboxed investigation. Though I have seen this error many times before, I did not see it today.

This error occurs when the Salt management stack cannot execute on, or read results back from, the remote VM (the minion). In vanilla Salt, salt-ssh would be the place to look; in our case the transport is the corresponding qrexec calls, which may not be returning data (perhaps due to a timeout) or may be hitting other errors inside the VM. It's also possible the qrexec service isn't ready at all.
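As a quick way to confirm whether the qrexec channel to a given VM is responsive at all (as opposed to a Salt-level problem), one could run something like the following in dom0. This is only a sketch; the VM name is just an example.

```bash
#!/bin/bash
# Sketch: check from dom0 whether the qrexec agent in a target VM responds.
# The VM name is illustrative.
set -euo pipefail

vm="sd-log-buster-template"

# Start the VM if it isn't already running.
qvm-start --skip-if-running "$vm"

# Round-trip a trivial command over qrexec. If the agent isn't ready,
# this blocks until the qrexec timeout and then fails.
if qvm-run --pass-io "$vm" 'echo qrexec-ok'; then
    echo "qrexec to $vm is healthy"
else
    echo "qrexec to $vm failed; check /var/log/xen/console/guest-$vm.log" >&2
fi
```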

conorsch commented 4 years ago

Salt connection failures

The Qubes Salt logic occasionally fails with a message "Failed to return clean data". There are two categories of this error that we've observed to date:

  1. "Request refused" - #495
  2. "stdout: deploy" - #519

The following goes into the most detail on the "Request refused" category; the timebox expired before the "stdout: deploy" category could be investigated in depth.

Request refused

Example of the error, from the Salt log /var/log/qubes/mgmt-sd-log-buster-template.log:

2020-04-29 08:48:42,287 calling 'state.sls update.qubes-vm'...
2020-04-29 08:52:05,744 output: sd-log-buster-template:
2020-04-29 08:52:05,746 output:     ----------
2020-04-29 08:52:05,746 output:     _error:
2020-04-29 08:52:05,746 output:         Failed to return clean data
2020-04-29 08:52:05,747 output:     retcode:
2020-04-29 08:52:05,747 output:         126
2020-04-29 08:52:05,747 output:     stderr:
2020-04-29 08:52:05,747 output:         Request refused
2020-04-29 08:52:05,747 output:     stdout:
2020-04-29 08:52:05,747 exit code: 20

Inspecting the syslog via journalctl -a -b | grep sd-log-buster-template, one sees:

Apr 29 08:50:30 dom0 qubesd[2468]: Starting sd-log-buster-template
Apr 29 08:50:56 dom0 qubesd[2468]: Activating the sd-log-buster-template VM
Apr 29 08:51:56 dom0 qubesd[2468]: Start failed: Cannot connect to qrexec agent for 60 seconds, see /var/log/xen/console/guest-sd-log-buster-template.log for details

Then, inside the console logs /var/log/xen/console/guest-sd-log-buster-template.log:

Console logs for failed template boot:

```
[    6.892610] piix4_smbus 0000:00:01.3: SMBus Host Controller not enabled!
Setting up swapspace version 1, size = 1073737728 bytes
UUID=7b756995-929f-4d44-9521-b72e45ad8a4e
/dev/xvda3: clean, 136214/643376 files, 977812/2569467 blocks
[    9.162257] grsec: time set by /usr/lib/systemd/systemd[systemd:1] uid/euid:0/0 gid/egid:0/0, parent /[swapper/0:0] uid/euid:0/0 gid/egid:0/0
[FAILED] Failed to start Load Kernel Modules.
See 'systemctl status systemd-modules-load.service' for details.
         Starting Qubes DB agent...
         Starting Apply Kernel Variables...
[  OK  ] Started Create Static Device Nodes in /dev.
         Starting udev Kernel Device Manager...
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Started udev Coldplug all Devices.
         Starting Helper to synchronize boot up for ifupdown...
[  OK  ] Started Journal Service.
         Starting Flush Journal to Persistent Storage...
[  OK  ] Started Qubes DB agent.
         Starting Init Qubes Services settings...
         Starting Load/Save Random Seed...
[  OK  ] Started Apply Kernel Variables.
[  OK  ] Started Helper to synchronize boot up for ifupdown.
[  OK  ] Started Flush Journal to Persistent Storage.
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Started udev Kernel Device Manager.
[  OK  ] Started Init Qubes Services settings.
         Starting Initialize and mount /rw and /home...
         Starting Adjust root filesystem size...
[  OK  ] Started Adjust root filesystem size.
[  OK  ] Started Initialize and mount /rw and /home.
[  OK  ] Reached target Local File Systems.
         Starting Load AppArmor profiles...
         Starting Create Volatile Files and Directories...
         Starting Early Qubes VM settings...
[  OK  ] Started Create Volatile Files and Directories.
[  OK  ] Reached target System Time Synchronized.
         Starting Update UTMP about System Boot/Shutdown...
[  OK  ] Started Update UTMP about System Boot/Shutdown.
[  OK  ] Started Load AppArmor profiles.
[  OK  ] Started Entropy daemon using the HAVEGE algorithm.
         Starting Raise network interfaces...
[  OK  ] Started Early Qubes VM settings.
[  OK  ] Started Raise network interfaces.
[  OK  ] Found device /dev/hvc0.
[  OK  ] Found device /dev/xvdc1.
         Activating swap /dev/xvdc1...
[  OK  ] Activated swap /dev/xvdc1.
[  OK  ] Reached target Swap.
[  OK  ] Reached target System Initialization.
[  OK  ] Started Periodically check for updates.
[  OK  ] Started Update system time each 6h.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[  OK  ] Started Daily man-db regeneration.
[   10.235268] Error: Driver 'pcspkr' is already registered, aborting...
[  OK  ] Listening on Forward conne… updates proxy over Qubes RPC.
[  OK  ] Started CUPS Scheduler.
[  OK  ] Reached target Paths.
[  OK  ] Started Daily rotation of log files.
[  OK  ] Started Daily Cleanup of Temporary Directories.
[  OK  ] Reached target Timers.
[  OK  ] Listening on CUPS Scheduler.
[  OK  ] Reached target Sockets.
[  OK  ] Reached target Basic System.
         Starting Restore /etc/reso… the ppp link was shut down...
         Starting Qubes base firewall settings...
         Starting Login Service...
         Starting Qubes memory information reporter...
         Starting Xen driver domain device daemon...
         Starting System Logging Service...
         Starting CUPS Scheduler...
[  OK  ] Started D-Bus System Message Bus.
         Starting WPA supplicant...
[  OK  ] Started System Logging Service.
[  OK  ] Started Restore /etc/resol…re the ppp link was shut down.
[  OK  ] Started Qubes memory information reporter.
[  OK  ] Started Xen driver domain device daemon.
[  OK  ] Started CUPS Scheduler.
         Starting Rotate log files...
         Starting Daily man-db regeneration...
         Starting Qubes remote exec agent...
         Starting Qubes GUI Agent...
[  OK  ] Started Login Service.
[  OK  ] Started WPA supplicant.
[  OK  ] Started Qubes remote exec agent.
[  OK  ] Started Update time from ClockVM.
[  OK  ] Started Qubes GUI Agent.
         Stopping CUPS Scheduler...
[  OK  ] Stopped CUPS Scheduler.
         Starting CUPS Scheduler...
[  OK  ] Started CUPS Scheduler.
[  OK  ] Started Rotate log files.
[   10.699249] grsec: time set by /usr/bin/date[date:532] uid/euid:0/0 gid/egid:0/0, parent /usr/lib/qubes/qubes-sync-clock[qubes-sync-cloc:507] uid/euid:0/0 gid/egid:0/0
[  OK  ] Started Daily man-db regeneration.
[  OK  ] Started Qubes base firewall settings.
         Starting Qubes misc post-boot actions...
[  OK  ] Reached target Network.
         Starting /etc/rc.local Compatibility...
         Starting Permit User Sessions...
         Starting Advanced key-value store...
[  OK  ] Started Permit User Sessions.
[  OK  ] Started /etc/rc.local Compatibility.
[  OK  ] Created slice User Slice of UID 1000.
         Starting User Runtime Directory /run/user/1000...
[  OK  ] Started Serial Getty on hvc0.
[  OK  ] Started Getty on tty1.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started User Runtime Directory /run/user/1000.
         Starting User Manager for UID 1000...
[  OK  ] Started Qubes misc post-boot actions.
[  OK  ] Started Advanced key-value store.
[  OK  ] Reached target Multi-User System.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Started Update UTMP about System Runlevel Changes.
[  OK  ] Started User Manager for UID 1000.
[  OK  ] Started Session c3 of user user.
[  OK  ] Started Session c1 of user user.
[  OK  ] Started Session c2 of user user.

Debian GNU/Linux 10 sd-log-buster-template hvc0

sd-log-buster-template login: [   18.391892] reboot: Power down
```

Nothing of particular note. The "Failed to start Load Kernel Modules" line is intriguing at first glance, but it occurs on every boot of that VM, which you can confirm by comparing the count of that line to the count of "Reached target Multi-User System":

[root@dom0 ~]# grep -ci 'failed to start.*load kernel modules' /var/log/xen/console/guest-sd-log-buster-template.log
6
[root@dom0 ~]# grep -ci 'reached target.*multi-user system' /var/log/xen/console/guest-sd-log-buster-template.log
6

The fact that the qrexec connection never succeeded means that Qubes deemed the VM boot a failure, even though we can see from the logs that the machine did start successfully. "Failed to return clean data" shows few hits on the Qubes issue tracker, but there are more hits for "cannot connect to qrexec agent".

We've previously noted that HVMs take significantly longer to boot. To illustrate (see test script):

[user@dom0 scripts]$ ./test-hvm-start-time.sh
Run time for test-hvm-1: 40s
Run time for test-pvh-1: 17s
Run time for test-hvm-2: 40s
Run time for test-pvh-2: 17s
Run time for test-hvm-3: 43s
Run time for test-pvh-3: 19s
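
The test script itself isn't reproduced in this thread, but a rough sketch of how such timings could be gathered in dom0 might look like the following. The VM names are assumptions, standing in for pre-created test qubes with virt_mode set to hvm and pvh respectively.

```bash
#!/bin/bash
# Sketch: compare start-up times of HVM vs. PVH test qubes from dom0.
# Assumes the test qubes named below already exist, based on the same template.
set -euo pipefail

for vm in test-hvm-1 test-pvh-1 test-hvm-2 test-pvh-2; do
    qvm-shutdown --wait "$vm" 2>/dev/null || true
    start=$(date +%s)
    qvm-start "$vm"
    # Wait until qrexec answers, i.e. the VM is actually usable.
    qvm-run --pass-io "$vm" 'true'
    end=$(date +%s)
    echo "Run time for ${vm}: $((end - start))s"
done
```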

The HVMs clearly take ~20s longer than the PVH VMs to boot. That suggests we may want to increase the qrexec_timeout value on the SDW VMs, since they mostly use HVM for virt_mode. The problem of qrexec delays is worse when memory is constrained; for example, restricting maxmem=400 (to simulate a qubes.qmemman service failure) yields much longer run times for both HVM and PVH:

[user@dom0 scripts]$ ./test-hvm-start-time.sh
Run time for test-hvm-1: 59s
Run time for test-pvh-3: 74s
Run time for test-hvm-2: 79s
Run time for test-pvh-1: 79s
Run time for test-hvm-3: 86s
Run time for test-pvh-2: 87s

Since the default timeout is 60s, any delay longer than that will cause the VM to be marked as failed and then killed. In fact, updating the test script above to use sudo qubesctl --show-output --skip-dom0 --target $vm_name state.sls update.qubes-vm reproduces the error message in question:

(Screenshot attached: test-hvm-2-request-refused)

With the matching error in the mgmt logs:

2020-05-04 18:52:12,530 calling 'state.sls update.qubes-vm'...
2020-05-04 18:55:52,519 output: test-hvm-2:
2020-05-04 18:55:52,521 output:     ----------
2020-05-04 18:55:52,521 output:     _error:
2020-05-04 18:55:52,521 output:         Failed to return clean data
2020-05-04 18:55:52,521 output:     retcode:
2020-05-04 18:55:52,521 output:         126
2020-05-04 18:55:52,522 output:     stderr:
2020-05-04 18:55:52,522 output:         Request refused
2020-05-04 18:55:52,522 output:     stdout:
2020-05-04 18:55:52,522 exit code: 20

stdout: deploy

Example failure:

2020-04-15 11:58:47,546 calling 'state.highstate'...
2020-04-15 11:59:32,975 output: sd-log-buster-template:
2020-04-15 11:59:32,976 output:     ----------
2020-04-15 11:59:32,976 output:     _error:
2020-04-15 11:59:32,976 output:         Failed to return clean data
2020-04-15 11:59:32,976 output:     retcode:
2020-04-15 11:59:32,976 output:         None
2020-04-15 11:59:32,976 output:     stderr:
2020-04-15 11:59:32,977 output:     stdout:
2020-04-15 11:59:32,977 output:         deploy
2020-04-15 11:59:32,977 exit code: 20

The failure here has been quite difficult to reproduce reliably. In order to track down occurrences locally, run in dom0:

sudo grep -rPI -B10 'output:\s+deploy' /var/log 

Further investigation required to establish a reliable reproduction of the issue.
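
One way to hunt for a reproduction would be to re-run the update state in a loop and stop when the error appears. This is a sketch only, reusing the qubesctl invocation above; the VM name and iteration count are arbitrary.

```bash
#!/bin/bash
# Sketch: repeatedly apply the update state to one VM until the
# "Failed to return clean data" error shows up. Values are illustrative.
vm="sd-log-buster-template"

for i in $(seq 1 20); do
    if sudo qubesctl --show-output --skip-dom0 --target "$vm" \
            state.sls update.qubes-vm 2>&1 \
            | grep -q 'Failed to return clean data'; then
        echo "Reproduced on iteration $i"
        break
    fi
done
```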

eloquence commented 4 years ago

This is very useful, @conorsch, thanks for the super detailed recap. For the "Request refused" errors, should we wait for the qmemman fix to land and then re-test to see if the issue still occurs? Or do you think there's a case for adjusting timeout values regardless?

eloquence commented 4 years ago

Per the above, we've agreed to defer additional investigation on this for now, at least until the qmemman fixes land.

eloquence commented 4 years ago

Saw this today during an update of fedora-31 (with "stdout: deploy"); here's the full output from the management VM log: https://gist.github.com/eloquence/58a937fb4c59607c88f2e17bbbc1972b

(This is with 0.3.0-rc2, staging; without qmemman fix manually applied)

eloquence commented 3 years ago

No reports of this issue since the qmemman fixes landed upstream; proposing to close.

eloquence commented 3 years ago

Closing per above.