Intermittent Tor connection failures on Tails 4.20 when the network hook script runs

freedomofpress / securedrop

GitHub repository for the SecureDrop whistleblower platform. Do not submit tips here!

https://securedrop.org/

Other

3.62k stars 685 forks source link

Intermittent Tor connection failures on Tails 4.20 when the network hook script runs #5952

Open zenmonkeykstop opened 3 years ago

zenmonkeykstop commented 3 years ago

Description

With SD set up on Tails, a network hook script (or rather the securedrop_init.py script run by the hook) runs every time a network interface (that's not the loopback) comes up. This script copies Tor config changes into place and bounces Tor, to allow access to authenticated hidden services (JI, SSH).

In Tails 4.19rc1, Tor is now started manually by the user via the new Tor Connection app (TCA). Depending on timing, the network hook script may restart Tor while TCA is also starting it. Subsequent network calls in the script fails, meaning that the hook fails overall. TCA interprets this as Tor failing - this can result in a failed connection or the TCA reconnecting using bridges. This doesn't seem to happen for everybody - if TCA connects quickly enough then it isn't an issue.

Steps to Reproduce

Set up an Admin Workstation on Tails 4.19rc1
bounce the network connection, then connect to Tor using TCA
repeat until broken (if possible, on a throttled network connection.)

Expected Behavior

Tor connects without issue.

Actual Behavior

TCA assumes vanilla Tor connections don't work, falls back to default bridges. Network hook script not run completely.

Comments

Adding a time.sleep(30) after the tor restart in the hook seems to alleviate the issue. It might make sense have the script watch for Tor coming back up before proceeding.

eloquence commented 3 years ago

Still pending reproduction. At minimum during this sprint we may see another RC (and then the final at the end of the sprint period) and can do some additional testing here.

eloquence commented 3 years ago

The changes to the connection wizard have been deferred to Tails 4.20, due on 2021-07-13. Re-milestoning this issue to the next release.

eloquence commented 3 years ago

The Tor Connection Assistant is part of Tails 4.20 (I've re-titled the issue accordingly). I've not been able to reproduce the original issue, but given how timing-dependent it is, it's quite possible that it may still arise.

What I am seeing is a consistent brief flash of a connection error, the first time I connect to Tor (using defaults, without bridges):

torbad

This is likely triggered by our Tor restart here: https://github.com/freedomofpress/securedrop/blob/develop/install_files/ansible-base/roles/tails-config/files/securedrop_init.py#L91-L95

The assistant is smart enough to pick up that the connection was temporarily lost. It's also smart enough to recognize the recovery, and the error goes away. Still, this has significant potential for confusion.

eloquence commented 3 years ago

Here's what I've learned after another 90 minutes or so of investigation:

I've still not been able to reproduce the original issue, but I am consistently able to reproduce the intermittent error above. Since it is only transiently displayed to the user, and all connections work as expected even after subsequent bounces, it is IMO not necessarily urgent to resolve.
Tails initially sets DisableNetwork to 1 until the network is configured, and its TCA network hook waits until that's flipped to 0 before it completes. For this reason, the behavior on initial run is different than after subsequent bounces.
I only see the error above on initial run, not on subsequent bounces. On subsequent bounces, our hook fires while TCA connects, but I've not seen concurrency issues even after repeated runs.
To resolve both issues (assuming the original one may still arise), we could wait until the user exits the TCA before firing our network hook. TCA itself simply checks the process list via pgrep to determine if it is running, and we could do the same in our hook prior to the hook firing. As a result, the user would have to proactively close the window before SecureDrop-specific changes are applied, which is not ideal, but would sidestep concurrency issues.

For now, I'll open a docs PR with a screenshot of the TCA, and a temporary note about the error behavior above that users may observe.

zenmonkeykstop commented 3 years ago

Thanks for following up! That does seem like a related behaviour to my original report - it could just come down to timing/network speed. The fix of waiting for the TCA to finish before firing the network hook makes sense if we can do it, though I'd worry about users never closing the TCA window and then getting confused as to why their shortcuts aren't working. Will set up a 4.20 stick and attempt to reproduce both.

zenmonkeykstop commented 3 years ago

I'm seeing the error flash now as described above, no sign of the earlier more serious error. A fix would still be good but it may not be as urgent.

eloquence commented 3 years ago

I'm still seeing the flash error on Tails 4.21, but only on initial connect, not on reconnect. I'm also sometimes seeing a more serious error which is likely related: the Tails built-in update check complains that it cannot connect (it does not retry). In short, restarting Tor exactly at the same time that Tails is doing its best to connect to Tor and run its own update checks is probably not a sustainable way to go.

In this commit I am switching it to reload instead of restart (along with addressing #6042) which works fine -- in testing so far, it seems to fix both errors. However, this may not be a good long term solution either, as reloading an auth configuration is not supported if sandbox mode is enabled:

https://manpages.debian.org/testing/tor/torrc.5.en.html When the Sandbox is 1, the following options can not be changed when tor is running: Address, ConnLimit, CookieAuthFile, DirPortFrontPage, ExtORPortCookieAuthFile, Logs, ServerDNSResolvConfFile, ClientOnionAuthDir (and any files in it won’t reload on HUP signal).

Meanwhile, it looks like Tails is preparing to re-enable said sandbox mode: https://gitlab.tails.boum.org/tails/tails/-/issues/18237

If I understand correctly, the reload fix will basically stop working the moment that's done.

One other strategy that may be worth exploring:

Split the securedrop_init.py script into two, one for updating the Tor config, one for checking for updates
Run the Tor config script at startup and on ./securedrop-admin tailsconfig; run the update checker via network hook

In prelim testing, it seems to work fine to make the torrc additions before a connection is established; the additions appear to be preserved. It's only the update check portion of securedrop_init.py that requires a connection to run successfully.

eloquence commented 3 years ago

Run the Tor config script at startup

As far as I can tell, this would require running a script as root on startup, for which Tails does not appear to provide any facilities (.config/autostart can be persisted for running .desktop files as amnesia) - please let me know if I'm missing a way to do this.

Otherwise I think it would be best to ask upstream for implementation advice here, to make sure we handle our setup logic in a way that does not interfere with Tails' setup and update checks.

eloquence commented 3 years ago

I've tested the behavior in Tails 4.22 and still observe the following issues:

intermittent errors when Tails checks for updates, due to the enforced Tor restart interfering with said check
brief appearance of the (updated) error screen on network reconnects after the first one.

The first is the more serious issue and IMO a good reason to try to resolve this at least in the long run.

As expected, Tails 4.22 re-enabled Sandbox mode for Tor, which is incompatible with dynamically reloading our desired configuration changes without restarting the Tor service. So that easy option seems unavailable.

Per brief discussion with Kev today, I've investigated using the pre-up.d phase of the NetworkManager dispatcher to run the hook. This requires changes to our persistence configuration to create a symlink in a new location.

Because the network is not fully up at this stage, it would also require splitting the hook into two: 1) the Tor restart for the configuration change, 2) the update check, which can run as before while other stuff is going on.

Unfortunately, with Tails 4.22, the TCA enters a persistent error state if our hook runs before it tries to do its thing (I believe I was able to run it without issues prior to running TCA in previous Tails versions). That may be resolvable with a closer read of the TCA code, so I'm still cautiously optimistic that the pre-up.d approach can be made to work.

I've also investigated if Tails' shell functions may be of service in monitoring the state of the system before we do things, but so far I don't see how -- AFAICT, to safely restart Tor, we have to do so either before TCA runs or after the TCA window is closed by the user (because it actively polls for the Tor status while it is open) and the Tails update check has run.

zenmonkeykstop commented 3 years ago

I think if the pre-up script works, we don't need to do a tor restart - the pre-up would pop the auth-private files into place, Tails' own network hook would bounce Tor, and we'd do the upgrade check once it was complete. This would probably also mean a couple of changes in the tailsconfig Ansible run as well tho.

zenmonkeykstop commented 3 years ago

After spending some time with this issue I'm inclined to defer it:

on a 4.20 Tails install, I can't replicate the failed Tails updater, the updater check is proceeding successfully for me 100% of the time.
The initial connect is also good on 4.20 and 4.22, with no errors. I do see a momentary failure on network bumps but it corrects itself immediately

Splitting the securedrop_init.py script and executing the config part before the TCA comes up and the GUI updater part afterwards also works fine on initial connects - but once tca.service is started, any subsequent network bumps fail leaving the system torrc locked with DisableNetwork 1 set. In order to get the split to work as expected we'd probably need to disable the TCA service while we updated the config, which seems dodgy, and in light of the difficulty in reproducing the Tails updater error, probably not worth the effort.

A good fix for this overall would involve:

updating Tails (or our customisations) to persist the Tor client_auth directory and to include the necessary directive in the default torrc, so we don't need to update the tor config via securedrop_init.py (this would be a good general change IMO)
setting up a systemd service for the GUI updater, with dependencies set up such that it's only started once the Tor network is configured and up.

The scope of said good fix is probably too much to make it into the 2.1.0 milestone.