OpenVPN / tap-windows6

Windows TAP driver (NDIS 6)
Other
785 stars 237 forks source link

constants.h: make driver not halt on suspend #86

Closed lstipakov closed 4 years ago

lstipakov commented 5 years ago

This prevents ERROR_OPERATION_ABORTED from happening on wakeup, which causes openvpn2 to reconnect and openvpn3 to stop.

Signed-off-by: Lev Stipakov lev@openvpn.net

cron2 commented 5 years ago

Hi,

On Wed, May 22, 2019 at 03:03:59PM +0000, Lev Stipakov wrote:

This prevents ERROR_OPERATION_ABORTED from happening on wakeup, which causes openvpn2 to reconnect and openvpn3 to stop.

Looks good, but I have no idea what this might break...

@selvanair: is this as good as it sounds?

gert -- "If was one thing all people took for granted, was conviction that if you feed honest figures into a computer, honest figures come out. Never doubted it myself till I met a computer with a sense of humor." Robert A. Heinlein, The Moon is a Harsh Mistress

Gert Doering - Munich, Germany gert@greenie.muc.de

lstipakov commented 5 years ago

That is what wintun also does - https://github.com/WireGuard/wintun/blob/master/wintun.c#L994

I believe after that change we won't necessary need this workaround

https://github.com/OpenVPN/openvpn/commit/ea66a2b5cdb21422139c421b4d3733e1c1c3937e

which basically makes fatal error, which happens on Windows resume (and maybe in other cases), nonfatal.

selvanair commented 5 years ago

I was under the impression that NDIS requires the driver to halt on suspend but I could be wrong. This used to cause OpenVPN2 also exit on suspend needing some hackery in the GUI etc.

We improved that by making openvpn2 do a delayed restart on tun/tap I/O abort so that on resume from sleep the connection just attempts again.

commit 0d4ba251879c702b9474e26ff73a4f559d922d4f Author: Selva Nair selva.nair@gmail.com Date: Wed Nov 4 13:59:38 2015 -0500

Fix termination when windows suspends/sleeps

When TUN/TAP I/O operation is aborted, restart with a SIGHUP instead of
terminate. The abort error from TAP is often triggered by system suspend
which is fully recoverable on resume. Catastrophic events will get caught
later during the restart.  This solves the abnormal termination during
suspend/resume.

Possibly OpenVPN3 could do the same.

Instead, if we opt to accept this PR it needs to be tested against OpenVPN2.

lstipakov commented 5 years ago

Documentation says that "Drivers that rely on hardware-maintained state should not set this flag." I think this is not our case?

selvanair commented 5 years ago

Unlike wintun tapwindows6 behaves like a network card though its virtual, so not sure how to interpret the docs. Anyway I leave the question of halt is needed or not to the driver experts..

Whether the halt is required or not, I do not think openvpn.exe should trigger FATAL errors unless absolutely there is no way to continue. A restart is often the right approach. Further, the current situation leads to a prompt restart of the connection on resume. A laptop that goes to sleep for, say, an hour definitely needs to restart the connection on resume as the server would have disconnected by then. In the current code the reconnect is triggered just before the suspend and nicely completes on resume. Else it may have to wait for a ping restart during which time the GUI will show a green icon as if the tunnel is active while its not.

As neither the GUI nor OpenVPN cares about suspend/resume for any reason other than interaction with the driver, getting some signal from the driver when suspend happens is the "right approach". That is, only the driver responds to suspend and, as a side effect, openvpn.exe restarts and GUI notices that the state has changed to reconnecting. IMO, this is a good thing. Else, the GUI will have to trigger a restart on resume but it has no idea whether the driver is ready or not. We used to have numerous complaints about not gracefully resuming from sleep and the current situation seems to be working well.

Finally, why should openvpn[23].exe cause a fatal error on tap I/O errors? Either trying again or restarting after a short delay looks a much better response. In that sense my commit linked to above was not a workaround, but a better way of handling tap I/O errors.

cron2 commented 5 years ago

Hi,

On Wed, May 22, 2019 at 09:28:50AM -0700, Selva Nair wrote:

Whether the halt is required or not, I do not think openvpn.exe should trigger FATAL errors unless absolutely there is no way to continue.

Right. Backing out the commit to OpenVPN is not what we should do (because there is nothing gained by doing so).

A restart is often the right approach. Further, the current situation leads to a prompt restart of the connection on resume. A laptop that goes to sleep for, say, an hour definitely needs to restart the connection on resume as the server would have disconnected by then. In the current code the reconnect is triggered just before the suspend and nicely completes on resume. Else it may have to wait for a ping restart during which time the GUI will show a green icon as if the tunnel is active while its not.

With less aggressive ping-timeout on the server, proto UDP and floating, we might not even have to reconnect... just open the laptop, and continue using the tunnel...

So a bit more testing, and then maybe some better recommendations on the combination of ping-timeout and floating is needed.

As neither the GUI nor OpenVPN cares about suspend resume for any reason other than interaction with the driver, getting some signal from the driver when suspend happens is the "right approach". That is, only the driver responds to suspend and, as a side effect, openvpn.exe restarts and GUI notices that the state has changed to reconnecting. IMO, this is a good thing. Else, the GUI will have to trigger a restart on resume but it has no idea whether the driver ready or not. We used to have numerous complaints about not gracefully resuming from sleep and the current situation seems to be working well.

The complaints was really about the old driver which just went away behind our backs, triggering a FATAL, and the GUI-triggered reconnect was timing sensitive at that. What if we have a driver that "just stays around" without dieing, and we can keep all the interface config etc (--persist-tun), and just need to (eventually) reconnect the openvpn sessions? Should be much faster restart.

Will we get anything back from the server if the client still thinks there is a session while the server has expired? Or will the server just drop our data packets?

Finally, why should openvpn[23].exe cause a fatal error on tap I/O errors? Either trying again or restarting after a short delay looks a much better response. In that sense my commit linked to above was not a workaround, but a better way of handling tap I/O errors.

I'm good with that commit. I just wonder if we really need to restart always :-)

gert -- "If was one thing all people took for granted, was conviction that if you feed honest figures into a computer, honest figures come out. Never doubted it myself till I met a computer with a sense of humor." Robert A. Heinlein, The Moon is a Harsh Mistress

Gert Doering - Munich, Germany gert@greenie.muc.de

rozmansi commented 5 years ago

My 2 cents:

NDIS_MINIPORT_ATTRIBUTES_NO_HALT_ON_SUSPEND was introduced after the TAP-Windows6 initial release if I am correct. Maybe nobody had time to follow Microsoft's upgrade recommendations between NDIS versions.

A TAP-Windows6 device needs not stop on system sleep.

  1. It maintains no hardware needing to prepare for power-down and resume.
  2. It is just a software gateway: openvpn.exe <-> TAP-Windows6 device <-> Windows network stack

Handling resume after sleep is actually openvpn.exe's job. It runs the connection to the VPN server, that gets broken on sleeps longer than link timeout, right?

Imagine resume after sleep equal to temporary network blackout, or switching between WiFis.

lstipakov commented 5 years ago

I just tested openvpn 2.3.7 (which doesn't include Selva's fix) and latest 2.4.7 with this patched driver in tun mode and it works as expected - no "tun-abort" error after sleep/wakeup and VPN connection persists.

What comes to error fatality - at least in openvpn3 all tun i/o errors are not fatal except when we get ERROR_OPERATION_ABORTED (error 995). I have asked James why is that and he said that this was based on an old assumption that this kind of error is unrecoverable (I don't remember the exact phrasing). He also said that if restart works it would be fine to switch from fatal to non-fatal.

About restart on resume - I don't think that it is something that must be done unconditionally on every resume:

selvanair commented 5 years ago

If halt on suspend is not needed let's get rid of it.

Totally agree that unconditional restart of openvpn on resume is not appropriate. In fact we may need to do nothing at suspend/resume at all --- let keep-alive do its job triggering a soft restart only when necessary.

ERROR_OPERATION_ABORTED is received when the device shuts down which is a recoverable error if the shutdown is followed by a restart. For example, due to a limitation of our current "addtap" utility this happens when a new adapter is added. In this case 2.4 does a SIGHUP restart, 2.3 triggers a FATAL error.

mattock commented 4 years ago

Installed that includes this PR: https://build.openvpn.net/downloads/temp/tap-windows-9.24.2-I601.exe

mattock commented 4 years ago

Windows 10 / Server 2016 installer with PR #84 and #86:

mattock commented 4 years ago

@selvanair or somebody else: care to test the installer I linked to my latest comment? We are aiming for "2.4.8 Halloween release" (1 week from now) and would like to get this PR and #84 in there.

selvanair commented 4 years ago

@mattock its in my to do list, will do this weekend. I mentioned this before and it seems unsettled, so here goes again: how would this handle a suspend lasting longer than keep-alive? Currently tap halts and we trigger a SIGUSR1 restart which will complete in a few seconds after wake up. With this I suspect that the server will terminate the session but the client will take as long as ping-restart seconds to detect that and do a restart.

So without adding some suspend/resume handling in the core or the GUI, this is going to be bad for many current setups, I suspect..