Closed by jamilbk 1 month ago
Hmm, only clearing cookies fixes it. If I sign out and then back in, the issue persists.
@jamilbk do you have network logs (from the Chrome Inspector) saved by any chance?
I think those get reset - might be tricky. Not sure if this is helpful:
https://github.com/user-attachments/assets/27751439-4d00-4202-a431-be8853f55fcf
It looks like my session is stuck like this, so we can pair if you want.
Ah, I had the Internet resource enabled. Disabling it fixed the loop. Maybe something to do with the LB and sessions? Wild guess
Happens on both staging and prod, so it's unlikely to be related to a deploy; it seems more like something along the full route somehow. This is going through the Vultr Gateway.
You can keep them between reloads (enable "Preserve log" in DevTools):
Our web sessions are bound to the IP address; if it changes, you should be signed out automatically. Does connlib do anything tricky with proxying to the portal? Can it send the WebSocket over the tunnel and HTTP over the internet, or vice versa?
Was able to reproduce with internet resource. Looking into it.
The issue is that the WS connection is sent over IPv6 and HTTP over IPv4 + QUIC, so the portal sees two different IP addresses (and we bind the session to the IP).
One possible root cause is #6515.
More data points:
| Physical interface stack | Browser | Protocol chosen | Tunnel stack chosen | Outcome |
|---|---|---|---|---|
| IPv4 + IPv6 | Chrome | UDP/QUIC | IPv4 | Page reload loop |
| IPv4 | Chrome | TCP/HTTP | IPv6 | Page loads fine |
| IPv6 | Chrome | UDP/QUIC | IPv4 | Page reload loop |
| IPv4 + IPv6 | Firefox | TCP/HTTP | IPv6 | Page loads fine |
| IPv4 + IPv6 (no Firezone) | Chrome | TCP/HTTP | IPv6 | Page loads fine |
Strangely, it appears that when we have physical IPv6 connectivity on the WiFi interface, Chrome chooses the tunnel IPv4 address + QUIC to load the app.firez.one web page. The WebSocket is always TCP/TLS 1.3.
Firefox doesn't exhibit the issue because it always chooses TCP and IPv6.
@AndrewDryga Disabling QUIC in Chrome (via chrome://flags) fixes the issue.
I have a wild theory. Some cursory research suggests that Chrome will try to use QUIC if the site supports it (we do advertise it with this header: `alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000`) and Chrome thinks the connection will benefit from it.
Could it be possible that our use of the CGNAT space makes Chrome think we're on a cellular network, and thus prefer QUIC + IPv4 for connections? I wouldn't put it past the Chrome team to pull off such a stunt. Going to test by setting my tunnel interface to a different range.
EDIT: This isn't the case; upon further testing, QUIC works fine over either IPv4 or IPv6. The issue is the two different IPs.
Hmm, seeing as this can happen even without Firezone enabled and has nothing to do with QUIC, maybe we should fix the session management to handle the web page and WebSocket connecting from different IPs?
@jamilbk the only way to do that would be to stop checking for the session tokens :(
Hm, this seems like a Chrome bug then. I would imagine lots of sites that use WebSockets + sessions would suffer from this same issue?
The more I look into this issue, the more it looks like we might need to refactor our Token / auth logic to handle this case more gracefully. Currently the Token implementation assumes that the WebSocket connection comes from the same source IP as the web page load, since they both rely on the same token, right?
```elixir
def use(%Token{} = token, %Auth.Context{} = context) do
  token
  |> change()
  |> put_change(:last_seen_user_agent, context.user_agent)
  |> put_change(:last_seen_remote_ip, %Postgrex.INET{address: context.remote_ip})
  |> put_change(:last_seen_remote_ip_location_region, context.remote_ip_location_region)
  |> put_change(:last_seen_remote_ip_location_city, context.remote_ip_location_city)
  |> put_change(:last_seen_remote_ip_location_lat, context.remote_ip_location_lat)
  |> put_change(:last_seen_remote_ip_location_lon, context.remote_ip_location_lon)
  |> put_change(:last_seen_at, DateTime.utc_now())
  |> validate_required(~w[last_seen_user_agent last_seen_remote_ip]a)
end
```
Since we're using one context and one token, but two potentially different IPs, could we simply update the Auth context for the browser token to either remove the last_seen_remote_ip, or split it into two fields, last_seen_remote_ipv4 and last_seen_remote_ipv6?
The root of the issue seems to be that we assume the same connection parameters from the WebSocket as the page loads, which I think is valid for 99% of cases, but seems to be breaking down here.
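A rough sketch of what that split could look like in use/2 (the last_seen_remote_ipv4/last_seen_remote_ipv6 field names are hypothetical and would need a schema migration; this relies on Erlang representing IPv4 addresses as 4-tuples and IPv6 addresses as 8-tuples):

```elixir
# Hypothetical sketch: record the v4 and v6 source addresses in separate
# columns instead of overwriting a single last_seen_remote_ip per request.
def use(%Token{} = token, %Auth.Context{} = context) do
  ip_field =
    case tuple_size(context.remote_ip) do
      4 -> :last_seen_remote_ipv4
      8 -> :last_seen_remote_ipv6
    end

  token
  |> change()
  |> put_change(:last_seen_user_agent, context.user_agent)
  |> put_change(ip_field, %Postgrex.INET{address: context.remote_ip})
  |> put_change(:last_seen_at, DateTime.utc_now())
  |> validate_required(~w[last_seen_user_agent]a)
end
```

That way a QUIC/IPv4 page load and a TCP/IPv6 WebSocket would each update their own field rather than fighting over one.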
I think this check is what's causing the bug we're seeing here:
```elixir
defp maybe_enforce_token_context(
       %Tokens.Token{type: token_type} = token,
       %Context{type: context_type} = context
     )
     when token_type == :browser or context_type == :browser do
  cond do
    token.created_by_remote_ip.address != context.remote_ip -> {:error, :invalid_remote_ip}
    token.created_by_user_agent != context.user_agent -> {:error, :invalid_user_agent}
    true -> :ok
  end
end
```
I see this every now and then. It seems to happen on tabs where I'm logging into the portal during a deploy. When the VM goes down and the WebSocket is cut, it signs me out. When I sign in again it triggers this behavior.
https://github.com/user-attachments/assets/247e1d95-9cee-4978-a4b4-53bd3d164daa