Closed by jamilbk 1 month ago
Hmm, only clearing cookies fixes it. If I sign out and then back in, the issue persists.
@jamilbk do you have network logs (from the Chrome Inspector) saved by any chance?
I think those get reset - might be tricky. Not sure if this is helpful:
https://github.com/user-attachments/assets/27751439-4d00-4202-a431-be8853f55fcf
It looks like my session is stuck like this, so we can pair if you want.
Ah, I had the Internet resource enabled. Disabling it fixed the loop. Maybe something to do with the LB and sessions? Wild guess
Happens on both staging and prod, so it's unlikely to be related to a deploy; it seems more like something along the full route somehow. This is going through the Vultr Gateway.
You can keep them between reloads (enable "Preserve log" in DevTools):
Our web sessions are bound to the IP address; if it changes, you should be signed out automatically. Does connlib do anything tricky with proxying to the portal? Can it send the WebSocket over the tunnel and HTTP over the internet, or vice versa?
Was able to reproduce with internet resource. Looking into it.
The issue is that the WS connection is sent over IPv6 and HTTP over IPv4 + QUIC, so the portal sees two different IP addresses (and we bind the session to the IP).
One possible root cause is #6515.
More data points:
| Physical interface stack | Browser | Protocol chosen | Tunnel stack chosen | Outcome |
|---|---|---|---|---|
| IPv4 + IPv6 | Chrome | UDP/QUIC | IPv4 | Page reload loop |
| IPv4 | Chrome | TCP/HTTP | IPv6 | Page loads fine |
| IPv6 | Chrome | UDP/QUIC | IPv4 | Page reload loop |
| IPv4 + IPv6 | Firefox | TCP/HTTP | IPv6 | Page loads fine |
| IPv4 + IPv6 (no Firezone) | Chrome | TCP/HTTP | IPv6 | Page loads fine |
Strangely, it appears that when we have physical IPv6 connectivity on the WiFi interface, Chrome chooses the tunnel IPv4 address + QUIC to load the app.firez.one web page. The WebSocket is always TCP/TLS 1.3.
Firefox doesn't exhibit the issue because it always chooses TCP and IPv6.
@AndrewDryga Disabling QUIC in Chrome (via chrome://flags) fixes the issue.
I have a wild theory. Some cursory research suggests that Chrome will try to use QUIC if the site supports it (we do advertise it with this header: `alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000`) and Chrome thinks the connection will benefit from it.
Could it be possible that our use of the CGNAT space makes Chrome think we're on a cellular network, and thus prefer QUIC + IPv4 for connections? I wouldn't put it past the Chrome team to pull off such a stunt. Going to test by setting my tunnel interface to a different range.
EDIT: This isn't the case; upon further testing, QUIC works fine over either IPv4 or IPv6. The issue is the two different IPs.
Hmm, seeing as this can happen even without Firezone enabled and has nothing to do with QUIC, maybe we should fix the session management to handle the web page and WebSocket connecting from different IPs?
@jamilbk the only way to do that would be to stop checking for the session tokens :(
Hm, this seems like a Chrome bug then. I would imagine lots of sites that use WebSockets + sessions would suffer from this same issue?
The more I look into this issue, the more it looks like we might need to refactor our Token / auth logic to handle this case more gracefully. Currently the Token implementation assumes that the WebSocket connection comes from the same source IP as the web page load, since they both rely on the same token, right?
```elixir
def use(%Token{} = token, %Auth.Context{} = context) do
  token
  |> change()
  |> put_change(:last_seen_user_agent, context.user_agent)
  |> put_change(:last_seen_remote_ip, %Postgrex.INET{address: context.remote_ip})
  |> put_change(:last_seen_remote_ip_location_region, context.remote_ip_location_region)
  |> put_change(:last_seen_remote_ip_location_city, context.remote_ip_location_city)
  |> put_change(:last_seen_remote_ip_location_lat, context.remote_ip_location_lat)
  |> put_change(:last_seen_remote_ip_location_lon, context.remote_ip_location_lon)
  |> put_change(:last_seen_at, DateTime.utc_now())
  |> validate_required(~w[last_seen_user_agent last_seen_remote_ip]a)
end
```
Since we're using one context and one token, but two potentially different IPs, could we simply update the Auth context for the browser token to either remove the last_seen_remote_ip, or split it into two fields, last_seen_remote_ipv4 and last_seen_remote_ipv6?
The root of the issue seems to be that we assume the same connection parameters from the WebSocket as the page loads, which I think is valid for 99% of cases, but seems to be breaking down here.
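A rough sketch of what that split could look like in use/2 (the last_seen_remote_ipv4/last_seen_remote_ipv6 field names are hypothetical and would need a schema migration; this relies on Erlang representing IPv4 addresses as 4-tuples and IPv6 addresses as 8-tuples):

```elixir
# Hypothetical sketch: record the v4 and v6 source addresses in separate
# columns instead of overwriting a single last_seen_remote_ip per request.
def use(%Token{} = token, %Auth.Context{} = context) do
  ip_field =
    case tuple_size(context.remote_ip) do
      4 -> :last_seen_remote_ipv4
      8 -> :last_seen_remote_ipv6
    end

  token
  |> change()
  |> put_change(:last_seen_user_agent, context.user_agent)
  |> put_change(ip_field, %Postgrex.INET{address: context.remote_ip})
  |> put_change(:last_seen_at, DateTime.utc_now())
  |> validate_required(~w[last_seen_user_agent]a)
end
```

That way a QUIC/IPv4 page load and a TCP/IPv6 WebSocket would each update their own field rather than fighting over one.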
I think this check is what's causing the bug we're seeing here:
```elixir
defp maybe_enforce_token_context(
       %Tokens.Token{type: token_type} = token,
       %Context{type: context_type} = context
     )
     when token_type == :browser or context_type == :browser do
  cond do
    token.created_by_remote_ip.address != context.remote_ip -> {:error, :invalid_remote_ip}
    token.created_by_user_agent != context.user_agent -> {:error, :invalid_user_agent}
    true -> :ok
  end
end
```
I see this every now and then. It seems to happen on tabs where I'm logging into the portal during a deploy. When the VM goes down and the WebSocket is cut, it signs me out. When I sign in again it triggers this behavior.
https://github.com/user-attachments/assets/247e1d95-9cee-4978-a4b4-53bd3d164daa