x6j8x closed this issue 6 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/154882395
The labels on this github issue will be updated when the story is started.
(Shannon here)
@zrob @ahevenor
What do you make of this? I don't know why Trafficcontroller calls out to CC when Gorouter routes a websocket request to it, but it seems to me Trafficcontroller shouldn't be passing HTTP responses it receives from CC back through to the client. If Trafficcontroller has some dependency on CC, it should either refuse the websocket connection from Gorouter when it can't satisfy that dependency, disconnect the websocket connection if the call to CC fails, or mask the responses from CC. All Gorouter should know is whether Trafficcontroller accepts the websocket request or not.
@shalako Whether or not Loggregator is doing the right thing here (I think it does), gorouter should make sure that no WebSocket backend can basically "hijack" traffic by "misbehaving".
And to ensure this, gorouter should validate whether the backend successfully accepted the WebSocket connection (before switching to plain TCP mode).
Sascha, in your testing have you found that Loggregator does accept the websocket connection? If not, that could be something we can remedy. If so, how should gorouter recognize "misbehaving"? Once a TCP connection is established, gorouter should not be aware of anything above L4, right?
Shannon Coen Product Manager, Cloud Foundry Pivotal, Inc.
Loggregator accepts the TCP connection but never successfully completes the WebSocket handshake (by returning an HTTP 101 Switching Protocols response).
I've put "misbehaving" in quotes because this is actually 100% valid behavior. If a resource doesn't exist, the backend must answer with a 404, because before the WebSocket handshake is complete we're still in plain HTTP territory.
Gorouter must not establish a plain TCP bridge between client and backend before it has seen an HTTP 101 response from the backend (only then is the WebSocket handshake complete and a switch to TCP safe).
@shalako I think RFC 6455 is very clear on this:
The handshake from the server is much simpler than the client handshake. The first line is an HTTP Status-Line, with the status code 101:
HTTP/1.1 101 Switching Protocols
Any status code other than 101 indicates that the WebSocket handshake has not completed and that the semantics of HTTP still apply. The headers follow the status code.
According to the last paragraph, gorouter is clearly not in TCP territory until it sees the server's acknowledgement. Since this acknowledgement never reaches gorouter in the case of this issue, the semantics of HTTP still apply.
Please come up with a fix ASAP. The possibility that gorouter sends requests to the wrong apps is pretty scary....
Issue
Our users were experiencing random 404s for otherwise working URLs (see https://cloudfoundry.slack.com/archives/C05586PBX/p1513360375000396). The issue only appeared intermittently.
We believe that the issue is caused because gorouter does not check for a successful WebSocket upgrade when connecting to a backend. In combination with persistent HTTP connections, this leads to a tunnel where requests that should have been handled by gorouter are blindly forwarded to a backend that never established / upgraded to a WebSocket connection.
Context
Analyzing this issue over several weeks has led us to the following scenario as the most likely cause.
Involved components:
[1 Client] -> [2 AWS ALB] -> [3 GoRouter] -> [4 Loggregator] -> [5 cloud_controller ]
All incoming connections (plain http / websocket) share the same ALB listener (port 443).
Sequence of events:
Non 200 response from CC API: 404 for 5a507deb-48d1-4d14-ad51-d81493964100
and returns this 404 to the client - see L3 (https://github.com/cloudfoundry/loggregator/blob/9ed548bc30d69c1639b8bfdd676f4cc10c353677/trafficcontroller/internal/proxy/log_access_middleware.go#L31).
Now we basically have established a "tunnel" from the ALB (2) straight to Loggregator (4).
Now, if another request comes in and the ALB happens to pick C1 from its pool of connections to gorouter, the request hits Loggregator, which (rightfully) returns a 404. See L4.
Steps to Reproduce
So far we have not managed to reproduce the issue deliberately.
Expected result
If the WebSocket upgrade fails on the backend, gorouter must not establish a plain TCP connection (tunnel).
Possible Fix
Only upgrade to a plain TCP connection if gorouter can verify that the WebSocket upgrade was successful at the backend.
Logs:
L1 - ALB access log
L2 - gorouter log
L3 - Loggregator log access 404
L4 - Follow-up request that gets tunneled to Loggregator and results in a 404
Notice the ALB source port (63175): it's the same as in the WebSocket upgrade request in L2.
ASCII extracted from pcap: