cloudfoundry / gorouter

CF Router
Apache License 2.0
441 stars 226 forks source link

gorouter does not check for successful backend websocket upgrade #208

Closed x6j8x closed 6 years ago

x6j8x commented 6 years ago

Issue

Our users were experiencing random 404s for otherwise working urls. (see https://cloudfoundry.slack.com/archives/C05586PBX/p1513360375000396) This issue only appeared intermittently.

We believe that the issue is caused because gorouter does not check for a successful WebSocket upgrade when connecting to a backend. Leading (in combination with persistent HTTP connections) to a tunnel where request that should have been handled by gorouter are blindly forwarded to the backend that didn't establish / upgrade to a WebSocket connection.

Context

Analyzing this issue over several weeks now lead us to this (most likely) scenario as cause for the issue.

Involved components:

[1 Client] -> [2 AWS ALB] -> [3 GoRouter] -> [4 Loggregator] -> [5 cloud_controller ]

All incoming connections (plain http / websocket) share the same ALB listener (port 443).

Sequence of events:

Now we basically have established a "tunnel" from the ALB (2) straight to Loggregator (4)

Now, if another request comes in and ALB happens to pick C1 from its pool of connection to gorouter, the request hit loggregator, which (rightfully) returns a 404. see L4

Steps to Reproduce

Up until now we didn't manage to reproduce the issue willingly.

Expected result

If the WebSocket upgrade fails on the backend, gorouter must not establish a plain TCP connection (tunnel).

Possible Fix

Only upgrade to a plain TCP connection of gorouter can verify that the WebSocket upgrade was successful at the backend.

Logs:

L1 - ALB access log

1..._elasticloadbalancing_eu-west-1_app.dilab-app-prod-apps-alb.cfc9a3d27629f775_20180130T0805Z_54.x.x.151_3ma3cl4s.log.gz:https 2018-01-30T08:04:07.050749Z app/dilab-app-prod-apps-alb/cfc9a3d27629f775 217.x.x.121:49974 10.x.x.75:80 0.000 0.023 0.000 404 404 1912 147 "GET https://doppler.sys.<cf_domain>:443/apps/5a507deb-48d1-4d14-ad51-d81493964100/stream HTTP/1.1" "Go-http-client/1.1" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-1:1...:targetgroup/dilab-app-prod-apps-http/3d151219c9a4b491 "Root=1-5a7026f7-6906254f2ff11eca65469161" "doppler.sys.<cf_domain>" "arn:aws:acm:eu-west-1:1....:certificate/6d28448a-c4d5-4165-a5e0-bd6d239ad4f7"

L2 - gorouter log

10.x.x.75 - gorouter [2018-01-30 08:04:07+0000]
{
    "log_level": 1,
    "timestamp": 1517299447.026743,
    "message": "handling-websocket-request",
    "source": "vcap.gorouter.proxy.request-handler",
    "data": {
        "RemoteAddr": "10.x.x.40:63175",
        "Host": "doppler.sys.<cf_domain>",
        "Path": "/apps/5a507deb-48d1-4d14-ad51-d81493964100/stream",
        "X-Forwarded-For": [
            "217.x.x.121"
        ],
        "X-Forwarded-Proto": [
            "https"
        ],
        "Upgrade": "websocket"
    }
}
[instance@47450 director="" deployment="cloudfoundry" group="router" az="z1" id="80484006-1f24-4e07-a956-25f13c8e9070"]

L3 - Loggregator log access 404

10.x.x.23 - loggregator_trafficcontroller CEF:0|cloud_foundry|loggregator_trafficcontroller|1.0|GET /apps/5a507deb-48d1-4d14-ad51-d81493964100/stream|GET /apps/5a507deb-48d1-4d14-ad51-d81493964100/stream|0|rt=1517299447972 cs1Label=userAuthenticationMechanism cs1=oauth-access-token cs2Label=vcapRequestId cs2=0842815b-12d8-454b-4e47-2d9d28738f08 request=/apps/5a507deb-48d1-4d14-ad51-d81493964100/stream requestMethod=GET src=217.x.x.121 spt= dst=10.x.x.23 dpt=8081 [instance@47450 director="" deployment="cloudfoundry" group="log-api" az="z2" id="ee719e00-09fb-4900-bbc2-d4f49e10922d"]
10.x.x.23 - vcap.trafficcontroller 2018/01/30 08:04:07 Non 200 response from CC API: 404 for 5a507deb-48d1-4d14-ad51-d81493964100 [instance@47450 director="" deployment="cloudfoundry" group="log-api" az="z2" id="ee719e00-09fb-4900-bbc2-d4f49e10922d"]

L4 - Follow up request that get's tunneled to loggregator and results in a 404

wireshark_screenshot

Notice the ALB source port (63175) it's the same as in the WebSocket upgrade request in L2

ASCII extracted from pcap:

GET /client-identity/v1/clients/a8...@apps.../services/customer-profile HTTP/1.1
X-Forwarded-For: 52.x.x.5
X-Forwarded-Proto: https
X-Forwarded-Port: 443
Host: identity.vwgroup.io
X-Amzn-Trace-Id: Root=1-5a702a...
Accept: application/json, application/*+json
X-NewRelic-ID: VQAOW...
X-NewRelic-Transaction: PxQHAgBQXAVRXAIGDg...
User-Agent: Java/1.8.0_162

HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 30 Jan 2018 08:17:49 GMT
Content-Length: 19
cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/154882395

The labels on this github issue will be updated when the story is started.

shubhaat commented 6 years ago

(Shannon here)

@zrob @ahevenor

What do you make of this? I don't know why Trafficcontroller calls out to CC when Gorouter routes a websocket request it, but it seems to me Trafficcontroller shouldn't be passing HTTP responses it receives from CC back through to the client. If Trafficcontroller has some dependency on CC, either it should not accept the websocket connection from Gorouter if it can't satisfy this dependency, or it should disconnect the websocket connection if the call to CC fails, or it should mask the responses from CC. All Gorouter should know is whether Trafficcontroller accepts the websocket request or not.

x6j8x commented 6 years ago

@shalako Whether or not Loggregator is doing the right thing here (I think it does), gorouter should make sure that no WebSocket backend can basically "hijack" traffic by "misbehaving".

And to ensure this gorouter should validate if the backend successfully accepted the WebSocket connection or not (before switching to plain tcp mode).

shalako commented 6 years ago

Sascha, in your testing have you found that Loggregator does accept the websocket connection? If not, that could be something we can remedy. If so, how should gorouter recognize "misbehaving"? Once a TCP connection is established, gorouter should not be aware of anything above L4, right?

Shannon Coen Product Manager, Cloud Foundry Pivotal, Inc.

On Fri, Feb 2, 2018 at 10:21 AM, Sascha Matzke notifications@github.com wrote:

@shalako https://github.com/shalako Whether or not Loggregator is doing the right thing here, I think gorouter should make sure that no WebSocket backend can basically "hijack" traffic by "misbehaving".

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cloudfoundry/gorouter/issues/208#issuecomment-362664247, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJjmI0TPmOkE_GdvPevRPJ8K5YA05Ygks5tQ1JEgaJpZM4R3ga4 .

x6j8x commented 6 years ago

Loggregator accepts the TCP connection but never successfully completes the WebSocket Handshake (by returning a HTTP 101 Switching Protocols response).

I‘ve put misbehaving in quotes because this is actually 100% valid behavior. If a resource doesn‘t exist the backend must answer with a 404, because before the WebSocket handshake is complete, we‘re still in plain HTTP territory.

Gorouter must not establish a plain tcp bridge between client and backend before it saw a HTTP 101 response from the backend (only then is the WebSocket handshake complete and a switch to TCP safe).

Khoulaiz commented 6 years ago

@shalako I think the RFC 6455 is very clear in this:

The handshake from the server is much simpler than the client handshake. The first line is an HTTP Status-Line, with the status code 101:

   HTTP/1.1 101 Switching Protocols

Any status code other than 101 indicates that the WebSocket handshake has not completed and that the semantics of HTTP still apply. The headers follow the status code.

According to the last paragraph, gorouter is clearly not in TCP area until it sees an server acknowledgement. Since this acknowledgement never reaches gorouter in the case of the issue, the semantics of HTTP still apply .

Please come up with a fix ASAP. The possibility that gorouter sends requests to the wrong apps is pretty scary....