cloudfoundry / gorouter

CF Router
Apache License 2.0

No persistent connection? #71

Closed smbyers closed 9 years ago

smbyers commented 9 years ago

I am running the hystrix dashboard in CF, which connects to applications (also deployed in CF) to consume a metrics EventStream. When the connection is made using the CF route, the response contains a "Connection: close" header, so there is no persistent connection and the dashboard continually reconnects. However, if I connect directly to the DEA IP and port for the application's warden container, the "Connection: close" header is not included in the response, the persistent connection is honored, and the hystrix dashboard correctly displays the metrics from the EventStream.

Is this a problem with the router or is there a limitation around support for EventStreams?

Thank you

cf-gitbot commented 9 years ago

We have created an issue in Pivotal Tracker to manage this. You can view the current status of your issue at: https://www.pivotaltracker.com/story/show/85797458.

luan commented 9 years ago

Hi @smbyers,

There is no known limitation on the router around support for EventStream (or any persistent HTTP connection, for that matter). One thing to keep in mind, though, is that the router (or the load balancer, depending on its configuration) will close idle connections after a timeout (the default on the router is 15 minutes).

Do you know whether your EventStream is constantly pushing data through, or whether it sits idle for longer than the timeout?
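For reference, a stream can stay under an idle timeout by emitting a small frame whenever it has nothing else to send. The sketch below is only an illustration of that idea, not how hystrix implements its ping: the function name `sse_with_heartbeat`, the queue-based design, and the `: keepalive` comment frame (a comment line in the Server-Sent Events format) are all assumptions made for the example.

```python
import queue
from typing import Iterator, Optional


def sse_with_heartbeat(
    events: "queue.Queue[Optional[str]]", idle_seconds: float
) -> Iterator[str]:
    """Yield SSE frames from a queue, filling idle gaps with a comment frame.

    Hypothetical helper: putting None on the queue ends the stream.
    """
    while True:
        try:
            # Wait for the next payload, but no longer than idle_seconds.
            payload = events.get(timeout=idle_seconds)
        except queue.Empty:
            # Nothing to send: emit an SSE comment so the connection is not idle.
            yield ": keepalive\n\n"
            continue
        if payload is None:
            return
        yield f"data: {payload}\n\n"
```

The value chosen for `idle_seconds` would need to be comfortably shorter than the shortest idle timeout on the path (router or load balancer).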

You can also configure the timeout (if you're deploying your own CF) by changing the request_timeout_in_seconds property here.

Thanks, @luan & @DanLavine, CF Runtime Team

smbyers commented 9 years ago

Hi @luan,

The particular EventStream I'm using is the hystrix metrics stream. When I connect to the stream through the CF-assigned route (browser, curl, or hystrix dashboard), I see the response come back with the "Connection: close" header. The data stream is active, based on the HTTP client debug logs on the hystrix dashboard. However, when I connect directly to the stream using the DEA IP and port of the warden container where the stream is running, I can see in the logs that a persistent connection is established, and it works great. The connection close happens very quickly after connecting, and even when the client reconnects, the response comes back with another "Connection: close" header. The hystrix stream also sends a "ping" even when no real metrics data is flowing, so it should never hit the 15 minute timeout scenario unless something is wrong with the app itself.
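To make the two cases easy to compare, here is a minimal sketch of the check being described: given the status line and headers of a response, decide whether the connection is expected to stay open. It follows the general HTTP/1.x rule (HTTP/1.1 is persistent unless "Connection: close" is present; HTTP/1.0 is persistent only with "Connection: keep-alive"). The function name `is_persistent` is made up for this example, and the parsing is deliberately simplified.

```python
def is_persistent(response_head: str) -> bool:
    """Return True if the response head indicates a persistent connection.

    response_head is the status line plus header lines, as raw text.
    Hypothetical helper for illustration; not a full HTTP parser.
    """
    lines = response_head.replace("\r\n", "\n").split("\n")
    # The protocol version is the first token of the status line.
    version = lines[0].split(" ", 1)[0].upper()

    tokens = set()
    for line in lines[1:]:
        if not line:
            break  # blank line ends the header block
        name, _, value = line.partition(":")
        if name.strip().lower() == "connection":
            # The Connection header carries a comma-separated token list.
            tokens.update(t.strip().lower() for t in value.split(","))

    if "close" in tokens:
        return False
    if version == "HTTP/1.0":
        return "keep-alive" in tokens
    return True
```

Feeding it the header lines captured through the CF route and then those captured directly from the container should give different answers if the behavior above is present.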

Thanks, @smbyers

jfmyers9 commented 9 years ago

Hi @smbyers,

Could you provide us with some more information surrounding your issue? More specifically:

With some more information we should be better able to troubleshoot this issue.

Thanks,

@jfmyers9 && @kdeshpan, CF Runtime Team

smbyers commented 9 years ago

Hi @jfmyers9,

Below is some of the additional information you requested and I would also like to share some new findings.

Thank you very much

scottfrederick commented 9 years ago

I created a trivial Java app to help re-create this issue: https://github.com/scottfrederick/hystrix-sample. Instructions for building and running the app are in the README. This app exhibits all the same behavior that @smbyers describes in his comments.

crhino commented 9 years ago

@smbyers,

From what I understand, your problem seems to be with the HAProxy not keeping the client-facing connection alive, which will cause hystrix to continuously try and reconnect.

We have tried to recreate your problem on Bosh-Lite, an OpenStack environment, and PWS but we weren't able to. More specifically, @scottfrederick's sample app performed correctly on all three environments.

Going through HAProxy and curl we saw the connection remaining active at the end of the request, e.g.

curl -vvv hello.10.244.0.34.xip.io
...
* Connection #0 to host hello.10.244.0.34.xip.io left intact
Hello, world!

To be able to understand more and hopefully reproduce the problem, it would be very helpful to have an idea of what your HAProxy configuration looks like.

Thanks!

@crhino && @kdeshpan, CF Runtime Team

crhino commented 9 years ago

@scottfrederick We dug a little more into the curl hanging, since we were able to reproduce on my laptop. We spun up a Wireshark capture and did the curl request, and what we saw was that we were actually getting the TCP PSH packets with the ping: data at the network level. Somehow it was not showing up on the terminal though. The only difference we could find between my pair's laptop, which was able to see the pings, and mine was that he had a different version of OSX than I (I am on 10.9.5).

That is as far as we decided to dig, if you find anything else please do let us know.

scottfrederick commented 9 years ago

@smbyers What OS and version are you testing on? We have tested with a few Windows, Linux and OSX versions. This worked on the Windows and Linux versions tested, and on OSX 10.9.4 and Yosemite. OSX 10.9.5 doesn't work for us.

benjchristensen commented 9 years ago

I believe we had an issue on many Macs where SSE would not work at all from curl, Safari, etc.

/cc @ccarey-netflix who was tracking this for us so he can provide background.

ghost commented 9 years ago

About half of the Macs on my team were/are having this issue. Unfortunately it's not happening on any of my Macs - maybe one of the pieces of software I have installed is suppressing the issue somehow. Little Snitch, or some other firewall setting..? If I could get it to repro here it would be helpful.

We saw the exact same thing that @crhino saw with a packet capture where the data was actually hitting the machine, but the curl command (we think) is buffering the data and not outputting it. Sometimes if you wait long enough, or enough data arrives, those with the issue would see something output.

The interesting thing is that those who have this issue with command-line curl also see the JavaScript (SSE) EventSource API (https://developer.mozilla.org/en-US/docs/Web/API/EventSource) fail to fire the onmessage callback in Chrome, Firefox, and Safari. So they are unable to watch an SSE sink with either curl or JavaScript. I'd like to find out what underlying library the EventSource API uses on Mac.
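For anyone trying to separate "bytes are not arriving" from "bytes arrive but are not surfaced", it may help to see what a client is expected to dispatch once lines do arrive. Below is a rough sketch of the Server-Sent Events framing rules: an event is dispatched on a blank line, only `data` and `event` fields are handled here, lines starting with `:` are comments, and unknown field names (such as the `ping:` line mentioned earlier in this thread) are ignored. The name `parse_sse` is hypothetical, and this is not the code any browser uses.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_type, data) pairs from an iterable of SSE text lines."""
    event_type = ""
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; dispatch only if data was seen.
            if data:
                yield (event_type or "message", "\n".join(data))
            event_type, data = "", []
            continue
        if line.startswith(":"):
            continue  # comment line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "data":
            data.append(value)
        elif field == "event":
            event_type = value
        # Other fields (id, retry, unknown names like "ping") are ignored here.
```

If a reader like this, fed from an unbuffered socket read, produces events on an affected machine while curl and the browser stay silent, that would point at buffering above the network layer.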

I put in a workaround for our app by running a Node server as a sidecar on our MantisUI servers and fire off the EventSource API from there. Then tunnel the results back to the browser over WebSockets. The same code that had trouble in the browser (for some) grabs the data fine from Node on Linux.

For those with the issue, it began with Mavericks 10.9 and persists with Yosemite.

Wish I had more information at this time. I'd love to put our heads together and get to the bottom of this issue.

smbyers commented 9 years ago

@scottfrederick I am running OSX 10.9.5. I am almost certain we also saw the problem with Chrome on one of our Windows PCs and maybe even with curl on CentOS (I would have to check with someone to validate CentOS, though). If we need to revert the haproxy configuration to run some additional test scenarios to help out, we can certainly do that.

crhino commented 9 years ago

@smbyers, we were able to reproduce your issue of a 'Connection: close' header being sent by the HAProxy.

We recently updated the HAProxy version in CF to 1.5.10 from 1.5-dev19. These two versions had different default connection modes: http-tunnel for 1.5-dev19 and http-keep-alive for 1.5.10. We were testing on 1.5.10 and thus saw no header. For more information, refer to this vcap-dev discussion. We downgraded to v195 and saw the header in the curl output.

CF v196 was just released today with this upgraded HAProxy version, so I would recommend upgrading your CF deployment to this new version.

Let us know if that works!

@crhino && @kdeshpan, CF Runtime Team

smbyers commented 9 years ago

@crhino, Thanks! I will look into upgrading (hopefully) within the next few days.

luan commented 9 years ago

I'm marking this as solved, please let us know if you run into any other issues.

Thanks, @luan & @sujoybasu, CF Runtime Team

smbyers commented 9 years ago

@luan / @crhino,

I upgraded my CF deployment to v196. I re-ran my tests and it does appear everything is functioning correctly now.

Thank you and everyone else for their assistance, Stephen