1.0.2 + SSL breaks everything

danielniccoli commented 11 years ago

Original author: hcarvalh...@gmail.com (June 17, 2010 07:24:47)

What steps will reproduce the problem?

Update to 1.0.2-1
Configure a vServer with FCGI and Media serving under SSL
Browse around

What is the expected output? Expected things to work.

What do you see instead? Random timeouts, partial content transfers, and cherokee-worker using 100% CPU.

What version of the product are you using? On what operating system? Cherokee 1.0.2-1~karmic~ppa / Ubuntu Karmic 64bit

Please provide any additional information below. Site was working perfectly with 1.0, the same config file.

Changing keep-alive or server timeout just leads to different rates of breakage.

NOT related to the FCGI source or to the media being served; both work fine with other httpds (apache, lighty).

Original issue: http://code.google.com/p/cherokee/issues/detail?id=909

danielniccoli commented 11 years ago

From hcarvalh...@gmail.com on June 17, 2010 07:45:50 For the record, ldconfig output:

libssl.so.0.9.8 -> libssl.so.0.9.8
libgnutls-openssl.so.26 -> libgnutls-openssl.so.26.14.10

danielniccoli commented 11 years ago

From hcarvalh...@gmail.com on June 17, 2010 15:56:52 Still getting inconsistent SSL behavior across my user base's browsers. Some requests simply refuse to transfer any content (blank in the browser), and some versions of Safari even crash when SSL is accessed.

What's going on with this release? Anyone with related problems?

danielniccoli commented 11 years ago

From ste...@konink.de on June 17, 2010 18:24:44 You are not the only one but pinpointing the exact cause is currently problematic.

http://code.google.com/p/cherokee/issues/detail?id=594 (Github: #575)

danielniccoli commented 11 years ago

From alobbs on June 20, 2010 12:58:47 http://svn.cherokee-project.com/changeset/5210 should fix part of the issue.

danielniccoli commented 11 years ago

From hcarvalh...@gmail.com on June 20, 2010 21:52:03 Looking forward to 1.0.3 to test this. For now rolling back to 0.9x. Thanks for the feedback

danielniccoli commented 11 years ago

From lnu...@gmail.com on June 21, 2010 14:45:50 Just tested the latest svn tarball cherokee-1.0.3b5215 and still hangs on SSL

danielniccoli commented 11 years ago

From alobbs on June 21, 2010 14:54:34 Leonel, are you sure of that? Doing what?

danielniccoli commented 11 years ago

From lnu...@gmail.com on June 21, 2010 17:37:28 Downloaded the latest svn tarball, compiled and built it, and tested it with a self-signed cert by trying to access https://localhost/

The browser just sits there waiting ..

danielniccoli commented 11 years ago

From lnu...@gmail.com on June 21, 2010 17:45:46 If I open https://localhost/ it responds OK the first 2 - 5 times, then stops responding.

If I run ab -c 10 -n 100 https://localhost/ the browser stops responding. If I run ab -c 100 -n 1000 http://localhost/ the browser works fine.

danielniccoli commented 11 years ago

From hcarvalh...@gmail.com on June 21, 2010 20:28:47 Thanks lnunez, testing with ab maybe shows the problem for me:

If we get many concurrent requests, cherokee-worker tops 100% cpu and freezes. Looks like it's somehow related to SSL.
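The concurrent-load reproduction described above can be approximated without ab. The following is a minimal sketch, assuming a test server with a self-signed certificate (so verification is turned off); the names `fetch` and `hammer` are illustrative and not part of Cherokee or ab.

```python
import ssl
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10.0):
    """Fetch url once and return (ok, detail) instead of raising."""
    # Assumption: the server uses a self-signed test cert, so skip verification.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return True, len(resp.read())
    except Exception as exc:  # timeouts, resets, truncated reads, TLS errors
        return False, repr(exc)


def hammer(url, total=100, concurrency=10, fetcher=fetch):
    """Issue `total` requests using `concurrency` workers.

    Returns (ok_count, failures), where failures holds one detail per
    failed request.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: fetcher(url), range(total)))
    failures = [detail for ok, detail in results if not ok]
    return total - len(failures), failures
```

Calling `hammer("https://localhost/", total=100, concurrency=10)` roughly mirrors `ab -c 10 -n 100 https://localhost/`; a hang on the server side would show up here as timeout entries in the failure list.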

danielniccoli commented 11 years ago

From hcarvalh...@gmail.com on June 23, 2010 01:23:55 Updated to 1.0.4 today and couldn't reproduce easily with Firefox as before. I'll test on production for the next day with real-world load (~10 req/s) and tracing enabled. I'll report the result if we have any problems, otherwise I guess this one can be marked as solved ;)

danielniccoli commented 11 years ago

From ste...@konink.de on June 23, 2010 01:34:28 I have benchmarked it today, but we should get about 100x more requests through than we do in this release. Anyway, it is way better than before. So it's still not off the radar.

danielniccoli commented 11 years ago

From lnu...@gmail.com on June 23, 2010 02:38:25 Still on 1.0.4, I can reload the https://localhost/ default page only 2 times before the server closes the connection

danielniccoli commented 11 years ago

From lnu...@gmail.com on June 23, 2010 02:41:17 I can't put this version into production. I need https, and this bug makes me hold off on the upgrade.

danielniccoli commented 11 years ago

From hcarvalh...@gmail.com on June 23, 2010 22:57:38 Still having cherokee-worker hanging at 100% CPU sporadically when accessing SSL with 1.0.4

danielniccoli commented 11 years ago

From hcarvalh...@gmail.com on June 23, 2010 23:00:26 Serving 2 SSL requests in a row makes cherokee-worker hang and consume all CPU. It's easy to reproduce the bug by reloading the browser twice on an SSL page.

danielniccoli commented 11 years ago

From hcarvalh...@gmail.com on June 23, 2010 23:06:40 Also, I noticed having keep-alive disabled renders SSL unusable, the server just drops the connection. With it enabled, SSL will work sometimes, and sometimes will hang cherokee-worker.
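The "two requests in a row with keep-alive" case can be exercised outside a browser by reusing a single connection. This is a minimal sketch using Python's standard `http.client`, again assuming a self-signed test certificate; the function name and the optional `conn` parameter (which lets a prepared connection object be passed in) are illustrative choices.

```python
import http.client
import ssl


def requests_on_one_connection(host, paths, port=443, timeout=10.0, conn=None):
    """Send a GET for each path over one keep-alive connection.

    Returns a list of (path, status, info) tuples. On success, info is the
    body length; on failure, status is None and info is the error text.
    """
    if conn is None:
        ctx = ssl.create_default_context()
        ctx.check_hostname = False      # assumption: self-signed test cert
        ctx.verify_mode = ssl.CERT_NONE
        conn = http.client.HTTPSConnection(host, port, timeout=timeout, context=ctx)
    results = []
    try:
        for path in paths:
            try:
                conn.request("GET", path, headers={"Connection": "keep-alive"})
                resp = conn.getresponse()
                body = resp.read()      # drain the body before reusing the connection
                results.append((path, resp.status, len(body)))
            except (OSError, http.client.HTTPException) as exc:
                results.append((path, None, repr(exc)))
                break                   # the connection is unusable after a failure
    finally:
        conn.close()
    return results
```

For example, `requests_on_one_connection("localhost", ["/", "/"])` sends the second request on the same TLS connection as the first, which is the pattern reported here as triggering the hang.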

danielniccoli commented 11 years ago

From lnu...@gmail.com on June 24, 2010 00:21:41 This is why I took the time yesterday to build the PPA packages. But I read on IRC that the SSL bugs were gone.

For me, 1.0.4 with SSL still drops connections after the third page reload, both with the Ubuntu packages from the PPA and with a build from the tar.gz.

danielniccoli commented 11 years ago

From alobbs on June 24, 2010 08:02:58 The issue is partially fixed in trunk now. I have managed to get Cherokee to work fine with Chrome, although for some reason it's still misbehaving while serving content to Firefox.

danielniccoli commented 11 years ago

From hcarvalh...@gmail.com on June 25, 2010 20:42:19 Anyone working on this bug? Tried a build from latest source and still having the same issue (hanging cherokee-worker), can always reproduce it.

Built with tracing enabled. Are the trace messages of any utility?

Anything else that can be done from my side to help?

danielniccoli commented 11 years ago

From davisd.davisd@gmail.com on June 30, 2010 21:19:10 I'm having the same problem with 1.0.4, 1.0.3, and 1.0.2

Oddly, Chrome browser works fine... Using firefox causes problems.

1.0.1 works fine, I'll be running that until this is fixed.

-David

danielniccoli commented 11 years ago

From davisd.davisd@gmail.com on June 30, 2010 22:55:33 I should note that with 1.0.1, I'm getting periodic (Error code: sec_error_bad_signature) as in http://code.google.com/p/cherokee/issues/detail?id=594 (Github: #575) and I've got to restart cherokee.

I've had SSL problems since I started using Cherokee way back with 0.99.42 in February...

I've duplicated both problems on several Ubuntu servers (9.10, 10.04) and several Arch Linux servers... I've run different versions of cherokee, openssl, different machines, different virtual machines, different linux distributions.

I assume this bug 909 is directly related to the fixes for bug 594 ?

danielniccoli commented 11 years ago

From alobbs on July 01, 2010 05:57:53 David, I see the same behavior at my end. Chrome and Opera work alright, but FF fails when a connection is unexpectedly closed.

I've been struggling to find a consistent way to reproduce the issue. If you find some, please let me know. It'd be of great help.

About bug 594. I guess we fixed it along the way - at the same time that we introduced the regression we are currently talking about.

danielniccoli commented 11 years ago

From hcarvalh...@gmail.com on July 01, 2010 06:21:57 @alobbs

I cannot always reproduce Firefox reporting sec_error_bad_signature, but I can always reproduce the closed connections and cherokee-worker crashing.

I'm configuring cherokee for both http and https with the default keep-alive settings (enabled, default timeouts). Making 2 successive requests from Firefox will then always hang cherokee-worker at 100% CPU and start timing out connections. After that, sometimes Firefox will report sec_error_bad_sig, sometimes it will load everything, and sometimes it will truncate the response. I guess it depends on the state in which cherokee-worker crashes.

Try with an HTML document linking to many stylesheets and images, with everything being served thru SSL, instead of just a blank page. The behavior may be different with more concurrent requests.

danielniccoli commented 11 years ago

From hcarvalh...@gmail.com on July 01, 2010 06:34:19 @alobbs

Forgot to say that the behavior from my last comment is with 1.0.4 and SVN.

danielniccoli commented 11 years ago

From hcarvalh...@gmail.com on July 23, 2010 22:30:44 Did 1.0.5 fix this one?

danielniccoli commented 11 years ago

From alobbs on July 27, 2010 09:14:12 I'm afraid it did not.

danielniccoli commented 11 years ago

From prudhvik...@gmail.com on August 05, 2010 21:28:59 Is this issue fixed in 1.0.6?

danielniccoli commented 11 years ago

From lukasz.k...@gmail.com on August 10, 2010 06:05:02 1.0.6 still has problems on FF; IE 8 works fine.

danielniccoli commented 11 years ago

From prudhvik...@gmail.com on August 10, 2010 16:45:19 This is a major blocker for us to start running cherokee in production. Is this fixed in 1.0.7? Where can I find changelogs?

danielniccoli commented 11 years ago

From alobbs on August 10, 2010 16:47:59 We are currently working on it. A few hours ago a related patch made it to trunk, although the problem isn't fully solved yet: http://svn.cherokee-project.com/changeset/5363

danielniccoli commented 11 years ago

From lnu...@gmail.com on August 11, 2010 00:20:41 @prudhvikrishna The Ubuntu packages on Launchpad now have 2 repositories. The current PPA now has cherokee 1.0.1, which works perfectly with SSL,

and there is a new PPA repo named i-tse.

You can read all about it here: http://lists.octality.com/pipermail/cherokee/2010-August/013274.html

So I recommend you use 1.0.1 in production, as I do. Once the SSL bug gets fixed, this 1.0.1 will be upgraded to the version that fixes the problem.

This is why there are 2 PPA repos ;)

Saludos

danielniccoli commented 11 years ago

From go.on....@googlemail.com on August 11, 2010 02:20:26 @alobbs Do you know where this issue comes from? Or do you need more information about it or any help with this? My problem is that our site starts in a few days and I would like to provide an SSL cert on it. But with version 1.0.4 it's impossible. Version 0.74 (I think) in the Debian repos does not support webm streaming, and compiling 1.0.1 is impossible for me because of some ffmpeg problems that are somehow unsolvable on Lenny.

So far I'm really happy with Cherokee and the performance tests are great, but this (and the lack of additional header support) is a real problem for our production environment. If you say the problem might be solved in a few days or weeks, we will wait for it and use the time to optimize our webpage. If not, I might need to take a look at lighty. And I really don't want to do that, as I think cherokee is much better.

danielniccoli commented 11 years ago

From alobbs on August 11, 2010 06:45:19 @go.on.joe: I'm still investigating the issue. Hopefully it'll be fixed soon, although I couldn't tell you for sure when that will be. It has already taken much more time than I'd have expected.

danielniccoli commented 11 years ago

From alobbs on August 11, 2010 09:42:56 I believe the issue has been fixed up. Could you guys please give r5368 (or later) a try?

danielniccoli commented 11 years ago

From lnu...@gmail.com on August 11, 2010 10:40:02 Tested r5369

first set of tests and .... https on firefox it's working! \o/ YES !! I'll do more testing later

Thank you

danielniccoli commented 11 years ago

From hcarvalh...@gmail.com on August 11, 2010 18:08:40 That's good news. When I have time, I'll try this SVN rev on our staging server with all major browsers.

danielniccoli commented 11 years ago

From go.on....@googlemail.com on August 11, 2010 22:36:42 Ok, I recompiled X264, ffmpeg with shared libs and managed to compile cherokee. The firefox issue seems to be gone, but now I got another problem. I can output files with php up to about 20k - everything above just shows a blank file with no error message. :-(

Big static files work fine and the same php files without ssl also. Any idea, what the problem might be?

danielniccoli commented 11 years ago

From ste...@konink.de on August 11, 2010 22:48:51 @go.on.joe

Please log a separate bug for this.

danielniccoli commented 11 years ago

From go.on....@googlemail.com on August 11, 2010 22:51:23 Chrome shows "Error 100 (net::ERR_CONNECTION_CLOSED): Unknown Error" after a few reloads

Firefox shows a blank page nearly every time.

Opera 10.6 works fine most of the time, but sometimes cuts off parts of the file

danielniccoli commented 11 years ago

From ste...@konink.de on August 11, 2010 22:56:37 @go.on.joe

Please open another bug if this does not refer to the SSL bug. Are you using 1.0.8?

danielniccoli commented 11 years ago

From davisd.davisd@gmail.com on August 15, 2010 00:02:03 I upgraded two servers to 1.0.8 today and so far so good! Thanks for all of the hard work! I'll post back if there are problems.

danielniccoli commented 11 years ago

From alobbs on August 15, 2010 07:27:27 Sweet. Thanks for the feedback @davisd.davisd!

danielniccoli commented 11 years ago

From skar...@gmail.com on August 15, 2010 15:55:18 Only to say: Congratulations @alobbs

Tough bug fix... ;)

danielniccoli commented 11 years ago

From kallist...@gmail.com on September 24, 2010 15:25:28 uh-oh.. I just started seeing this in 1.0.8:

Chrome shows "Error 100 (net::ERR_CONNECTION_CLOSED): Unknown Error" after a few reloads

Firefox shows a blank page nearly every time.

danielniccoli commented 11 years ago

From woll...@gmail.com on October 16, 2010 00:31:26 We have been using Cherokee (1.0.8) since the start of the term on a quite busy Moodle installation (eLearning) for our university, and at first we were quite happy about the enhanced smoothness... But we are using HTTPS for authentication, and we also encounter this bug. We ran some experiments and analysis of our own to discover its nature, so perhaps I can contribute some useful information and get some help (?)

1) The bug appears here with SSL/fcgi-php5/chunked-encoding/content-compression/keep-alive. Static content gets delivered. PHP content isn't delivered reliably. If just a small amount of data is sent, the delivery probability is high. With just a few bytes exchanged it's nearly 99%, but it drops to maybe 70% for larger outputs.

2) I'm not sure that the bug is really browser dependent. It's just that different browsers act differently when data that was once accessible is no longer available; some just present a cached version instead. The bug is easily detectable in our server logs (e.g. zero bytes delivered successfully; with larger files, sometimes only a truncated response) and it seems every browser type randomly gets its share of errors. In our case we were getting an unusually high rate of complaints about faulty logins from our students.

3) Symptoms in Cherokee's error log: some of these problems are visible in the error log, mostly as lines like this:

[15/10/2010 16:07:10.508] (error) fdpoll-epoll.c:140 - epoll_ctl: ep_fd 18, fd 107: 'No such file or directory'

The timestamp corresponds directly to a login failure, and each such error then results in one defunct thread (or connection?). If that happens too often in a short time, the server can't deliver any PHP-generated content for a while, not even via plain HTTP, and shows "Gateway 504" (timeout) messages.
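To correlate these errors with login failures, the error log can be bucketed by minute. The sketch below assumes the line layout of the single example quoted above (the leading bracket is treated as optional); the regular expression and the function name are illustrative, not a documented Cherokee log format.

```python
import re
from collections import Counter

# Pattern derived from the one error line quoted in this comment (an assumption).
EPOLL_ERROR = re.compile(
    r"^\[?(?P<date>\d{2}/\d{2}/\d{4}) (?P<hhmm>\d{2}:\d{2}):\d{2}\.\d+\] "
    r"\(error\) .*epoll_ctl: ep_fd (?P<ep_fd>\d+), fd (?P<fd>\d+)"
)


def epoll_errors_per_minute(lines):
    """Count epoll_ctl error lines per 'DD/MM/YYYY HH:MM' bucket."""
    counts = Counter()
    for line in lines:
        match = EPOLL_ERROR.search(line.strip())
        if match:
            counts[f"{match['date']} {match['hhmm']}"] += 1
    return counts
```

Passing an open log file to `epoll_errors_per_minute` yields a `Counter` whose busiest minutes can then be compared with the times of reported login failures.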

The total blackout obviously happens when lots of students are logging in. But it's not a really heavy load situation otherwise: plenty of free processor power, RAM and idling php-fcgis, especially after a while ;¬)

I didn't want to file a new bug, because this bug report seems to address the same problem, as does Issue 954 (Github: #707). (I'm not sure if it suggests disabling "chunked encoding" globally just to use SSL?)

By the way: the server is running on OpenSuse 11.3 (x86_64) with kernel 2.6.34.7, Cherokee 1.0.8-10.2, php 5.3.3-0.1.2.

danielniccoli commented 11 years ago

From ste...@konink.de on October 16, 2010 00:52:35 @wollatz how many concurrent php-cgi clients are you running? Is it possible that all your clients are saturated?

danielniccoli commented 11 years ago

From woll...@gmail.com on October 16, 2010 13:26:51 @konink.de Big thanks for your answer! I'm surprised about the swiftness.

12 concurrent php-cgi clients ... and yes, saturation was the case, due to an overloaded authentication server. It wasn't directly the SSL problem, which took all the blame but is just loosely related: the new students tried different passwords at random after their real passwords got them a white page, which triggered a delay feature of the password server... ;o(

I've just tested the SSL error again, and it always shows up with a delivered HTTP header and no timeout, but also no (full) content. So in the access.log it shows up as:

xxx.xxx.116.62 - - [16/Oct/2010:13:24:02 +0200] "GET /css.php HTTP/1.1" 200 360 (..)

instead of the correct:

xxx.xxx.116.62 - - [16/Oct/2010:13:23:04 +0200] "GET /css.php HTTP/1.1" 200 14946 (..)

Checked via liveHTTP: at least the delivered header seems identical in both cases. I've checked that error with different browsers; just the messages seem to differ, but some try to cope by presenting an older cached version of the data.
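Truncated deliveries like the 360-byte response above can be picked out of the access log by comparing sizes per URL. This is a rough sketch, assuming the log layout of the two lines quoted in this comment; the 0.5 ratio is an arbitrary illustrative threshold, and pages whose size legitimately varies a lot would produce false positives.

```python
import re
from collections import defaultdict

# Pattern derived from the two access-log lines quoted in this comment.
ACCESS_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)


def suspicious_responses(lines, ratio=0.5):
    """Flag 200 responses much smaller than the largest one for the same path.

    Returns a list of (path, time, size, largest_size) tuples.
    """
    by_path = defaultdict(list)
    for line in lines:
        match = ACCESS_LINE.search(line)
        if match and match["status"] == "200":
            by_path[match["path"]].append((match["time"], int(match["size"])))
    flagged = []
    for path, hits in by_path.items():
        largest = max(size for _, size in hits)
        for time, size in hits:
            if size < ratio * largest:
                flagged.append((path, time, size, largest))
    return flagged
```

Run over the two example lines, this flags the 360-byte `/css.php` response against the 14946-byte one.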

It seems every SSL-delivered page has a different but individually fixed "truncation point". Our user-profile page is truncated after 4973 bytes for one user but is normally 17672 bytes (gzipped); the same script showing a different user profile is 17692 bytes and, if truncated, just 4970 bytes. The truncation always happens at the same HTML tag! Inside PHP there seems to be nothing special at this point: no explicit "flush()" and not even a newline after the last delivered HTML tag; the following content is just generated by the next print command. So perhaps some internal PHP buffer fills up at about that position, so that the next (big) print() always ends up in the next chunk?

So that script always shows the same kind of page stub, and the css.php script never shows any content, whenever they deliver faulty content. The probability of a corrupted page is much higher (40-60%) for the more complex user-profile page than for the login (which produces a redirect) or the CSS output.

I wasn't able to get faulty pages with the wget command. But then I haven't tested with keep-alive, gzip and cookies...

I'm now trying to rewrite the authentication process so it is better suited to handle this situation at high-traffic times. ;o) By the way, what is the main drawback of globally disabling "chunked encoding"? I just found the notion that it is bad for "keep-alive connections". Would a big download and parallel surfing on the same site create a problem in this case?
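On the chunked-encoding question: in HTTP/1.1 a chunked body ends with a zero-length chunk, so the end of a response of unknown length is marked without closing the connection. Without chunking (and without a Content-Length), the end can only be signaled by closing the connection, which rules out keep-alive for that response. The sketch below illustrates the general protocol behavior, not Cherokee's implementation: it decodes a raw chunked body and reports whether the terminating chunk arrived, which is how a client could tell a truncated response from a complete one. The function name is illustrative.

```python
def decode_chunked(raw):
    """Decode an HTTP/1.1 chunked body given as bytes.

    Returns (payload, complete); complete is True only if the terminating
    zero-length chunk was seen. Trailers after it are ignored.
    """
    payload = bytearray()
    pos = 0
    while True:
        eol = raw.find(b"\r\n", pos)
        if eol == -1:
            return bytes(payload), False        # chunk-size line cut off
        size_field = raw[pos:eol].split(b";", 1)[0].strip()  # drop chunk extensions
        try:
            size = int(size_field.decode("ascii"), 16)
        except ValueError:
            return bytes(payload), False        # malformed chunk size
        if size < 0:
            return bytes(payload), False
        if size == 0:
            return bytes(payload), True         # last-chunk marker reached
        start = eol + 2
        chunk = raw[start:start + size]
        payload += chunk
        if len(chunk) < size:
            return bytes(payload), False        # chunk data cut off
        pos = start + size + 2                  # skip the CRLF after the data
```

A body that stops in the middle of a chunk, or after a chunk without the final `0` line, comes back with `complete` set to `False`.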

danielniccoli commented 11 years ago

From ste...@konink.de on October 16, 2010 13:35:00 Can you disable any stuff like GZIP/Deflate? The big pain is of course that to debug it you would have to do very dirty things, and I strongly advise against doing this in production. Maybe you can try out the svn version and report back if you have the same issues there.

danielniccoli commented 11 years ago

From ste...@konink.de on October 16, 2010 14:04:41 I strongly feel that you are bitten by something additional...

1) the global network timeout (it basically kills the connection to the backend, PHP, and kills the connection to your client)
2) your PHP timeout is probably higher than the network timeout (15s), so try to increase the network timeout; that should reduce the 504 messages (and make PHP time out before that error)

cherokee / webserver

1.0.2 + SSL breaks everything #691