fukamachi / woo

A fast non-blocking HTTP server on top of libev
http://ultra.wikia.com/wiki/Woo_(kaiju)
MIT License

Workers are selected sequentially and request processing may hang while free workers are available #100

Closed. svetlyak40wt closed this issue 1 year ago

svetlyak40wt commented 1 year ago

This leads to a situation where a new request ends up assigned to a busy worker and waits for it, even though other free workers are available.

How to reproduce

  1. Create a simple application which does (sleep 15) for the /sleep URL and responds immediately on / (a sketch is shown after this list).
  2. Start this app with :worker-num 4.
  3. In first console hit the sleep URL: curl localhost:8080/sleep
  4. In the second console start executing curl localhost:8080/. The first three attempts will return immediately, but the fourth will wait for the worker processing the request from step 3.
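
A minimal sketch of such an application, using Clack to start Woo. Port 8080 matches the curl commands above; the response bodies and handler details are my own choices, not part of Woo:

(ql:quickload '(:clack :woo))

(defvar *app*
  (lambda (env)
    (if (string= (getf env :path-info) "/sleep")
        (progn
          (sleep 15)                                     ; simulate a slow request
          '(200 (:content-type "text/plain") ("slept")))
        '(200 (:content-type "text/plain") ("ok")))))   ; respond immediately on /

;; Step 2: start with 4 workers.
(clack:clackup *app* :server :woo :worker-num 4 :port 8080)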

Expected behaviour

Free workers are reused for new requests; busy workers are skipped.

This issue is probably somehow related to this old discussion about performance and queues: Better worker mechanism.

fukamachi commented 1 year ago

Thank you for reporting. I'm happy that Woo has come to be used enough to get asked about this kind of internal problem.

Woo takes a round-robin approach to assigning jobs to workers because it is simple to implement, easy to understand, and has little impact on performance in most cases. However, when there is a large variance in the execution time of individual jobs, as you pointed out, some workers can get stuck with heavy jobs while others sit idle. Since this depends on the application being run, I would prefer that the user be able to choose the scheduling method.
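
Roughly speaking (not the actual code, just a self-contained illustration with made-up names), the difference between the two strategies looks like this:

(defstruct worker (busy-p nil))          ; BUSY-P marks a worker with a job in flight

(defvar *workers* (coerce (loop repeat 4 collect (make-worker)) 'vector))
(defvar *next-worker* 0)

(defun pick-worker-round-robin ()
  ;; Current behaviour: hand the job to the next worker in order, busy or not.
  (prog1 (aref *workers* *next-worker*)
    (setf *next-worker* (mod (1+ *next-worker*) (length *workers*)))))

(defun pick-worker-prefer-free ()
  ;; Alternative: use a free worker when one exists, falling back to
  ;; round-robin only when every worker is busy.
  (or (find-if-not #'worker-busy-p *workers*)
      (pick-worker-round-robin)))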

svetlyak40wt commented 1 year ago

Are there other scheduling methods I can try to switch to?

fukamachi commented 1 year ago

Nothing yet. Actually, I've been waiting for someone to point out that it has become a real problem.

svetlyak40wt commented 1 year ago

I believe the problem can be caused not only by heavy jobs inside an application's request handler, but also by malicious clients performing a Slow HTTP DoS attack. I haven't checked this yet, but I'm quite sure the result will be the same.

fukamachi commented 1 year ago

Alright. I think a good place to start would be to find out what is going on, beginning with adding a way to get Woo's internal statistics.

svetlyak40wt commented 1 year ago

I have no idea what kind of statistics could help here.

fukamachi commented 1 year ago

It is for the situation that you mentioned in the previous comment, since you don't seem to be sure what is going on:

I believe the problem can be caused not only by heavy jobs inside an application's request handler, but also by malicious clients performing a Slow HTTP DoS attack. I haven't checked this yet, but I'm quite sure the result will be the same.

svetlyak40wt commented 1 year ago

I think that during a Slow HTTP DoS attack Woo will give up sooner than other web servers with more advanced worker schedulers. Imagine you have 100 workers. Usually a Slow HTTP attack needs 100 slow connections to exhaust them, whereas with Woo a single such connection would be enough.

svetlyak40wt commented 1 year ago

I'll try to find some time to test Woo and Hunchentoot against this kind of attack.

svetlyak40wt commented 1 year ago

Slowloris test results

Today I tested how Woo and Hunchentoot behave under load during a Slowloris DoS attack. I found that Woo is vulnerable to a lesser degree than Hunchentoot, but it is still possible to bring the server down. In my server configuration it required about 1000 simultaneous slow connections, whereas Hunchentoot went down after 100 (because its thread pool is limited to 100 workers by default).

Conclusion:

Hunchentoot and Woo are both vulnerable, but Woo requires about 10 times more connections from the attacker (this is not much of an obstacle, though, because Slowloris does not require much bandwidth).

Some techniques could probably be applied to protect Woo against this kind of attack; the Wikipedia article on Slowloris lists a few of them.

Slow workers test results

Also, I realized that the Slowloris attack is not the problem I opened this issue for. I opened it because Woo hangs when processing some requests takes a significant amount of time. Here it still behaves worse than Hunchentoot.

To test this problem, I created an app which sleeps for 1 second before each response. Woo was started with :worker-num 10. Then I ran Apache Benchmark with a concurrency of 20.
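
For reference, a sketch of the setup, reusing the Clack/Woo startup shown earlier; the handler body and port are my own choices, and the ab invocation is reconstructed from the numbers below:

(defvar *slow-app*
  (lambda (env)
    (declare (ignore env))
    (sleep 1)                                   ; 1 second of simulated work
    '(200 (:content-type "text/plain") ("ok"))))

(clack:clackup *slow-app* :server :woo :worker-num 10 :port 8080)

;; Apache Benchmark run (100 requests, concurrency 20):
;;   ab -n 100 -c 20 http://localhost:8080/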

In this configuration the server can serve only 10 requests per second. This means that with a concurrency of 20 some clients will have to wait, and the average response time should be about 2 seconds instead of 1. The test shows the expected performance:

Concurrency Level:      20
Time taken for tests:   11.232 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      11400 bytes
HTML transferred:       1200 bytes
Requests per second:    8.90 [#/sec] (mean)
Time per request:       2246.415 [ms] (mean)
Time per request:       112.321 [ms] (mean, across all concurrent requests)
Transfer rate:          0.99 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        8   12   3.6     11      41
Processing:  1012 1915 318.8   1993    2133
Waiting:     1012 1915 318.8   1993    2133
Total:       1024 1927 318.8   2002    2145

Percentage of the requests served within a certain time (ms)
  50%   2002
  66%   2026
  75%   2027
  80%   2048
  90%   2143
  95%   2144
  98%   2144
  99%   2145
 100%   2145 (longest request)

When Woo is started in single-threaded mode, it takes, as expected, about 20 seconds to process a response when requests come from 20 concurrent clients:

Concurrency Level:      20
Time taken for tests:   100.620 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      11400 bytes
HTML transferred:       1200 bytes
Requests per second:    0.99 [#/sec] (mean)
Time per request:       20124.062 [ms] (mean)
Time per request:       1006.203 [ms] (mean, across all concurrent requests)
Transfer rate:          0.11 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        8   23  25.3     10      74
Processing:  1279 18131 4176.7  19997   21080
Waiting:     1277 18131 4176.8  19997   21080
Total:       1297 18154 4181.8  20007   21092

Percentage of the requests served within a certain time (ms)
  50%  20007
  66%  20062
  75%  20171
  80%  20171
  90%  20172
  95%  20189
  98%  20219
  99%  21092
 100%  21092 (longest request)

Conclusion

For production we can protect ourselves from a large variance in execution time by applying a timeout. Also, a reasonable number of threads should be given as the worker-num argument. But this won't protect against a Slowloris attack with a large number of simultaneous connections :(
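
For illustration, here is a rough sketch of how such a timeout could be applied by wrapping the application before handing it to Woo. bt:with-timeout comes from bordeaux-threads (and may not be supported on every implementation); the 5-second limit, the 503 response, and the *app* variable are my own choices, not anything Woo provides:

(ql:quickload '(:clack :woo :bordeaux-threads))

(defun wrap-with-timeout (app &key (seconds 5))
  "Return a Clack app that gives up on APP after SECONDS seconds."
  (lambda (env)
    (handler-case
        (bt:with-timeout (seconds)
          (funcall app env))
      (bt:timeout ()
        '(503 (:content-type "text/plain") ("Request timed out"))))))

(clack:clackup (wrap-with-timeout *app* :seconds 5)
               :server :woo :worker-num 10 :port 8080)

Note that this only bounds the application handler's run time; it does not help against Slowloris, where the client is slow before the handler even runs.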

Notes

For the Slowloris attack I used this tool running in Docker: https://github.com/shekyan/slowHttpTest

This is the code I used to start the server: https://github.com/svetlyak40wt/slowloris-test

Here is the full video recording of my investigation: https://www.twitch.tv/videos/1663942905

svetlyak40wt commented 1 year ago

@fukamachi I think this issue should be closed, because it comes down to a scheduling issue that can be worked around by placing a timeout on request processing.

But you will probably want to mitigate Slowloris attacks, so a new issue should be created for that instead.