SeldonIO / MLServer

An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more
https://mlserver.readthedocs.io/en/latest/
Apache License 2.0

Efficiency with multiprocessing: does not work when setting 'parallel_workers > 1' #1046

Open ooooona opened 1 year ago

ooooona commented 1 year ago

Hi, I'm trying to improve the throughput of my server, which runs on MLServer, and I learned that I can set 'parallel_workers > 1' to enable parallel inference. Hence I set it in settings.json as below:

{
    "parallel_workers": 10,
    "debug": "true"
}
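
For context, the model being benchmarked also needs a model-settings.json next to this file. A minimal sketch for the sklearn model implied by the inference URL in the benchmark command below (only the name 'sklearn' comes from that URL; the model URI is an assumption):

{
    "name": "sklearn",
    "implementation": "mlserver_sklearn.SKLearnModel",
    "parameters": {
        "uri": "./model.joblib"
    }
}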

Then I used ab (Apache Benchmark) to test my server while using top to monitor CPU and memory usage. I can see that the server really does fork 10 processes. However, only 1 process actually worked while the others did nothing. (Screenshot 2023-03-15 at 23 19 07)

The ab test results showed that 'parallel_workers=1' has the same latency as 'parallel_workers=10':

'parallel_workers=10':

ab -n 10000 -c 10 -T  application/json -p sklearn-mlserver.json http://localhost:8080/v2/models/sklearn/infer
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:        uvicorn
Server Hostname:        localhost
Server Port:            8080

Document Path:          /v2/models/sklearn/infer
Document Length:        184 bytes

Concurrency Level:      10
Time taken for tests:   25.962 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      6220000 bytes
Total body sent:        3340000
HTML transferred:       1840000 bytes
Requests per second:    385.18 [#/sec] (mean)
Time per request:       25.962 [ms] (mean)
Time per request:       2.596 [ms] (mean, across all concurrent requests)
Transfer rate:          233.97 [Kbytes/sec] received
                        125.63 kb/s sent
                        359.60 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       3
Processing:     4   26   4.5     24      69
Waiting:        4   24   4.3     23      67
Total:          5   26   4.5     25      69

Percentage of the requests served within a certain time (ms)
  50%     25
  66%     26
  75%     27
  80%     28
  90%     30
  95%     34
  98%     42
  99%     46
 100%     69 (longest request)

'parallel_workers=1' (Screenshot 2023-03-15 at 23 29 16):

ab -n 10000 -c 10 -T  application/json -p sklearn-mlserver.json http://localhost:8080/v2/models/sklearn/infer
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:        uvicorn
Server Hostname:        localhost
Server Port:            8080

Document Path:          /v2/models/sklearn/infer
Document Length:        184 bytes

Concurrency Level:      10
Time taken for tests:   28.567 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      6220000 bytes
Total body sent:        3340000
HTML transferred:       1840000 bytes
Requests per second:    350.06 [#/sec] (mean)
Time per request:       28.567 [ms] (mean)
Time per request:       2.857 [ms] (mean, across all concurrent requests)
Transfer rate:          212.63 [Kbytes/sec] received
                        114.18 kb/s sent
                        326.81 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       5
Processing:     9   28  10.1     25     110
Waiting:        8   27   9.6     24     107
Total:          9   28  10.2     25     111

Percentage of the requests served within a certain time (ms)
  50%     25
  66%     27
  75%     28
  80%     29
  90%     35
  95%     49
  98%     66
  99%     76
 100%    111 (longest request)
adriangonz commented 1 year ago

Hey @ooooona ,

Depending on your benchmark settings, there may not be enough traffic to keep the other workers busy in parallel. Generally, MLServer will round-robin requests across workers. This is the case for MLServer > 1.2.0 (which version of MLServer are you using?).

However, if there aren't enough concurrent requests (e.g. when a single client sends requests serially, or requests are processed too quickly), each worker will finish processing its request before the next one comes in, which effectively looks like only one of them is working.
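
To illustrate the effect described above, here is a minimal Python sketch (not MLServer code; the pool size and the simulated inference time are arbitrary assumptions). A pool fed by a client that blocks on each response keeps at most one worker busy at a time, while a client with many in-flight requests spreads work across all of them:

import multiprocessing
import time

def handle(request_id: int) -> int:
    # Simulate a fast inference call (the duration is an arbitrary assumption).
    time.sleep(0.01)
    return request_id

if __name__ == "__main__":
    with multiprocessing.Pool(processes=10) as pool:
        # Serial client: apply() blocks until each result returns,
        # so only one worker is ever busy at a time.
        for i in range(100):
            pool.apply(handle, (i,))

        # Concurrent client: many in-flight requests are distributed
        # across the pool, so all 10 workers process work in parallel.
        pool.map(handle, range(100))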

ooooona commented 1 year ago

hi @adriangonz ,

  1. I checked my mlserver version: mlserver, version 1.3.0.dev3. Actually, I built the image from git at commit 'eaa056371befccf74c66efc62192ffdd3c4a254e'.
  2. As you can see from my testing command, the total request count is 10,000 and, for the first case, the concurrency is 10. I think that traffic is quite big. Today I even tried with 100,000 total requests and concurrency 100. top showed the same thing: only one process had ~100% CPU usage while the others were at ~0%, and the p99 latency increased to 390ms (concurrency 1: 12ms p99; concurrency 10: 45ms p99).

I also ran the same test against seldon-core-microservice, and its multiprocessing really worked: not only in CPU usage, but also in improved throughput (reduced latency). So I think there might be something wrong with mlserver.

adriangonz commented 1 year ago

Hey @ooooona ,

Thanks for providing those details.

Could you share more info on the type of requests you are sending? How large are these?

Deserialisation happens on the main process - so if these are large requests, that could be a potential bottleneck.
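
One quick way to sanity-check that hypothesis is to time deserialisation of the actual request body in isolation and compare it against the per-request latency ab reported (~2.6 ms mean across all concurrent requests). A minimal Python sketch, assuming the same sklearn-mlserver.json file used with ab:

import json
import timeit

# Read the exact body that ab posts to the server.
payload = open("sklearn-mlserver.json").read()

n = 10_000
seconds = timeit.timeit(lambda: json.loads(payload), number=n)
print(f"mean deserialisation time: {seconds / n * 1000:.4f} ms")

If that number is a tiny fraction of the request latency, main-process deserialisation is unlikely to be the bottleneck for this payload.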

ooooona commented 1 year ago

hi @adriangonz, sorry for my late reply. My request payload is quite small:

$ cat sklearn-mlserver.json
{
    "inputs": [
        {
            "name": "args",
            "shape": [1,4],
            "datatype": "FP32",
            "data":  [10.1,13.5,1.4,0.2]
        }
    ]
}
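
For reference, the same request can be reproduced from Python; the endpoint and body match the ab command above (using the requests library here is my own choice, any HTTP client works):

import json

import requests

# Load the same body that ab posts in the benchmark.
with open("sklearn-mlserver.json") as f:
    body = json.load(f)

# Endpoint taken from the ab command: the V2 inference route for model 'sklearn'.
resp = requests.post("http://localhost:8080/v2/models/sklearn/infer", json=body)
print(resp.status_code, resp.json())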