Closed by mwm5945 3 years ago
K8s Version: 1.17.8, Istio Version: 1.2.5 (old, I know :D)
A few questions to try to narrow things down:
Could this be something to investigate @mwm5945 https://github.com/SeldonIO/seldon-core/issues/931
I'll take a look at these... though I'm on vacation next week, so it may be a little while :)
Does it happen without the istio sidecar?
Yes.
Does it happen also with REST rather than gRPC?
We're gRPC throughout; testing with REST would be a bit difficult, to be honest, since I can't mix protocols like I could with the old engine.
Is this under constant high load or does it happen under low load?
High load.
Does it happen with Flask (debug mode) as opposed to Gunicorn?
Since we're using GRPC, this doesn't actually affect anything (See https://github.com/SeldonIO/seldon-core/issues/3238)
Could this be something to investigate @mwm5945 #931
I doubt it, since the only thing my "empty" graph imports is opentracing (no other dependencies).
( @axsaucedo from our slack convo)
Unsurprisingly, performance "normalizes" when the number of replicas increases from 2 to 5 (at about ~250 TPS), which leads me to believe something is going on with Python behind the scenes (maybe the GIL?).
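The GIL hypothesis is easy to sanity-check outside Seldon: pure-Python, CPU-bound work doesn't parallelize across threads in a single process, so per-request latency grows roughly linearly with concurrent clients until more replicas (i.e. processes) are added. A minimal, self-contained sketch (the workload and timings here are illustrative, not a Seldon benchmark):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_work(n):
    # Pure-Python arithmetic holds the GIL, so concurrent calls in threads
    # end up time-slicing one core rather than running in parallel.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed_threads(workers, n=200_000):
    # Run `workers` copies of cpu_work concurrently and time the batch.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(cpu_work, [n] * workers))
    return time.perf_counter() - start, results

if __name__ == "__main__":
    t1, _ = timed_threads(1)
    t5, _ = timed_threads(5)
    # Expect t5 to be roughly 5x t1: five threads multiply per-request
    # latency instead of dividing it, mirroring the 1-client vs 5-client
    # numbers reported in this thread.
    print(f"1 thread: {t1:.3f}s, 5 threads: {t5:.3f}s")
```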
Does it happen also with REST rather than gRPC?
I've been able to test locally with a real SKLearn model (not just an empty one), and REST does perform significantly better than gRPC:

| | REST (default) | REST (10 workers, 1 thread) | gRPC (1 process, 10 threads) | gRPC (10 processes, 1 thread) |
|---|---|---|---|---|
| 1 client thread | 95ms | 98ms | 93ms | 93ms |
| 5 client threads | 836ms | 140ms | 863.83ms | 147.5ms |
I've created my own version of the Python microservice and implemented gRPC multiprocessing as done here: https://github.com/grpc/grpc/tree/master/examples/python/multiprocessing
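For context, the upstream gRPC multiprocessing example works by pre-forking server processes that all bind the same port with `SO_REUSEPORT`, letting the kernel spread incoming connections across them, so each request lands in a process with its own GIL. A stdlib-only sketch of just that binding trick (the port number is arbitrary; gRPC exposes the same knob as the `grpc.so_reuseport` channel option):

```python
import socket

def reuseport_listener(port):
    # With SO_REUSEPORT set before bind(), several processes can listen on
    # the same port and the kernel load-balances accepted connections among
    # them. This is what lets N pre-forked gRPC servers share one address.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen()
    return sock

if __name__ == "__main__":
    # Two listeners on the same port succeed only because of SO_REUSEPORT;
    # in the real setup each would live in its own forked worker process.
    a = reuseport_listener(54331)
    b = reuseport_listener(54331)
    print(a.getsockname(), b.getsockname())
    a.close()
    b.close()
```

Requires a platform with `SO_REUSEPORT` (Linux 3.9+, macOS).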
As per our discussion on Zoom, I'll re-create the above test using a mock model for replication on the Seldon side, and I'll include p* stats as well. (Cc: @axsaucedo @RafalSkolasinski )
Thank you @mwm5945 - as discussed, here are the conclusions from earlier today:
There are similar points reported on REST:
Based on these, the action points are:
Testing on a local machine via a Python 3.7 container, with a basic SKLearn model:
❯ ghz --proto prediction.proto -i . --insecure -c 1 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 15156
Total: 60.00 s
Slowest: 108.71 ms
Fastest: 2.57 ms
Average: 3.91 ms
Requests/sec: 252.60
Response time histogram:
2.569 [1] |
13.184 [15126] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
23.798 [20] |
34.413 [2] |
45.027 [2] |
55.641 [2] |
66.256 [0] |
76.870 [0] |
87.484 [1] |
98.099 [0] |
108.713 [1] |
Latency distribution:
10 % in 3.12 ms
25 % in 3.36 ms
50 % in 3.71 ms
75 % in 4.16 ms
90 % in 4.74 ms
95 % in 5.28 ms
99 % in 7.22 ms
Status code distribution:
[OK] 15155 responses
[Unavailable] 1 responses
Error distribution:
[1] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 5 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 13117
Total: 60.00 s
Slowest: 262.70 ms
Fastest: 4.72 ms
Average: 22.82 ms
Requests/sec: 218.62
Response time histogram:
4.718 [1] |
30.515 [11540] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
56.313 [1483] |∎∎∎∎∎
82.111 [58] |
107.909 [5] |
133.707 [7] |
159.505 [3] |
185.303 [4] |
211.101 [1] |
236.899 [5] |
262.696 [5] |
Latency distribution:
10 % in 13.57 ms
25 % in 17.72 ms
50 % in 21.95 ms
75 % in 26.56 ms
90 % in 31.36 ms
95 % in 34.80 ms
99 % in 47.87 ms
Status code distribution:
[Unavailable] 5 responses
[OK] 13112 responses
Error distribution:
[5] rpc error: code = Unavailable desc = transport is closing
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=1 | vegeta report -type=text
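A note on the command above: vegeta's JSON target format wants the request body base64-encoded, which is what the `| @base64` jq filter does. The same target can be built in Python (a sketch using the same URL and payload as the shell pipeline):

```python
import base64
import json

body = json.dumps({"data": {"ndarray": [[10, 100]]}})
target = {
    "method": "POST",
    "url": "http://localhost:5000/api/v1.0/predictions",
    # vegeta's -format=json expects the body as a base64 string
    "body": base64.b64encode(body.encode()).decode(),
    "header": {"Content-Type": ["application/json"]},
}
print(json.dumps(target))
```

Piping that output into `vegeta attack -format=json ...` is equivalent to the jq one-liner.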
Requests [total, rate, throughput] 7646, 127.43, 127.42
Duration [total, attack, wait] 1m0s, 1m0s, 7.531ms
Latencies [min, mean, 50, 90, 95, 99, max] 5.14ms, 7.845ms, 7.416ms, 8.665ms, 9.528ms, 13.803ms, 919.981ms
Bytes In [total, mean] 443468, 58.00
Bytes Out [total, mean] 237026, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:7646
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=10 | vegeta report -type=text
Requests [total, rate, throughput] 7515, 125.23, 125.16
Duration [total, attack, wait] 1m0s, 1m0s, 35.299ms
Latencies [min, mean, 50, 90, 95, 99, max] 12.396ms, 79.879ms, 76.715ms, 103.414ms, 115.258ms, 152.349ms, 246.422ms
Bytes In [total, mean] 435870, 58.00
Bytes Out [total, mean] 232965, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:7515
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=1 | vegeta report -type=text
Requests [total, rate, throughput] 8349, 139.13, 139.12
Duration [total, attack, wait] 1m0s, 1m0s, 7.65ms
Latencies [min, mean, 50, 90, 95, 99, max] 4.964ms, 7.185ms, 6.816ms, 8.378ms, 9.212ms, 12.838ms, 54.27ms
Bytes In [total, mean] 484242, 58.00
Bytes Out [total, mean] 258819, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:8349
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=5 | vegeta report -type=text
Requests [total, rate, throughput] 14290, 238.15, 238.09
Duration [total, attack, wait] 1m0s, 1m0s, 15.183ms
Latencies [min, mean, 50, 90, 95, 99, max] 6.602ms, 20.99ms, 17.953ms, 28.09ms, 32.741ms, 60.559ms, 997.28ms
Bytes In [total, mean] 828820, 58.00
Bytes Out [total, mean] 442990, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:14290
Error Set:
❯ ghz --proto prediction.proto -i . --insecure -c 1 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 12405
Total: 60.00 s
Slowest: 63.22 ms
Fastest: 3.09 ms
Average: 4.79 ms
Requests/sec: 206.75
Response time histogram:
3.090 [1] |
9.102 [12259] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
15.115 [79] |
21.128 [51] |
27.140 [9] |
33.153 [3] |
39.166 [0] |
45.178 [1] |
51.191 [0] |
57.204 [0] |
63.217 [1] |
Latency distribution:
10 % in 3.76 ms
25 % in 4.06 ms
50 % in 4.49 ms
75 % in 5.09 ms
90 % in 5.88 ms
95 % in 6.57 ms
99 % in 9.75 ms
Status code distribution:
[Unavailable] 1 responses
[OK] 12404 responses
Error distribution:
[1] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 5 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 10313
Total: 60.00 s
Slowest: 182.28 ms
Fastest: 6.22 ms
Average: 29.04 ms
Requests/sec: 171.88
Response time histogram:
6.223 [1] |
23.829 [2607] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
41.434 [6999] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
59.040 [634] |∎∎∎∎
76.645 [49] |
94.251 [7] |
111.856 [4] |
129.462 [1] |
147.067 [4] |
164.673 [1] |
182.278 [1] |
Latency distribution:
10 % in 19.52 ms
25 % in 23.76 ms
50 % in 28.26 ms
75 % in 33.19 ms
90 % in 38.85 ms
95 % in 43.57 ms
99 % in 56.12 ms
Status code distribution:
[OK] 10308 responses
[Unavailable] 5 responses
Error distribution:
[5] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 1 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 13514
Total: 60.00 s
Slowest: 125.90 ms
Fastest: 2.91 ms
Average: 4.39 ms
Requests/sec: 225.23
Response time histogram:
2.911 [1] |
15.211 [13491] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
27.510 [13] |
39.809 [3] |
52.108 [0] |
64.408 [1] |
76.707 [1] |
89.006 [2] |
101.306 [0] |
113.605 [0] |
125.904 [1] |
Latency distribution:
10 % in 3.57 ms
25 % in 3.92 ms
50 % in 4.26 ms
75 % in 4.65 ms
90 % in 5.11 ms
95 % in 5.47 ms
99 % in 7.00 ms
Status code distribution:
[OK] 13513 responses
[Unavailable] 1 responses
Error distribution:
[1] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 5 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 13735
Total: 60.00 s
Slowest: 134.07 ms
Fastest: 4.68 ms
Average: 21.79 ms
Requests/sec: 228.92
Response time histogram:
4.678 [1] |
17.617 [663] |∎∎
30.557 [12741] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
43.497 [242] |∎
56.436 [33] |
69.376 [24] |
82.316 [13] |
95.255 [6] |
108.195 [4] |
121.135 [2] |
134.074 [1] |
Latency distribution:
10 % in 18.24 ms
25 % in 19.69 ms
50 % in 21.08 ms
75 % in 22.81 ms
90 % in 25.15 ms
95 % in 27.45 ms
99 % in 36.96 ms
Status code distribution:
[OK] 13730 responses
[Unavailable] 5 responses
Error distribution:
[5] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 5 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=5 localhost:5000
Summary:
Count: 22896
Total: 60.00 s
Slowest: 55.57 ms
Fastest: 5.36 ms
Average: 13.03 ms
Requests/sec: 381.60
Response time histogram:
5.362 [1] |
10.383 [4107] |∎∎∎∎∎∎∎∎∎∎∎
15.404 [14815] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
20.425 [3291] |∎∎∎∎∎∎∎∎∎
25.447 [486] |∎
30.468 [112] |
35.489 [38] |
40.510 [22] |
45.531 [11] |
50.552 [4] |
55.574 [4] |
Latency distribution:
10 % in 9.61 ms
25 % in 10.92 ms
50 % in 12.49 ms
75 % in 14.41 ms
90 % in 16.91 ms
95 % in 18.86 ms
99 % in 24.67 ms
Status code distribution:
[OK] 22891 responses
[Unavailable] 5 responses
Error distribution:
[5] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 10 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=10 localhost:5000
Summary:
Count: 22931
Total: 60.00 s
Slowest: 394.52 ms
Fastest: 6.53 ms
Average: 26.09 ms
Requests/sec: 382.18
Response time histogram:
6.525 [1] |
45.324 [20152] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
84.124 [2645] |∎∎∎∎∎
122.923 [111] |
161.722 [3] |
200.521 [1] |
239.321 [0] |
278.120 [3] |
316.919 [2] |
355.719 [1] |
394.518 [2] |
Latency distribution:
10 % in 15.05 ms
25 % in 17.61 ms
50 % in 20.93 ms
75 % in 26.42 ms
90 % in 51.65 ms
95 % in 62.43 ms
99 % in 78.89 ms
Status code distribution:
[OK] 22921 responses
[Unavailable] 10 responses
Error distribution:
[10] rpc error: code = Unavailable desc = transport is closing
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=1 | vegeta report -type=text
Requests [total, rate, throughput] 11183, 186.38, 186.36
Duration [total, attack, wait] 1m0s, 1m0s, 5.486ms
Latencies [min, mean, 50, 90, 95, 99, max] 3.319ms, 5.364ms, 5.09ms, 6.513ms, 7.297ms, 10.422ms, 253.062ms
Bytes In [total, mean] 648614, 58.00
Bytes Out [total, mean] 346673, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:11183
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=5 | vegeta report -type=text
Requests [total, rate, throughput] 9026, 150.42, 150.38
Duration [total, attack, wait] 1m0s, 1m0s, 16.055ms
Latencies [min, mean, 50, 90, 95, 99, max] 6.069ms, 33.245ms, 32.001ms, 42.656ms, 46.568ms, 55.496ms, 402.204ms
Bytes In [total, mean] 523508, 58.00
Bytes Out [total, mean] 279806, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:9026
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=1 | vegeta report -type=text
Requests [total, rate, throughput] 6973, 116.20, 116.19
Duration [total, attack, wait] 1m0s, 1m0s, 8.144ms
Latencies [min, mean, 50, 90, 95, 99, max] 5.607ms, 8.603ms, 8.231ms, 9.7ms, 10.404ms, 12.848ms, 385.654ms
Bytes In [total, mean] 404434, 58.00
Bytes Out [total, mean] 216163, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:6973
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=5 | vegeta report -type=text
Requests [total, rate, throughput] 12283, 204.43, 204.40
Duration [total, attack, wait] 1m0s, 1m0s, 10.642ms
Latencies [min, mean, 50, 90, 95, 99, max] 7.442ms, 24.451ms, 21.396ms, 31.013ms, 35.486ms, 69.883ms, 1.024s
Bytes In [total, mean] 712414, 58.00
Bytes Out [total, mean] 380773, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:12283
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=10 | vegeta report -type=text
Requests [total, rate, throughput] 11764, 195.93, 195.89
Duration [total, attack, wait] 1m0s, 1m0s, 10.477ms
Latencies [min, mean, 50, 90, 95, 99, max] 10.477ms, 51.03ms, 45.36ms, 64.456ms, 76.103ms, 173.435ms, 1.055s
Bytes In [total, mean] 682312, 58.00
Bytes Out [total, mean] 364684, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:11764
Error Set:
ModelMain.py:
```python
from collections.abc import Iterable
import sys
import logging

import joblib
import numpy

logger = logging.getLogger(__name__)
handler = logging.StreamHandler(sys.stdout)
formatter = logging.Formatter('[PID %(process)d - TID %(thread)d] %(lineno)d %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.INFO)


class SKR(object):
    def __init__(self):
        logger.info("Initializing model")
        self.class_names = ["random"]  # model output labels
        self.model = joblib.load("../model/scikit_random_model.sav")

    def predict(self, feature_values: numpy.ndarray, feature_names: Iterable):
        logger.info("Predict called")
        # slice and keep only the second column (positive-class probabilities)
        prediction = self.model.predict_proba(feature_values)[:, 1]
        logger.info("Returning prediction: %s", prediction)
        return numpy.array([prediction])  # must return at least a 2-D array
```
requires scikit-learn==0.21.3
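For clarity on the slicing in `predict`: `predict_proba` returns one probability per class per input row, `[:, 1]` keeps the positive-class column, and wrapping the result makes it the 2-D array the wrapper requires. Illustrated with a mocked `predict_proba` output for two rows and two classes:

```python
import numpy

# A mocked predict_proba result: two input rows, two classes each.
proba = numpy.array([[0.2, 0.8],
                     [0.9, 0.1]])
prediction = proba[:, 1]         # positive-class probability per row
out = numpy.array([prediction])  # wrap so the return value is 2-D
print(out.shape, out)
```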
I've had to email you the zip file containing the model.
This is absolutely fantastic @mwm5945 !! Thank you for such a detailed contribution with benchmark results and the model class - this will be really helpful as we continue diving into this optimization and benchmarking work 🚀
Per @cliveseldon's suggestion, testing using tensors (default config of Seldon 1.3):
❯ ghz --proto prediction.proto -i . --insecure -c 1 -z 60s -D payload3.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 10739
Total: 60.00 s
Slowest: 392.76 ms
Fastest: 2.97 ms
Average: 5.53 ms
Requests/sec: 178.98
Response time histogram:
2.972 [1] |
41.951 [10733] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
80.930 [1] |
119.909 [0] |
158.888 [1] |
197.867 [0] |
236.846 [0] |
275.825 [0] |
314.804 [1] |
353.783 [0] |
392.762 [1] |
Latency distribution:
10 % in 4.07 ms
25 % in 4.50 ms
50 % in 5.03 ms
75 % in 5.82 ms
90 % in 7.12 ms
95 % in 8.39 ms
99 % in 12.97 ms
Status code distribution:
[OK] 10738 responses
[Unavailable] 1 responses
Error distribution:
[1] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 5 -z 60s -D payload3.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 8992
Total: 60.00 s
Slowest: 1.05 s
Fastest: 9.71 ms
Average: 33.30 ms
Requests/sec: 149.86
Response time histogram:
9.712 [1] |
113.526 [8969] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
217.341 [10] |
321.155 [0] |
424.969 [5] |
528.783 [0] |
632.597 [0] |
736.411 [0] |
840.225 [0] |
944.039 [0] |
1047.853 [2] |
Latency distribution:
10 % in 22.48 ms
25 % in 26.82 ms
50 % in 31.45 ms
75 % in 37.42 ms
90 % in 44.23 ms
95 % in 50.62 ms
99 % in 66.55 ms
Status code distribution:
[Unavailable] 5 responses
[OK] 8987 responses
Error distribution:
[5] rpc error: code = Unavailable desc = transport is closing
This is likely solved with the multiprocessing/GRPC update
Seldon Version: 1.3.0 Using executor: Yes Protocol: GRPC throughout
We have a graph with three paths that terminate in a combiner. The first two paths are just H2O MOJO models using the Java wrapper (non-JNI). The third path is a Python input transformer and model; the combiner is, of course, also Python. The H2O models are reliably fast, usually under 1ms, but the Python models can spike in latency from time to time. The Python nodes are fairly simple: they filter out some features and run an internally developed model (which usually takes ~7ms), and the combiner concatenates the paths' outputs and returns them.
Overall, the graph's p75 latency is 10ms, but its p99 is ~300ms. After replacing the Python nodes with stubs (see below for the class), a node can still take around 13ms to execute, even when it isn't doing anything.
I've captured some DataDog traces (see #2437 :D), but GitHub is having some trouble uploading them, so I'll see if I can get those up later.
Stubs:
@axsaucedo @adriangonz for visibility from Slack