Closed by mwm5945 3 years ago
K8s Version: 1.17.8, Istio Version: 1.2.5 (old, I know :D)
A few questions to try to narrow things down:
Could this be something to investigate @mwm5945 https://github.com/SeldonIO/seldon-core/issues/931
I'll take a look at these... though I'm on vacation next week, so it may be a little while :)
Does it happen without the istio sidecar?
Yes.
Does it happen also with REST rather than gRPC?
We're gRPC throughout; testing with REST would be a bit difficult, to be honest, since I can't mix protocols like I could with the old engine.
Is this under constant high load or does it happen under low load?
High load.
Does it happen with Flask (debug mode) as opposed to Gunicorn?
Since we're using GRPC, this doesn't actually affect anything (See https://github.com/SeldonIO/seldon-core/issues/3238)
Could this be something to investigate @mwm5945 #931
I doubt it, since the only thing my "empty" graph imports is opentracing (no other dependencies).
( @axsaucedo from our slack convo)
Unsurprisingly, performance "normalizes" when the number of replicas increases from 2 to 5 (at about ~250 TPS), which leads me to believe something is going on with Python behind the scenes (maybe the GIL?).
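The GIL hypothesis is easy to sanity-check outside Seldon: pure-Python, CPU-bound work doesn't parallelize across threads in a single process, so per-request latency grows roughly linearly with concurrent clients until more replicas (i.e. processes) are added. A minimal, self-contained sketch (the workload and timings here are illustrative, not a Seldon benchmark):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_work(n):
    # Pure-Python arithmetic holds the GIL, so concurrent calls in threads
    # end up time-slicing one core rather than running in parallel.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed_threads(workers, n=200_000):
    # Run `workers` copies of cpu_work concurrently and time the batch.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(cpu_work, [n] * workers))
    return time.perf_counter() - start, results

if __name__ == "__main__":
    t1, _ = timed_threads(1)
    t5, _ = timed_threads(5)
    # Expect t5 to be roughly 5x t1: five threads multiply per-request
    # latency instead of dividing it, mirroring the 1-client vs 5-client
    # numbers reported in this thread.
    print(f"1 thread: {t1:.3f}s, 5 threads: {t5:.3f}s")
```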
Does it happen also with REST rather than gRPC?
I've been able to test locally with a real SKLearn model (not just an empty one), and REST does perform significantly better than gRPC:

| | REST (default) | REST (10 workers, 1 thread) | gRPC (1 process, 10 threads) | gRPC (10 processes, 1 thread) |
|---|---|---|---|---|
| 1 client thread | 95ms | 98ms | 93ms | 93ms |
| 5 client threads | 836ms | 140ms | 863.83ms | 147.5ms |
I've created my own version of the Python microservice and implemented gRPC multiprocessing as done here: https://github.com/grpc/grpc/tree/master/examples/python/multiprocessing
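For context, the upstream gRPC multiprocessing example works by pre-forking server processes that all bind the same port with `SO_REUSEPORT`, letting the kernel spread incoming connections across them, so each request lands in a process with its own GIL. A stdlib-only sketch of just that binding trick (the port number is arbitrary; gRPC exposes the same knob as the `grpc.so_reuseport` channel option):

```python
import socket

def reuseport_listener(port):
    # With SO_REUSEPORT set before bind(), several processes can listen on
    # the same port and the kernel load-balances accepted connections among
    # them. This is what lets N pre-forked gRPC servers share one address.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen()
    return sock

if __name__ == "__main__":
    # Two listeners on the same port succeed only because of SO_REUSEPORT;
    # in the real setup each would live in its own forked worker process.
    a = reuseport_listener(54331)
    b = reuseport_listener(54331)
    print(a.getsockname(), b.getsockname())
    a.close()
    b.close()
```

Requires a platform with `SO_REUSEPORT` (Linux 3.9+, macOS).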
As per our discussion on Zoom, I'll re-create the above test using a mock model for replication on the Seldon side, and I'll include p* stats as well. (Cc: @axsaucedo @RafalSkolasinski )
Thank you @mwm5945 - as discussed, here are the conclusions from earlier today:
There are similar points reported on REST:
Based on these, the action points are:
Testing on a local machine via a Python 3.7 container, with a basic SKLearn model:
❯ ghz --proto prediction.proto -i . --insecure -c 1 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 15156
Total: 60.00 s
Slowest: 108.71 ms
Fastest: 2.57 ms
Average: 3.91 ms
Requests/sec: 252.60
Response time histogram:
2.569 [1] |
13.184 [15126] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
23.798 [20] |
34.413 [2] |
45.027 [2] |
55.641 [2] |
66.256 [0] |
76.870 [0] |
87.484 [1] |
98.099 [0] |
108.713 [1] |
Latency distribution:
10 % in 3.12 ms
25 % in 3.36 ms
50 % in 3.71 ms
75 % in 4.16 ms
90 % in 4.74 ms
95 % in 5.28 ms
99 % in 7.22 ms
Status code distribution:
[OK] 15155 responses
[Unavailable] 1 responses
Error distribution:
[1] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 5 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 13117
Total: 60.00 s
Slowest: 262.70 ms
Fastest: 4.72 ms
Average: 22.82 ms
Requests/sec: 218.62
Response time histogram:
4.718 [1] |
30.515 [11540] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
56.313 [1483] |∎∎∎∎∎
82.111 [58] |
107.909 [5] |
133.707 [7] |
159.505 [3] |
185.303 [4] |
211.101 [1] |
236.899 [5] |
262.696 [5] |
Latency distribution:
10 % in 13.57 ms
25 % in 17.72 ms
50 % in 21.95 ms
75 % in 26.56 ms
90 % in 31.36 ms
95 % in 34.80 ms
99 % in 47.87 ms
Status code distribution:
[Unavailable] 5 responses
[OK] 13112 responses
Error distribution:
[5] rpc error: code = Unavailable desc = transport is closing
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=1 | vegeta report -type=text
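A note on the command above: vegeta's JSON target format wants the request body base64-encoded, which is what the `| @base64` jq filter does. The same target can be built in Python (a sketch using the same URL and payload as the shell pipeline):

```python
import base64
import json

body = json.dumps({"data": {"ndarray": [[10, 100]]}})
target = {
    "method": "POST",
    "url": "http://localhost:5000/api/v1.0/predictions",
    # vegeta's -format=json expects the body as a base64 string
    "body": base64.b64encode(body.encode()).decode(),
    "header": {"Content-Type": ["application/json"]},
}
print(json.dumps(target))
```

Piping that output into `vegeta attack -format=json ...` is equivalent to the jq one-liner.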
Requests [total, rate, throughput] 7646, 127.43, 127.42
Duration [total, attack, wait] 1m0s, 1m0s, 7.531ms
Latencies [min, mean, 50, 90, 95, 99, max] 5.14ms, 7.845ms, 7.416ms, 8.665ms, 9.528ms, 13.803ms, 919.981ms
Bytes In [total, mean] 443468, 58.00
Bytes Out [total, mean] 237026, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:7646
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=10 | vegeta report -type=text
Requests [total, rate, throughput] 7515, 125.23, 125.16
Duration [total, attack, wait] 1m0s, 1m0s, 35.299ms
Latencies [min, mean, 50, 90, 95, 99, max] 12.396ms, 79.879ms, 76.715ms, 103.414ms, 115.258ms, 152.349ms, 246.422ms
Bytes In [total, mean] 435870, 58.00
Bytes Out [total, mean] 232965, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:7515
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=1 | vegeta report -type=text
Requests [total, rate, throughput] 8349, 139.13, 139.12
Duration [total, attack, wait] 1m0s, 1m0s, 7.65ms
Latencies [min, mean, 50, 90, 95, 99, max] 4.964ms, 7.185ms, 6.816ms, 8.378ms, 9.212ms, 12.838ms, 54.27ms
Bytes In [total, mean] 484242, 58.00
Bytes Out [total, mean] 258819, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:8349
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=5 | vegeta report -type=text
Requests [total, rate, throughput] 14290, 238.15, 238.09
Duration [total, attack, wait] 1m0s, 1m0s, 15.183ms
Latencies [min, mean, 50, 90, 95, 99, max] 6.602ms, 20.99ms, 17.953ms, 28.09ms, 32.741ms, 60.559ms, 997.28ms
Bytes In [total, mean] 828820, 58.00
Bytes Out [total, mean] 442990, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:14290
Error Set:
❯ ghz --proto prediction.proto -i . --insecure -c 1 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 12405
Total: 60.00 s
Slowest: 63.22 ms
Fastest: 3.09 ms
Average: 4.79 ms
Requests/sec: 206.75
Response time histogram:
3.090 [1] |
9.102 [12259] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
15.115 [79] |
21.128 [51] |
27.140 [9] |
33.153 [3] |
39.166 [0] |
45.178 [1] |
51.191 [0] |
57.204 [0] |
63.217 [1] |
Latency distribution:
10 % in 3.76 ms
25 % in 4.06 ms
50 % in 4.49 ms
75 % in 5.09 ms
90 % in 5.88 ms
95 % in 6.57 ms
99 % in 9.75 ms
Status code distribution:
[Unavailable] 1 responses
[OK] 12404 responses
Error distribution:
[1] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 5 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 10313
Total: 60.00 s
Slowest: 182.28 ms
Fastest: 6.22 ms
Average: 29.04 ms
Requests/sec: 171.88
Response time histogram:
6.223 [1] |
23.829 [2607] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
41.434 [6999] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
59.040 [634] |∎∎∎∎
76.645 [49] |
94.251 [7] |
111.856 [4] |
129.462 [1] |
147.067 [4] |
164.673 [1] |
182.278 [1] |
Latency distribution:
10 % in 19.52 ms
25 % in 23.76 ms
50 % in 28.26 ms
75 % in 33.19 ms
90 % in 38.85 ms
95 % in 43.57 ms
99 % in 56.12 ms
Status code distribution:
[OK] 10308 responses
[Unavailable] 5 responses
Error distribution:
[5] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 1 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 13514
Total: 60.00 s
Slowest: 125.90 ms
Fastest: 2.91 ms
Average: 4.39 ms
Requests/sec: 225.23
Response time histogram:
2.911 [1] |
15.211 [13491] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
27.510 [13] |
39.809 [3] |
52.108 [0] |
64.408 [1] |
76.707 [1] |
89.006 [2] |
101.306 [0] |
113.605 [0] |
125.904 [1] |
Latency distribution:
10 % in 3.57 ms
25 % in 3.92 ms
50 % in 4.26 ms
75 % in 4.65 ms
90 % in 5.11 ms
95 % in 5.47 ms
99 % in 7.00 ms
Status code distribution:
[OK] 13513 responses
[Unavailable] 1 responses
Error distribution:
[1] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 5 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 13735
Total: 60.00 s
Slowest: 134.07 ms
Fastest: 4.68 ms
Average: 21.79 ms
Requests/sec: 228.92
Response time histogram:
4.678 [1] |
17.617 [663] |∎∎
30.557 [12741] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
43.497 [242] |∎
56.436 [33] |
69.376 [24] |
82.316 [13] |
95.255 [6] |
108.195 [4] |
121.135 [2] |
134.074 [1] |
Latency distribution:
10 % in 18.24 ms
25 % in 19.69 ms
50 % in 21.08 ms
75 % in 22.81 ms
90 % in 25.15 ms
95 % in 27.45 ms
99 % in 36.96 ms
Status code distribution:
[OK] 13730 responses
[Unavailable] 5 responses
Error distribution:
[5] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 5 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=5 localhost:5000
Summary:
Count: 22896
Total: 60.00 s
Slowest: 55.57 ms
Fastest: 5.36 ms
Average: 13.03 ms
Requests/sec: 381.60
Response time histogram:
5.362 [1] |
10.383 [4107] |∎∎∎∎∎∎∎∎∎∎∎
15.404 [14815] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
20.425 [3291] |∎∎∎∎∎∎∎∎∎
25.447 [486] |∎
30.468 [112] |
35.489 [38] |
40.510 [22] |
45.531 [11] |
50.552 [4] |
55.574 [4] |
Latency distribution:
10 % in 9.61 ms
25 % in 10.92 ms
50 % in 12.49 ms
75 % in 14.41 ms
90 % in 16.91 ms
95 % in 18.86 ms
99 % in 24.67 ms
Status code distribution:
[OK] 22891 responses
[Unavailable] 5 responses
Error distribution:
[5] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 10 -z 60s -D payload2.json --call seldon.protos.Seldon.Predict --connections=10 localhost:5000
Summary:
Count: 22931
Total: 60.00 s
Slowest: 394.52 ms
Fastest: 6.53 ms
Average: 26.09 ms
Requests/sec: 382.18
Response time histogram:
6.525 [1] |
45.324 [20152] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
84.124 [2645] |∎∎∎∎∎
122.923 [111] |
161.722 [3] |
200.521 [1] |
239.321 [0] |
278.120 [3] |
316.919 [2] |
355.719 [1] |
394.518 [2] |
Latency distribution:
10 % in 15.05 ms
25 % in 17.61 ms
50 % in 20.93 ms
75 % in 26.42 ms
90 % in 51.65 ms
95 % in 62.43 ms
99 % in 78.89 ms
Status code distribution:
[OK] 22921 responses
[Unavailable] 10 responses
Error distribution:
[10] rpc error: code = Unavailable desc = transport is closing
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=1 | vegeta report -type=text
Requests [total, rate, throughput] 11183, 186.38, 186.36
Duration [total, attack, wait] 1m0s, 1m0s, 5.486ms
Latencies [min, mean, 50, 90, 95, 99, max] 3.319ms, 5.364ms, 5.09ms, 6.513ms, 7.297ms, 10.422ms, 253.062ms
Bytes In [total, mean] 648614, 58.00
Bytes Out [total, mean] 346673, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:11183
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=5 | vegeta report -type=text
Requests [total, rate, throughput] 9026, 150.42, 150.38
Duration [total, attack, wait] 1m0s, 1m0s, 16.055ms
Latencies [min, mean, 50, 90, 95, 99, max] 6.069ms, 33.245ms, 32.001ms, 42.656ms, 46.568ms, 55.496ms, 402.204ms
Bytes In [total, mean] 523508, 58.00
Bytes Out [total, mean] 279806, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:9026
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=1 | vegeta report -type=text
Requests [total, rate, throughput] 6973, 116.20, 116.19
Duration [total, attack, wait] 1m0s, 1m0s, 8.144ms
Latencies [min, mean, 50, 90, 95, 99, max] 5.607ms, 8.603ms, 8.231ms, 9.7ms, 10.404ms, 12.848ms, 385.654ms
Bytes In [total, mean] 404434, 58.00
Bytes Out [total, mean] 216163, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:6973
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=5 | vegeta report -type=text
Requests [total, rate, throughput] 12283, 204.43, 204.40
Duration [total, attack, wait] 1m0s, 1m0s, 10.642ms
Latencies [min, mean, 50, 90, 95, 99, max] 7.442ms, 24.451ms, 21.396ms, 31.013ms, 35.486ms, 69.883ms, 1.024s
Bytes In [total, mean] 712414, 58.00
Bytes Out [total, mean] 380773, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:12283
Error Set:
❯ jq -ncM '{method: "POST", url: "http://localhost:5000/api/v1.0/predictions", body: "{\"data\":{\"ndarray\":[[10,100]]}}" | @base64, header: {"Content-Type": ["application/json"]}}' |
vegeta attack -format=json -duration=60s -rate=0 -max-workers=10 | vegeta report -type=text
Requests [total, rate, throughput] 11764, 195.93, 195.89
Duration [total, attack, wait] 1m0s, 1m0s, 10.477ms
Latencies [min, mean, 50, 90, 95, 99, max] 10.477ms, 51.03ms, 45.36ms, 64.456ms, 76.103ms, 173.435ms, 1.055s
Bytes In [total, mean] 682312, 58.00
Bytes Out [total, mean] 364684, 31.00
Success [ratio] 100.00%
Status Codes [code:count] 200:11764
Error Set:
ModelMain.py:
```python
from collections.abc import Iterable
import sys
import logging

import joblib
import numpy

logger = logging.getLogger(__name__)
handler = logging.StreamHandler(sys.stdout)
formatter = logging.Formatter('[PID %(process)d - TID %(thread)d] %(lineno)d %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.INFO)


class SKR(object):
    def __init__(self):
        logger.info("Initializing model")
        self.class_names = ["random"]  # model output labels
        self.model = joblib.load("../model/scikit_random_model.sav")

    def predict(self, feature_values: numpy.ndarray, feature_names: Iterable):
        logger.info("Predict called")
        # slice and keep only the second column (positive-class probabilities)
        prediction = self.model.predict_proba(feature_values)[:, 1]
        logger.info("Returning prediction: %s", prediction)
        return numpy.array([prediction])  # must return at least a 2-D array
```
requires scikit-learn==0.21.3
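For clarity on the slicing in `predict`: `predict_proba` returns one probability per class per input row, `[:, 1]` keeps the positive-class column, and wrapping the result makes it the 2-D array the wrapper requires. Illustrated with a mocked `predict_proba` output for two rows and two classes:

```python
import numpy

# A mocked predict_proba result: two input rows, two classes each.
proba = numpy.array([[0.2, 0.8],
                     [0.9, 0.1]])
prediction = proba[:, 1]         # positive-class probability per row
out = numpy.array([prediction])  # wrap so the return value is 2-D
print(out.shape, out)
```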
I've had to email you the zip file containing the model.
This is absolutely fantastic @mwm5945 !! Thank you for such a detailed contribution with benchmark results and the model class - this will be really helpful as we continue diving into this optimization and benchmarking work 🚀
Per @cliveseldon's suggestion, testing using tensors (default config of Seldon 1.3):
❯ ghz --proto prediction.proto -i . --insecure -c 1 -z 60s -D payload3.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 10739
Total: 60.00 s
Slowest: 392.76 ms
Fastest: 2.97 ms
Average: 5.53 ms
Requests/sec: 178.98
Response time histogram:
2.972 [1] |
41.951 [10733] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
80.930 [1] |
119.909 [0] |
158.888 [1] |
197.867 [0] |
236.846 [0] |
275.825 [0] |
314.804 [1] |
353.783 [0] |
392.762 [1] |
Latency distribution:
10 % in 4.07 ms
25 % in 4.50 ms
50 % in 5.03 ms
75 % in 5.82 ms
90 % in 7.12 ms
95 % in 8.39 ms
99 % in 12.97 ms
Status code distribution:
[OK] 10738 responses
[Unavailable] 1 responses
Error distribution:
[1] rpc error: code = Unavailable desc = transport is closing
❯ ghz --proto prediction.proto -i . --insecure -c 5 -z 60s -D payload3.json --call seldon.protos.Seldon.Predict --connections=1 localhost:5000
Summary:
Count: 8992
Total: 60.00 s
Slowest: 1.05 s
Fastest: 9.71 ms
Average: 33.30 ms
Requests/sec: 149.86
Response time histogram:
9.712 [1] |
113.526 [8969] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
217.341 [10] |
321.155 [0] |
424.969 [5] |
528.783 [0] |
632.597 [0] |
736.411 [0] |
840.225 [0] |
944.039 [0] |
1047.853 [2] |
Latency distribution:
10 % in 22.48 ms
25 % in 26.82 ms
50 % in 31.45 ms
75 % in 37.42 ms
90 % in 44.23 ms
95 % in 50.62 ms
99 % in 66.55 ms
Status code distribution:
[Unavailable] 5 responses
[OK] 8987 responses
Error distribution:
[5] rpc error: code = Unavailable desc = transport is closing
This is likely solved with the multiprocessing/GRPC update
Seldon Version: 1.3.0 Using executor: Yes Protocol: GRPC throughout
We have a graph with three paths that terminate in a combiner. The first two paths are just H2O MOJO models using the Java wrapper (non-JNI). The third path is a Python input transformer and model; the combiner is, of course, also Python. The H2O models are reliably fast, usually under 1ms, but the Python models can spike in latency from time to time. The Python nodes are fairly simple: they filter out some features and run an internally developed model (which usually takes ~7ms), and the combiner concatenates the paths' outputs and returns them.
Overall, the graph's p75 latency is 10ms, but its p99 is ~300ms. After replacing the Python nodes with stubs (see below for the class), a node can still take around 13ms to execute, even when it isn't doing anything.
I've captured some DataDog traces (see #2437 :D), but GitHub is having some trouble uploading them, so I'll see if I can get those up later.
Stubs:
@axsaucedo @adriangonz for visibility from Slack