ganler / ResearchReading

General system research material (not limited to paper) reading notes.

OSDI'20 | Serving DNNs like Clockwork: Performance Predictability from the Bottom Up #34

Closed ganler closed 3 years ago

ganler commented 3 years ago

Paper: https://www.usenix.org/conference/osdi20/presentation/gujarati. This will be presented at OSDI today.

ganler commented 3 years ago

Motivation: Existing model serving architectures use well-known reactive techniques to alleviate common-case sources of latency, but cannot effectively curtail tail latency caused by unpredictable execution times.

Target: Make request handling highly predictable (i.e., requests should meet their service-level objectives (SLOs) strictly).

ganler commented 3 years ago

What's different: Existing systems do not take sharing into account (i.e., one model occupies the whole GPU). When we serve many models and each has a low request rate, we cannot afford to let every model hold GPU memory while doing nothing.
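One way to picture the alternative (a minimal sketch in PyTorch; the class and method names are mine, not Clockwork's or Triton's API): keep cold models in host memory and move a model onto the GPU only when a request for it arrives, so many low-rate models can share one accelerator.

```python
from typing import Dict, Optional

import torch
import torch.nn as nn


class SwappingModelPool:
    """Toy pool: models live in host RAM; only the requested one sits on the GPU.

    Hypothetical sketch, not Clockwork's (or Triton's) actual mechanism.
    """

    def __init__(self, models: Dict[str, nn.Module], device: str = "cuda"):
        self.models = {name: m.eval().cpu() for name, m in models.items()}
        self.device = device
        self.resident: Optional[str] = None  # model currently loaded on the GPU

    def _ensure_resident(self, name: str) -> nn.Module:
        if self.resident != name:
            if self.resident is not None:
                self.models[self.resident].cpu()   # evict the idle model
            self.models[name].to(self.device)      # load the requested model on demand
            self.resident = name
        return self.models[name]

    @torch.no_grad()
    def infer(self, name: str, x: torch.Tensor) -> torch.Tensor:
        model = self._ensure_resident(name)
        return model(x.to(self.device))
```

Of course, the swap itself costs time (weights must cross PCIe), which is exactly the kind of cost a predictability-oriented scheduler has to estimate and account for.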

Efficiently serving models with low request rates requires a large number of models to share accelerators; no existing model serving system supports this.

But wait, you can have concurrent instances using Triton... https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html


ganler commented 3 years ago

How does Clockwork solve this problem?

Big Idea: If we estimate (based on current conditions) that a worker cannot meet the given SLO for a request, that request will not be allowed to execute on the worker.
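A rough sketch of that admission rule (hypothetical names; the real Clockwork predicts per-model execution times and schedules across many workers, while here a single worker and a fixed latency estimate per model are assumed):

```python
from typing import Dict


class PredictiveAdmissionController:
    """Reject a request up front if its predicted finish time would miss the SLO.

    Hypothetical sketch of the idea, not Clockwork's actual scheduler.
    """

    def __init__(self, predicted_latency_s: Dict[str, float]):
        self.predicted_latency_s = predicted_latency_s  # model name -> estimated exec time (s)
        self.busy_until = 0.0                           # when the single worker becomes free

    def try_admit(self, model: str, arrival_s: float, slo_s: float) -> bool:
        start = max(arrival_s, self.busy_until)
        finish = start + self.predicted_latency_s[model]
        if finish - arrival_s > slo_s:
            return False              # predicted to miss the deadline -> reject early
        self.busy_until = finish      # otherwise reserve the worker for this request
        return True


# e.g., a 10 ms ResNet-50 request fits a 15 ms SLO, but not a 5 ms SLO
ctrl = PredictiveAdmissionController({"resnet50": 0.010})
print(ctrl.try_admit("resnet50", arrival_s=0.0, slo_s=0.015))  # True
print(ctrl.try_admit("resnet50", arrival_s=0.0, slo_s=0.005))  # False
```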

Operation Space:

Summary

Clockwork focuses on enabling reliable SLOs. However, this may hurt performance, since many optimization techniques would break predictability...

Anti-Perf

As stated in the Clockwork paper:

To improve predictability, Clockwork disables JIT compilation and the caching of CUDA kernels.

Other optimizations are restricted as well: batching, XLA, etc.

There are also other sources of unpredictability. Say the input images have variable sizes...

Then the inference time may not be deterministic.

And I think there are some claims that I don't agree with:

Conceptually, a DNN inference is a fully deterministic execution. Each DNN inference request carries a fixed-size input tensor argument;

No, many models can accept tensors of variable shapes. For example, ResNets use a global average pooling (GAP) layer whose output is independent of the feature map's spatial size, so the input resolution can vary.
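A quick way to check this (using torchvision's ResNet-18 purely as an example): the adaptive/global average pooling before the classifier collapses any spatial size to 1x1, so differently sized inputs produce the same output shape, while their inference times differ.

```python
import time

import torch
from torchvision.models import resnet18

model = resnet18().eval()

with torch.no_grad():
    for size in (224, 320, 512):                 # different input resolutions
        x = torch.randn(1, 3, size, size)
        start = time.perf_counter()
        y = model(x)                             # GAP makes the classifier head shape-agnostic
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"input {size}x{size} -> output {tuple(y.shape)} in {elapsed_ms:.1f} ms")
```

Same weights, same output shape, different latency per input size; so the "fully deterministic execution" claim only holds if the serving system pins the input shape.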