NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[FEA] Benchmark Suite for Inference #817

Open oyilmaz-nvidia opened 3 years ago

oyilmaz-nvidia commented 3 years ago

We need to create a benchmark suite to measure the running time of inference queries for different input data shapes and models. This benchmark suite will help us spot the bottlenecks in the whole inference pipeline. At a minimum, the suite should cover running-time measurements for the points below;
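As a starting point, a minimal harness along these lines could time inference calls across several input shapes. This is only a sketch: `run_inference` is a placeholder for the real query (e.g. a Triton client request), not an NVTabular API.

```python
import time
import statistics

def run_inference(batch):
    # Placeholder for the real inference call (e.g. a Triton client request).
    # Here it just simulates work proportional to the batch size.
    return sum(range(len(batch)))

def benchmark(shapes, repeats=20):
    """Return median per-request latency in ms, keyed by input shape."""
    results = {}
    for n_rows in shapes:
        batch = list(range(n_rows))
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_inference(batch)
            timings.append((time.perf_counter() - start) * 1000.0)
        results[n_rows] = statistics.median(timings)
    return results

if __name__ == "__main__":
    for shape, ms in benchmark([1, 64, 4096]).items():
        print(f"batch={shape}: median latency {ms:.3f} ms")
```

Swapping `run_inference` for an actual Triton/NVTabular query would give per-shape latency numbers for the pipeline under test.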

vinhngx commented 3 years ago

+1 I've observed that the NVTab+HugeCTR Triton ensemble is significantly slower than querying the HugeCTR Triton backend directly (using offline NVTab-preprocessed data, but doing the CSR conversion live): 700ms vs. 40ms. It would be good to have an idea of how long each phase of the ensemble inference takes.

shashank3959 commented 3 years ago

+1 I have the same observation as @vinhngx

benfred commented 3 years ago

@shashank3959 @vinhngx Agreed - the inference latency is unacceptably bad in v0.5. Improving this is a major focus of v0.6.

I've added some benchmarking scripts here using Triton's perf_analyzer: https://github.com/NVIDIA/NVTabular/pull/868. There are some graphs of times for the rossmann/movielens datasets there, including some of the speedups we're expecting in 0.6 (e.g. rossmann should go from ~150ms down to around ~2ms per request).
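For reference, a perf_analyzer run against a deployed model generally looks something like the command assembled below. The model name and endpoint are placeholders, not taken from the linked PR; the flags are standard perf_analyzer options.

```python
# Sketch of a perf_analyzer invocation for latency measurement.
# The model name and server URL are hypothetical placeholders.
model = "movielens_ens"           # hypothetical ensemble model name
cmd = [
    "perf_analyzer",
    "-m", model,                  # model (or ensemble) to benchmark
    "-u", "localhost:8000",       # Triton HTTP endpoint
    "--concurrency-range", "1:4", # sweep client concurrency from 1 to 4
    "--percentile=95",            # report p95 latency rather than the average
]
print(" ".join(cmd))
```

Running the ensemble and each of its member models separately under the same settings is one way to see how the total latency breaks down across phases.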

When you're seeing 700ms latency, what model is this? Also, is this on the first request? I was seeing a case with the movielens example where the first request was > 1s but subsequent requests were on the order of 22ms.
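One way to separate that cold-start cost from steady-state latency is to record the first request's timing apart from the rest. Again this is a sketch: `query` is a stub standing in for the real client call, with the warm-up cost simulated.

```python
import time
import statistics

def query(payload):
    # Stub for the real Triton request; the first call simulates one-time
    # initialization (e.g. lazy model/workflow loading on the server).
    if not getattr(query, "warm", False):
        query.warm = True
        time.sleep(0.05)  # pretend cold-start cost
    return payload

def first_vs_steady(n=10):
    """Return (first-request latency, median of subsequent latencies) in ms."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        query("request")
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies[0], statistics.median(latencies[1:])

if __name__ == "__main__":
    first, steady = first_vs_steady()
    print(f"first request: {first:.1f} ms, steady-state median: {steady:.1f} ms")
```

Reporting both numbers avoids conflating a one-time warm-up with the per-request latency users actually see.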

vinhngx commented 3 years ago

This is a great improvement. The 700ms inference observed is for the MovieLens demo. HugeCTR inference time is ~7ms.

We shall verify this at our end with the MovieLens demo. @shashank3959

vinhngx commented 3 years ago

Verified that with Merlin v0.6 image, inference time of our pipeline is reduced from ~700ms to ~40ms, which is a huge improvement :)

@shashank3959 @benfred FYI

vinhngx commented 3 years ago

As for the model, you can find an e2e notebook here with inference at the end. Basically, it's a full-sized DLRM model on MovieLens data, with 23 categorical and 171 continuous engineered features.

https://gitlab-master.nvidia.com/vinhn/movie-recsys/-/blob/hugectr/RecSys/notebooks/hugectr/hugeCTR-train-and-inference-v0.6.ipynb