NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
We want to benchmark session-based (transformer-based) architectures with respect to speed-up, cost, inference latency, etc., to provide guidance to our community.
Goal:
Provide guidance to our community about the computational performance and costs of transformer-based models for training and inference.
Starting Point:
Let's start with inference.
Background
[ ] Define experiments: which dataset, which architecture, which hyperparameters (e.g. sequence length)
Inference
What questions do we want to answer:
What is the throughput of a transformer-based model (requests/s served)?
What is the latency (p50, p90, p99)?
What is the cost per request at maximal utilization?
for the following environments:
CPU and GPUs (T4, A10, V100, A100)
on-prem (without network) and cloud (including network)
different model architectures (e.g. sequence length, embedding width, number of heads, etc.)
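For the cost question, cost per request follows directly from sustained throughput and the instance's hourly price. A minimal sketch of that arithmetic; the price and throughput numbers below are placeholders, not measured values:

```python
def cost_per_million_requests(hourly_price_usd: float, throughput_rps: float) -> float:
    """Cost of serving one million requests at full, sustained utilization.

    hourly_price_usd: on-demand price of the instance (placeholder values below).
    throughput_rps:   measured requests served per second at maximal utilization.
    """
    requests_per_hour = throughput_rps * 3600
    return hourly_price_usd / requests_per_hour * 1_000_000

# Example with placeholder numbers: a $0.526/h instance serving 1000 req/s
print(round(cost_per_million_requests(0.526, 1000), 4))  # ~0.1461 USD per 1M requests
```

The same formula applies per environment: measure throughput at maximal utilization on each instance type, then divide by that instance's hourly price to compare CPU and GPU options on cost rather than raw speed.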
Transformers4Rec (PyTorch)
[x] Benchmark inference of the Transformers4Rec model without NVTabular (Python model), like this example. Ticket: https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/610
[ ] Benchmark inference of the Transformers4Rec model without NVTabular (TorchScript model), like this example
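A minimal timing harness that could be shared by the Python-model and TorchScript-model benchmarks might look like the sketch below. The model call is stubbed with a placeholder callable, since wiring up the actual Transformers4Rec models is the subject of the tickets above; warmup runs are discarded so one-off costs (e.g. CUDA context creation, JIT optimization passes) don't skew the latency numbers:

```python
import time

def benchmark(infer, n_warmup: int = 10, n_runs: int = 100):
    """Time `infer()` and return per-call latencies in milliseconds.

    `infer` is a placeholder for the real model call (a Python or
    TorchScript forward pass); the first n_warmup runs are discarded.
    """
    for _ in range(n_warmup):
        infer()
    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        infer()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms

# Stub standing in for a model forward pass
lat = benchmark(lambda: sum(range(1000)))
print(f"mean latency: {sum(lat) / len(lat):.3f} ms over {len(lat)} runs")
```

Note that for GPU runs the real harness would also need to synchronize the device before reading the clock, otherwise only the kernel-launch time is measured.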
Merlin Models (TensorFlow)
[ ] Benchmark inference for the REES46 eCommerce dataset
Training: TBD
We should use JMeter for load testing.
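JMeter writes per-request results to a .jtl file, which in its default CSV form includes an `elapsed` column in milliseconds; the p50/p90/p99 figures above can be aggregated from it in post-processing. A rough sketch (the file name is a placeholder):

```python
import csv
import math

def percentile(values, p):
    """Nearest-rank percentile (p in [0, 100]) of a list of numbers."""
    s = sorted(values)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def summarize(jtl_path):
    """Read JMeter CSV results and return latency percentiles in ms."""
    with open(jtl_path, newline="") as f:
        elapsed = [int(row["elapsed"]) for row in csv.DictReader(f)]
    return {p: percentile(elapsed, p) for p in (50, 90, 99)}

# e.g. summarize("results.jtl") -> {50: ..., 90: ..., 99: ...}
```

This keeps the load generation (JMeter) and the reporting step decoupled, so the same summary script works across the CPU/GPU and on-prem/cloud runs.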