NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
We want to benchmark session-based (transformer-based) architectures with respect to speed-up, cost, inference latency, etc., to provide guidance to our community.
Goal:
Provide guidance to our community about the computational performance and costs of transformer-based models for training and inference.
Starting Point:
Let's start with inference.
Background
[ ] Define experiments: which dataset, which architecture, which hyperparameters (e.g. sequence length)
Inference
What questions do we want to answer:
What is the throughput of a transformer-based model (requests/s served)?
What is the latency (p50, p90, p99)?
What is the cost per request at maximal utilization?
for the following environments:
CPU and GPUs (T4, A10, V100, A100)
on-prem (without network) and cloud (including network)
different model architectures (e.g. sequence length, embedding width, number of heads, etc.)
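For the cost question, cost per request follows directly from sustained throughput and the instance's hourly price. A minimal sketch of that arithmetic; the price and throughput numbers below are placeholders, not measured values:

```python
def cost_per_million_requests(hourly_price_usd: float, throughput_rps: float) -> float:
    """Cost of serving one million requests at full, sustained utilization.

    hourly_price_usd: on-demand price of the instance (placeholder values below).
    throughput_rps:   measured requests served per second at maximal utilization.
    """
    requests_per_hour = throughput_rps * 3600
    return hourly_price_usd / requests_per_hour * 1_000_000

# Example with placeholder numbers: a $0.526/h instance serving 1000 req/s
print(round(cost_per_million_requests(0.526, 1000), 4))  # ~0.1461 USD per 1M requests
```

The same formula applies per environment: measure throughput at maximal utilization on each instance type, then divide by that instance's hourly price to compare CPU and GPU options on cost rather than raw speed.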
Transformers4Rec (PyTorch)
[x] Benchmark inference of the Transformers4Rec model without NVTabular (Python model), like this example. Ticket: https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/610
[ ] Benchmark inference of the Transformers4Rec model without NVTabular (TorchScript model), like this example
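A minimal timing harness that could be shared by the Python-model and TorchScript-model benchmarks might look like the sketch below. The model call is stubbed with a placeholder callable, since wiring up the actual Transformers4Rec models is the subject of the tickets above; warmup runs are discarded so one-off costs (e.g. CUDA context creation, JIT optimization passes) don't skew the latency numbers:

```python
import time

def benchmark(infer, n_warmup: int = 10, n_runs: int = 100):
    """Time `infer()` and return per-call latencies in milliseconds.

    `infer` is a placeholder for the real model call (a Python or
    TorchScript forward pass); the first n_warmup runs are discarded.
    """
    for _ in range(n_warmup):
        infer()
    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        infer()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms

# Stub standing in for a model forward pass
lat = benchmark(lambda: sum(range(1000)))
print(f"mean latency: {sum(lat) / len(lat):.3f} ms over {len(lat)} runs")
```

Note that for GPU runs the real harness would also need to synchronize the device before reading the clock, otherwise only the kernel-launch time is measured.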
Merlin Models (TensorFlow)
[ ] Benchmark inference for the REES46 eCommerce dataset
Training: TBD
We should use JMeter for load testing.
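JMeter writes per-request results to a .jtl file, which in its default CSV form includes an `elapsed` column in milliseconds; the p50/p90/p99 figures above can be aggregated from it in post-processing. A rough sketch (the file name is a placeholder):

```python
import csv
import math

def percentile(values, p):
    """Nearest-rank percentile (p in [0, 100]) of a list of numbers."""
    s = sorted(values)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def summarize(jtl_path):
    """Read JMeter CSV results and return latency percentiles in ms."""
    with open(jtl_path, newline="") as f:
        elapsed = [int(row["elapsed"]) for row in csv.DictReader(f)]
    return {p: percentile(elapsed, p) for p in (50, 90, 99)}

# e.g. summarize("results.jtl") -> {50: ..., 90: ..., 99: ...}
```

This keeps the load generation (JMeter) and the reporting step decoupled, so the same summary script works across the CPU/GPU and on-prem/cloud runs.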