NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
Apache License 2.0

[RMP] Improve Systems observability w/ Open Telemetry integration (logging, metrics?, traces?) #642

Open EvenOldridge opened 1 year ago

EvenOldridge commented 1 year ago

Problem:

Production recommender systems require observability. Without logs, metrics, and traces, it's hard to know what your recommender is doing or to troubleshoot it.

Goal:

Answer questions like:

New Functionality

Constraints:

The NVIDIA standard is to use OpenTelemetry, which provides most of the infrastructure.

Ability to record:

We should limit our work to exposing information from inside the Merlin Systems DAG. For example, it's currently possible for a user to measure the latency of requests to Triton, but the ensemble is a "black box": we only know how long it takes to execute the entire thing. With this work, we should expose how long each component of the DAG takes, so that someone can see, for example, how long TransformWorkflow, PredictPytorch, QueryFeast, etc. each take within a single request.
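As a rough illustration of per-component timing, here is a minimal sketch. It is not the Merlin Systems implementation: `timed_step`, `run_dag`, and the three lambda steps are hypothetical stand-ins named after the operators mentioned above. It assumes each DAG component can be wrapped in a context manager. If the `opentelemetry` API package is installed, each step is also wrapped in a span via `tracer.start_as_current_span`; otherwise only wall-clock timings are recorded.

```python
import time
from contextlib import contextmanager, nullcontext

# OpenTelemetry is optional in this sketch; fall back to plain timing without it.
try:
    from opentelemetry import trace
    _tracer = trace.get_tracer("merlin_dag_sketch")  # tracer name is illustrative
except ImportError:
    _tracer = None


@contextmanager
def timed_step(name, timings):
    """Record how long one DAG component takes, and open a span if a tracer exists."""
    span_cm = _tracer.start_as_current_span(name) if _tracer else nullcontext()
    start = time.perf_counter()
    with span_cm:
        try:
            yield
        finally:
            timings[name] = time.perf_counter() - start


def run_dag(steps, request):
    """Run (name, callable) steps in order; return the final value and per-step timings."""
    timings = {}
    value = request
    for name, fn in steps:
        with timed_step(name, timings):
            value = fn(value)
    return value, timings


# Hypothetical stand-ins for real operators; the real ones would call Feast,
# an NVTabular workflow, and a PyTorch model respectively.
steps = [
    ("QueryFeast", lambda req: {**req, "features": [1.0, 2.0]}),
    ("TransformWorkflow", lambda req: {**req, "features": [x * 2 for x in req["features"]]}),
    ("PredictPytorch", lambda req: {**req, "score": sum(req["features"])}),
]

result, timings = run_dag(steps, {"user_id": 42})
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.3f} ms")
```

With an OpenTelemetry SDK and exporter configured, the spans opened here would nest under whatever span is current for the request, which is the shape needed to break the ensemble's total latency down by component.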

Starting Point:

Provide examples that demonstrate how to use OpenTelemetry to handle logs, metrics, and traces:

Integration with Triton Tracing

Triton supports OpenTelemetry tracing starting in 23.04, and the Triton team is adding the ability to trace BLS models. Coordinate with them to ensure that the functionality we are building is supported.


viswa-nvidia commented 1 year ago

Some development effort was done in hack week.

viswa-nvidia commented 1 year ago

@nv-alaiacano please define this ticket for 22.12

karlhigley commented 1 year ago

Still good to define it a bit, but I think this issue is likely to get pushed to 23.01 or later

viswa-nvidia commented 1 year ago

@nv-alaiacano - 3 big development efforts need to be broken down into RMP tickets

viswa-nvidia commented 1 year ago

@nv-alaiacano and @karlhigley to come up with the minimum set of features for 23.04 completion