Open EvenOldridge opened 2 years ago
Some development effort was done in hack week.
@nv-alaiacano please define this ticket for 22.12
Still good to define it a bit, but I think this issue is likely to get pushed to 23.01 or later
@nv-alaiacano - the 3 big development efforts need to be broken down into RMP tickets
@nv-alaiacano and @karlhigley to come up with the minimum set of features for 23.04 completion
Problem:
Production recommender systems require logging. Without logging, metrics, etc., it's hard to know what your recommender is doing and to troubleshoot it.
Goal:
Answer questions like:
New Functionality
Constraints:
The NV standard is to use OpenTelemetry, which provides most of the infrastructure.
Ability to record:
We should limit our work to exposing information from inside the Merlin Systems DAG. For example, it's currently possible for a user to measure the latency of requests to Triton, but the ensemble is a "black box": we only know how long the entire thing takes to execute. With this work, we should expose how long each component of the DAG takes, so that someone can see how long TransformWorkflow, PredictPytorch, QueryFeast, etc. each take within a single request.
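To make the "black box" point concrete, here is a minimal sketch of a per-operator latency breakdown for one request, using only the Python standard library. `StepTimer`, `run_ensemble`, and the two toy operator functions are hypothetical stand-ins for illustration, not Merlin Systems APIs; a real implementation would presumably emit OpenTelemetry spans or metrics instead of filling a dict.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Collects wall-clock latency (in ms) per named step of one request."""

    def __init__(self):
        self.latencies_ms = {}

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the step raises.
            self.latencies_ms[name] = (time.perf_counter() - start) * 1000.0


# Hypothetical stand-ins for DAG operators (not the real Merlin operators).
def transform_workflow(values):
    return [v * 2 for v in values]


def predict_pytorch(values):
    return [v + 1 for v in values]


def run_ensemble(request, timer):
    """Runs the operators in order, timing each one and the whole ensemble."""
    steps = [
        ("TransformWorkflow", transform_workflow),
        ("PredictPytorch", predict_pytorch),
    ]
    value = request
    with timer.step("ensemble"):
        for name, op in steps:
            with timer.step(name):
                value = op(value)
    return value


timer = StepTimer()
result = run_ensemble([1, 2, 3], timer)
for name, ms in timer.latencies_ms.items():
    print(f"{name}: {ms:.3f} ms")
```

Today only the outer "ensemble" number is visible to a user; the goal described above is to expose the inner entries as well.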
Starting Point:
Provide examples that demonstrate how to use OpenTelemetry to handle logs, metrics, and traces:
Integration with Triton Tracing
Triton supports OpenTelemetry tracing starting in 23.04, and the Triton team is adding the ability to trace BLS models. Coordinate with them to ensure that the functionality we are building is supported.