NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for training Click-Through-Rate (CTR) estimation models
Apache License 2.0

[Question] How to add new models to HPS configuration when using Model Control Mode EXPLICIT? #451

Closed dmac closed 3 months ago

dmac commented 3 months ago

My company runs a Triton fleet that serves multiple models which are continuously retrained and reloaded onto Triton using Model Control Mode EXPLICIT. These are organized into a few different model types, but our convention is that every instance of a model is given a unique name and we don't use the Triton notion of versions.

For example, we have model A training on some cadence and model B training on some cadence, and the latest version of each is loaded on Triton as soon as it's done training. A specific instance of a model is named like "A.1", and when the next iteration of that model is trained, it will be named "A.2", etc. We might also have multiple versions for each type of model loaded onto Triton to support clean transitions from one version of a model to the next. So, at a given moment in time, we might have the following models loaded: A.10, A.11, B.75.

In this example, say we introduce a brand-new type of model called C. We are also able to retire model A.10 because it's no longer receiving requests. So at another moment in time we might have these models loaded: A.11, B.75, C.1.
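
For context, in EXPLICIT mode we drive these transitions through Triton's model repository API. A minimal sketch with the Triton Python HTTP client, using hypothetical model names and endpoint:

import tritonclient.http as httpclient

# Hypothetical endpoint; EXPLICIT mode lets us load/unload models by name.
client = httpclient.InferenceServerClient(url="localhost:8000")
client.load_model("A.11")    # newest iteration of model A comes online
client.load_model("C.1")     # brand-new model type C
client.unload_model("A.10")  # retire the old iteration once traffic drains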

I'm investigating whether it's possible to use the HPS plugin for TensorRT with our existing architecture, without needing to restart Triton whenever our list of active models changes. My understanding is that when we start Triton, we supply a JSON configuration file like:

--backend-config='hps,ps=/path/to/hps.json'

And this file contains the configuration for every model, like:

{
  "supportlonglong": false,
  "models": [{
    "model": "A.10",
    ...
  },{
    "model": "A.11",
    ...
  },{
    "model": "B.75",
    ...
  }]
}
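
For what it's worth, here is a hypothetical fuller model entry, with field names taken from the HPS documentation and placeholder values; our real entries differ:

{
  "model": "A.11",
  "sparse_files": ["/models/A.11/sparse_0.model"],
  "embedding_table_names": ["table0"],
  "embedding_vecsize_per_table": [16],
  "maxnum_catfeature_query_per_table_per_sample": [26],
  "default_value_for_each_table": [0.0],
  "deployed_device_list": [0],
  "max_batch_size": 64,
  "gpucache": true,
  "gpucacheper": 0.5
}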

I'm able to explicitly load TRT models that use the HPS plugin, but if I update the JSON file in place and attempt to make inferences on a new model named C.1, Triton reports an error:

2024/06/12 20:21:43 [HCTR][20:21:43.298][ERROR][RK0][tid #139986607271936]: Cannot find the model C.1 in HPS

Is there a way to supply new model configurations once Triton is already running, and to remove configurations that are no longer relevant? I was optimistic after reading this, which mentions "adding the configuration of a new model to the HPS configuration file," but I'm not sure how to do that.

Thanks!

yingcanw commented 3 months ago

@dmac If I understand correctly, you are using the Triton TRT backend to deploy the model with the HPS TRT plugin. If so, the online update feature is only supported in the HPS Triton backend, because the HPS configuration is loaded only once, when HPS is initialized. To support online HPS configuration updates, the TRT backend would need to re-parse the latest configuration, similar to the logic here

dmac commented 3 months ago

If I understand correctly, you are using the Triton TRT backend to deploy the model with the HPS TRT plugin.

Correct.

To support online HPS configuration updates, the TRT backend would need to re-parse the latest configuration, similar to the logic here

I see, thanks. Do you know if that feature is planned or has been discussed before? Or is this question better directed at the Triton/TensorRT issues page?

Have you seen anyone using a dynamic, constantly changing HPS configuration like this before, either with the HPS backend, or the TRT backend with the HPS TRT plugin?

dmac commented 3 months ago

One more question: do models served by the Triton HPS backend need to be trained with the HugeCTR framework? Or is there a way to convert an existing ONNX model to a format that can be served by the HPS backend? (Currently we train a model with PyTorch, export to ONNX, then convert to TRT.)

yingcanw commented 3 months ago

Have you seen anyone using a dynamic, constantly changing HPS configuration like this before, either with the HPS backend, or the TRT backend with the HPS TRT plugin?

Yes, there are users with this requirement, which is why we support online updating and unloading of models in the HPS Triton backend. However, such users mainly use HPS as an independent embedding query service, for example building an HPS+TF/TRT inference pipeline with Triton's ensemble mode, so that they can independently control online updates of the embedding tables in the HPS backend.
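
Roughly, such a pipeline is wired together with a Triton ensemble config. A hypothetical sketch with placeholder model and tensor names (the HPS backend's actual tensor names are documented in its examples):

name: "ctr_ensemble"
platform: "ensemble"
max_batch_size: 64
input [ { name: "CATEGORICAL_KEYS", data_type: TYPE_INT64, dims: [ 26 ] } ]
output [ { name: "CTR_SCORE", data_type: TYPE_FP32, dims: [ 1 ] } ]
ensemble_scheduling {
  step [
    {
      # step 1: HPS backend model does the embedding lookup (updatable online)
      model_name: "hps_embedding"
      model_version: -1
      input_map  { key: "KEYS", value: "CATEGORICAL_KEYS" }
      output_map { key: "OUTPUT0", value: "embeddings" }
    },
    {
      # step 2: TensorRT dense network consumes the looked-up embeddings
      model_name: "dense_trt"
      model_version: -1
      input_map  { key: "input_embeddings", value: "embeddings" }
      output_map { key: "output", value: "CTR_SCORE" }
    }
  ]
}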

One more question: do models served by the Triton HPS backend need to be trained with the HugeCTR framework? Or is there a way to convert an existing ONNX model to a format that can be served by the HPS backend? (Currently we train a model with PyTorch, export to ONNX, then convert to TRT.)

No, that's not necessary. The input format of HPS can be found here. If you need to convert a Torch embedding model to the HPS format, you can refer to the example.
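
As a rough illustration of what such a conversion amounts to, a hypothetical sketch that dumps a torch.nn.Embedding into the HPS sparse-file layout (a directory holding binary "key" and "emb_vector" files; please verify the exact layout against the docs and example above):

import os

import numpy as np
import torch

# Hypothetical: write one table's keys (int64) and vectors (float32) in the
# layout HPS expects for a sparse model directory.
embedding = torch.nn.Embedding(num_embeddings=1000, embedding_dim=16)
keys = np.arange(embedding.num_embeddings, dtype=np.int64)
vectors = embedding.weight.detach().numpy().astype(np.float32)

os.makedirs("c1_sparse_0.model", exist_ok=True)
keys.tofile("c1_sparse_0.model/key")
vectors.tofile("c1_sparse_0.model/emb_vector")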

dmac commented 3 months ago

Ok, it sounds like what we want is an ensemble model, with the embeddings looked up from the HPS backend instead of through the HPS TRT plugin. Thanks for the advice.