The Triton server(s) could be organized in several different ways for a realistic production deployment.
A. One server per model
   - Requires some central map of IP:model name
   - Does this imply one model per GPU?
B. Single server for all models (and all GPUs)
   - Load-balancing already works well
   - Need to ensure serving multiple models can be done efficiently
C. Some hybrid of A and B
D. Other?
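To make the "central map" in option A concrete, here is a minimal sketch of a lookup that also covers the hybrid in option C. Everything in it is an assumption for illustration: the `ServerRegistry` class, the model names, and the addresses are hypothetical, not part of Triton or any existing infrastructure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServerRegistry:
    """Hypothetical central map from model name to Triton server addresses."""

    # Servers dedicated to one specific model (option A).
    dedicated: Dict[str, List[str]] = field(default_factory=dict)
    # Shared servers assumed to host every remaining model (option B).
    shared: List[str] = field(default_factory=list)

    def resolve(self, model_name: str) -> List[str]:
        """Return dedicated servers for the model if any, else the shared pool."""
        servers = self.dedicated.get(model_name) or self.shared
        if not servers:
            raise LookupError(f"no Triton server known for model {model_name!r}")
        return list(servers)


# Placeholder model names and addresses, for illustration only.
registry = ServerRegistry(
    dedicated={"model_a": ["10.0.0.1:8001"]},
    shared=["10.0.0.10:8001", "10.0.0.11:8001"],
)
```

With an empty `dedicated` map this reduces to option B; with an empty `shared` list it reduces to option A. Which server in the returned list a client actually contacts would still be up to whatever load balancing is in place.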
In addition, it's likely that each Tier1/Tier2 site (at least) would eventually have its own GPU servers, to reduce latency. The IP addresses of each site's server(s) could be tracked in e.g. site-local-config.xml or another appropriate part of the production infrastructure.
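As a rough illustration of the per-site tracking idea, the sketch below reads server addresses out of an XML fragment. The `<triton-server>` element, its attributes, and the site name are invented for this example and are not part of the existing site-local-config.xml schema; the addresses are documentation placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <triton-server> element and its attributes are
# assumptions made for illustration, not an existing schema.
EXAMPLE_CONFIG = """
<site-local-config>
  <site name="T2_EXAMPLE">
    <triton-server address="192.0.2.10" port="8001"/>
    <triton-server address="192.0.2.11" port="8001"/>
  </site>
</site-local-config>
"""


def site_servers(xml_text, site_name):
    """Return "address:port" strings for the Triton servers listed for one site."""
    root = ET.fromstring(xml_text.strip())
    for site in root.iter("site"):
        if site.get("name") == site_name:
            return [
                f'{s.get("address")}:{s.get("port")}'
                for s in site.iter("triton-server")
            ]
    # Unknown site: no local servers configured.
    return []
```

A job running at a site with no entry gets an empty list back, at which point it could fall back to a remote server or to local CPU inference; that policy is outside the scope of this sketch.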
Triton 2.X supports HTTPS/SSL, which could potentially be used for client-server authentication in production.
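For reference, this is roughly what the client side of such authentication could look like using only Python's standard `ssl` module. It is a generic TLS sketch, not Triton's client API: the function name and parameters are hypothetical, and how the resulting settings would be handed to a Triton client is not shown here.

```python
import ssl


def make_client_tls_context(ca_file=None, client_cert=None, client_key=None):
    """Build a TLS context that verifies the server and can present a client certificate."""
    # The default context requires a valid server certificate and checks the hostname.
    # With ca_file=None, the system's default CA certificates are used.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    if client_cert is not None:
        # Mutual TLS: the server can authenticate the client from this certificate.
        ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return ctx


# No certificate files are needed for the server-verification-only case.
ctx = make_client_tls_context()
```

Server verification alone only tells the client it is talking to the right server; authenticating clients to the server would need the client-certificate branch (or some token scheme), plus matching configuration on the server side.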
attn: @violatingcp @holzman @mapsacosta