bschifferer opened this issue 2 years ago
@bschifferer what ops are being run here? Session aggregation wouldn't normally happen at this stage; the aggregated session should be passed in to the inference server.
Definitely sounds like there are some optimizations we can do, but maybe one of them is to make sure the ops we're running are in line with those being run for GTC.
There's a whole category of work here that we haven't really dug into as a team: structuring workflows so they're servable at all, and then servable with acceptable latency. DAG sub-graphs seem relevant here, since there are some ops we want to run at serving time (i.e. a serving sub-graph) and others we don't (i.e. a training sub-graph); a sketch of what I mean is at the end of this comment. Ideally we'd already have that capability, having identified the need for it over a year ago when I worked on the first end-to-end POC.
Unfortunately, that hasn't happened. This is one of many Merlin infrastructure improvements we saw coming a long way off but haven't been able to work on much, for a variety of reasons: continually shifting priorities, perpetual overloading of the team, lack of a clear decision-making process for team-wide technical decisions, two of the few developers with relevant context fighting constant CI fires, and so on.
So I don't have any good news here.
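To make the serving/training split concrete, here's a minimal sketch using what's available today: fit the full workflow offline, then prune training-only inputs before export. This assumes NVTabular's `Workflow.remove_inputs` behaves as in the existing serving examples; the column names, paths, and the `target` column are hypothetical.

```python
import nvtabular as nvt
from nvtabular import ops

# Training-time graph: feature encodings plus a label column that only
# matters offline.
cats = ["item_id", "category"] >> ops.Categorify()
conts = ["timestamp"] >> ops.Normalize()
workflow = nvt.Workflow(cats + conts + ["target"])
workflow.fit(nvt.Dataset("train.parquet"))  # hypothetical path

# Serving-time graph: drop inputs the inference server never receives
# (and anything that depends only on them), then export the rest for
# the Triton NVTabular backend.
workflow.remove_inputs(["target"])
workflow.save("/models/workflow")  # hypothetical path
```

That only covers dropping columns, though; it can't express "run this op at training time but not at serving time," which is exactly what first-class serving/training sub-graphs in the DAG would give us.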
What questions are you trying to answer? Please describe.
For the GTC Recommender, we want to deploy NVTabular + Transformers4Rec. Currently, NVTabular is the bottleneck. We experienced the same slow NVTabular performance a year ago (see Criteo, below).
Transformers4Rec: I ran the Transformers4Rec + Triton example https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/examples/end-to-end-session-based/02-End-to-end-session-based-with-Yoochoose-PyT.ipynb and did a basic latency test by sending ~100 requests back-to-back. The first request is slow, but the rest seem to be roughly constant.
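For reference, the back-to-back test was along these lines (a sketch using the Triton gRPC client; the model name, input name, dtype, and shape are placeholders that have to match what the notebook actually exports — `client.get_model_metadata(...)` shows the real ones):

```python
import time
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Hypothetical input; the real name/shape/dtype must match the deployed
# model (see client.get_model_metadata("t4r_pytorch")).
inp = grpcclient.InferInput("item_id-list", [1, 20], "INT64")
inp.set_data_from_numpy(np.ones((1, 20), dtype=np.int64))

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    client.infer("t4r_pytorch", [inp])  # hypothetical model name
    latencies_ms.append((time.perf_counter() - start) * 1000)

# The first request pays one-time warm-up costs; the rest should be steady.
print(f"first: {latencies_ms[0]:.1f} ms")
print(f"median of rest: {np.median(latencies_ms[1:]):.1f} ms")
```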
Criteo (from 2021): Running the Criteo End-to-End TensorFlow example deploys an ensemble of NVTabular + TensorFlow to Triton. Using perf_analyzer, the performance is not as expected. We should investigate the latency of the NVTabular model on Triton.
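One way to investigate: time the NVTabular component model directly and compare it against the full ensemble, so NVTabular's share of the end-to-end latency is visible. A sketch, with hypothetical model and input names (`client.get_model_repository_index()` lists the real ones, and the real Criteo workflow expects all of its input columns, not just one):

```python
import time
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Hypothetical single raw input; the NVTabular model and the ensemble
# both take raw features, so the same request body works for both.
inp = grpcclient.InferInput("I1", [64, 1], "INT64")
inp.set_data_from_numpy(np.random.randint(0, 100, size=(64, 1), dtype=np.int64))

for model in ["criteo_nvt", "criteo_ens"]:  # hypothetical model names
    start = time.perf_counter()
    for _ in range(50):
        client.infer(model, [inp])
    per_req_ms = (time.perf_counter() - start) / 50 * 1000
    print(f"{model}: {per_req_ms:.1f} ms/request")
```

If `criteo_nvt` accounts for most of the ensemble's time, that confirms the workflow execution itself, rather than the TensorFlow model or ensemble scheduling, is the bottleneck.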