bschifferer opened this issue 2 years ago
@bschifferer what ops are being run here? Session aggregation wouldn't normally happen at this stage; the aggregated session should be passed in to the inference server.
Definitely sounds like there are some optimizations we can do, but maybe one of them is to make sure the ops we're running are in line with those being run for GTC.
There's a whole category of work here that we haven't really dug into as a team: structuring workflows so they're servable at all, and then servable with acceptable latency. DAG sub-graphs seem relevant here, since there are some ops we want to run at serving time (i.e. a serving sub-graph) and others we don't (i.e. a training sub-graph); a sketch of what I mean is at the end of this comment. Ideally we'd already have that capability, having identified the need for it over a year ago when I worked on the first end-to-end POC.
Unfortunately, that hasn't happened. This is one of many Merlin infrastructure improvements we saw coming a long way off but haven't been able to work on much, for a variety of reasons: continually shifting priorities, perpetual overloading of the team, lack of a clear decision-making process for team-wide technical decisions, two of the few developers with relevant context fighting constant CI fires, and so on.
So I don't have any good news here.
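To make the serving/training split concrete, here's a minimal sketch using what's available today: fit the full workflow offline, then prune training-only inputs before export. This assumes NVTabular's `Workflow.remove_inputs` behaves as in the existing serving examples; the column names, paths, and the `target` column are hypothetical.

```python
import nvtabular as nvt
from nvtabular import ops

# Training-time graph: feature encodings plus a label column that only
# matters offline.
cats = ["item_id", "category"] >> ops.Categorify()
conts = ["timestamp"] >> ops.Normalize()
workflow = nvt.Workflow(cats + conts + ["target"])
workflow.fit(nvt.Dataset("train.parquet"))  # hypothetical path

# Serving-time graph: drop inputs the inference server never receives
# (and anything that depends only on them), then export the rest for
# the Triton NVTabular backend.
workflow.remove_inputs(["target"])
workflow.save("/models/workflow")  # hypothetical path
```

That only covers dropping columns, though; it can't express "run this op at training time but not at serving time," which is exactly what first-class serving/training sub-graphs in the DAG would give us.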
What questions are you trying to answer? Please describe.
For the GTC Recommender, we want to deploy NVTabular + Transformers4Rec. Currently, NVTabular is the bottleneck. We experienced the same slow NVTabular performance a year ago (see Criteo, below).
Transformers4Rec: I ran the Transformers4Rec + Triton example https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/examples/end-to-end-session-based/02-End-to-end-session-based-with-Yoochoose-PyT.ipynb and did a basic latency test by sending ~100 requests back-to-back. The first request is slow, but the rest seem to be roughly constant.
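For reference, the back-to-back test was along these lines (a sketch using the Triton gRPC client; the model name, input name, dtype, and shape are placeholders that have to match what the notebook actually exports — `client.get_model_metadata(...)` shows the real ones):

```python
import time
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Hypothetical input; the real name/shape/dtype must match the deployed
# model (see client.get_model_metadata("t4r_pytorch")).
inp = grpcclient.InferInput("item_id-list", [1, 20], "INT64")
inp.set_data_from_numpy(np.ones((1, 20), dtype=np.int64))

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    client.infer("t4r_pytorch", [inp])  # hypothetical model name
    latencies_ms.append((time.perf_counter() - start) * 1000)

# The first request pays one-time warm-up costs; the rest should be steady.
print(f"first: {latencies_ms[0]:.1f} ms")
print(f"median of rest: {np.median(latencies_ms[1:]):.1f} ms")
```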
Criteo (from 2021): Running the Criteo End-to-End TensorFlow example deploys an ensemble of NVTabular + TensorFlow to Triton. Using perf_analyzer, the performance is not as expected. We should investigate the latency of the NVTabular model on Triton.
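One way to investigate: time the NVTabular component model directly and compare it against the full ensemble, so NVTabular's share of the end-to-end latency is visible. A sketch, with hypothetical model and input names (`client.get_model_repository_index()` lists the real ones, and the real Criteo workflow expects all of its input columns, not just one):

```python
import time
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Hypothetical single raw input; the NVTabular model and the ensemble
# both take raw features, so the same request body works for both.
inp = grpcclient.InferInput("I1", [64, 1], "INT64")
inp.set_data_from_numpy(np.random.randint(0, 100, size=(64, 1), dtype=np.int64))

for model in ["criteo_nvt", "criteo_ens"]:  # hypothetical model names
    start = time.perf_counter()
    for _ in range(50):
        client.infer(model, [inp])
    per_req_ms = (time.perf_counter() - start) / 50 * 1000
    print(f"{model}: {per_req_ms:.1f} ms/request")
```

If `criteo_nvt` accounts for most of the ensemble's time, that confirms the workflow execution itself, rather than the TensorFlow model or ensemble scheduling, is the bottleneck.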