NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training.
Apache License 2.0

[Requirement] Add python DLPack interface for HPS lookup #318

Closed nv-dlasalle closed 1 year ago

nv-dlasalle commented 2 years ago

Is your feature request related to a problem? Please describe. In HPS, for GNNs it would be great to have a Python lookup function that could take in indices on the GPU and return the gathered rows, also on the GPU.

Describe the solution you'd like It would be great to have a Python function like the following, taking in DLPack objects (https://dmlc.github.io/dlpack/latest/python_spec.html):

```python
def lookup_from_dlpack(indices, out):
    """Gather the rows associated with the indices into the tensor out.

    Parameters:

    indices : DLPack capsule
        The input indices on the GPU used to fetch the corresponding rows.
    out : DLPack capsule
        The output memory location for the lookup; should be of shape
        [indices.shape[0], embedding_size].
    """
```

The inputs don't have to be DLPack objects, but they should at least be objects convertible to DLPack (e.g., PyTorch tensors, CuPy arrays, etc.).
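
For concreteness, a minimal sketch of how this could be called from PyTorch (`lookup_from_dlpack` is the proposed function above, not an existing HPS API; `embedding_size` is an assumed value):

```python
import torch
from torch.utils.dlpack import to_dlpack

embedding_size = 128  # assumed embedding width for illustration

# Indices to gather and a preallocated output buffer, both on the GPU.
indices = torch.tensor([3, 7, 42], dtype=torch.int64, device="cuda")
out = torch.empty(indices.shape[0], embedding_size,
                  dtype=torch.float32, device="cuda")

# Only DLPack capsules cross the Python/C++ boundary; no host copies.
lookup_from_dlpack(to_dlpack(indices), to_dlpack(out))
```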

Describe alternatives you've considered The alternative is to create a second library to wrap the C++ interface of HPS and provide the above function there.

Additional context This would be used both in training and inference of GNNs at large scale.

yingcanw commented 2 years ago

Thanks for your feedback! Support for DLPack will be included in a future release, but the limitations of the Python interface need to be clarified.

Due to its hierarchical structure design, HPS is more naturally suited to custom integration into general-purpose inference platforms, which means that complex deployment scenarios can be customized to improve inference performance. Since we decoupled HPS in 22.05, it can be used or encapsulated as an independent library. Therefore, the recommended integration method is to implement a customized integration for each inference platform, such as a TensorRT plugin or a TensorFlow custom op, so that the inference performance of HPS on the target platform can be maximized.
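
To illustrate the TensorFlow custom-op route, a sketch of what such an integration could look like (the shared-library and op names here are hypothetical, not a released HPS API):

```python
import tensorflow as tf

# Load a hypothetical shared library that registers an HPS lookup op.
hps_ops = tf.load_op_library("libhps_tf_ops.so")

@tf.function
def lookup(indices):
    # The custom op would perform the HPS gather entirely on the GPU,
    # returning the embedding rows for the given indices.
    return hps_ops.hps_lookup(indices, model_name="demo", table_id=0)
```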

nv-dlasalle commented 2 years ago

@yingcanw Thanks for the feedback. Let me give you some more background on how this would be used for GNNs. We would use this both for training and inference, as the input to a network is a subgraph and embeddings for the nodes and/or edges in that subgraph.

Typically, for a given mini-batch, we fetch the input features associated with the input nodes of our subgraph. The number of input nodes usually ranges from 100 thousand to 1 million, and the input dimension ranges from 64 to 4096. This means the output of the lookup would be in the range of 64 MB to 16 GB (a few GB would probably be most common).
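
As a back-of-envelope check of those sizes (fp32 assumed; the exact numbers depend on the element dtype):

```python
# Output-buffer size = num_indices * embedding_dim * bytes_per_element.
def lookup_output_gb(num_indices, embedding_dim, bytes_per_elem=4):
    return num_indices * embedding_dim * bytes_per_elem / 1e9

print(lookup_output_gb(100_000, 64))      # small end: ~0.03 GB
print(lookup_output_gb(1_000_000, 4096))  # large end: ~16.4 GB
```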

We will also have one Python process per GPU. The GIL isn't usually an issue, since most work is performed asynchronously and Python is just used to schedule it. The ability to pass in a stream ID when performing a lookup would be useful for allowing computational work to overlap with the lookup, but it is not required at this point.
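
To illustrate the stream point, a hypothetical stream-aware variant (the `stream` argument, like `lookup_from_dlpack` itself, is part of the request rather than an existing interface):

```python
import torch
from torch.utils.dlpack import to_dlpack

indices = torch.randint(0, 10**6, (100_000,), device="cuda")
out = torch.empty(indices.shape[0], 128, device="cuda")

side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    # Hypothetical: enqueue the lookup on a side stream so it can overlap
    # with compute work running on the default stream.
    lookup_from_dlpack(to_dlpack(indices), to_dlpack(out),
                       stream=side_stream.cuda_stream)

# Make the default stream wait for the lookup before consuming `out`.
torch.cuda.current_stream().wait_stream(side_stream)
```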

yingcanw commented 2 years ago

@nv-dlasalle Thanks for your detailed background info. If my understanding is correct, you will basically not use complex HPS deployment scenarios (just one Python process per GPU, due to the large input batch size), so there will be no issues like CUDA initialization for multi-process/multi-thread use on multiple GPUs. Therefore, by supporting DLPack-format tensors, HPS will indeed provide more convenient platform compatibility for offline inference with DLPack capsule input. We will support the DLPack interface for HPS lookup in the next release.

yingcanw commented 1 year ago

The DLPack interface has been supported since version 22.07.