Open avianion opened 2 months ago
@samurdhikaru could you help advise on the plugin question?
@avianion Your plugin implementation is strange:
In UpdateInferenceInputsPlugin::supportsFormatCombination, inputs[1] seems to be interpreted as INT32:
bool UpdateInferenceInputsPlugin::supportsFormatCombination(
    int pos, nvinfer1::PluginTensorDesc const *inOut, int nbInputs, int nbOutputs) noexcept
{
    ...
    // Check input tensors
    if (pos < nbInputs)
    {
        if (pos == 0 || pos == 1 || pos == 4 || pos == 5)
        {
            return inOut[pos].type == DataType::kINT32 && inOut[pos].format == TensorFormat::kLINEAR;
        }
UpdateInferenceInputsPlugin::getOutputDataType indicates that outputs[1] is to be HALF:
nvinfer1::DataType UpdateInferenceInputsPlugin::getOutputDataType(
    ...
    else
    {
        // For output_hidden_states, return HALF
        return DataType::kHALF;
    }
}
Then you're copying from inputs[1] to outputs[1]:

cudaMemcpyAsync(outputs[1], inputs[1], hidden_states_size_original, cudaMemcpyDeviceToDevice, stream);
What is the intended effect here?
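If the intent is an identity copy of half-precision hidden states, one possibility (a minimal sketch, not the actual plugin; the tensor positions below are assumptions) is to declare the input and the matching output with the same type, so the 1:1 copy is well defined:

// Sketch only: declare the hidden-states input and output as HALF so the
// identity copy in enqueue() operates on matching types; the positions used
// here (1 for the input, nbInputs + 1 for the output) are assumptions, not
// the plugin's real layout.
bool UpdateInferenceInputsPlugin::supportsFormatCombination(
    int pos, nvinfer1::PluginTensorDesc const *inOut, int nbInputs, int nbOutputs) noexcept
{
    if (inOut[pos].format != nvinfer1::TensorFormat::kLINEAR)
    {
        return false;
    }
    if (pos == 1 || pos == nbInputs + 1)
    {
        return inOut[pos].type == nvinfer1::DataType::kHALF;
    }
    return inOut[pos].type == nvinfer1::DataType::kINT32;
}

With matching declarations, getOutputDataType returning kHALF for that output and the memcpy in enqueue agree on the element type on both sides of the copy.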
Description
I am unable to perform an identity operation that copies a float16 tensor out 1:1. It works fine with int32 and int64 tensors, but this copy is the basis of a plugin I am trying to create that manipulates float16 data. If the identity (copy-out) operation doesn't work, I'm not sure how to proceed.
self.debug_buffer["full_hidden_states"] tensor([[ 4.1797, -0.2051, -1.8369, ..., -2.8945, 1.3564, 0.3196], [ 2.1680, -2.5840, -2.0742, ..., -2.5312, -0.4636, -2.8867], [ 2.1934, -3.1797, 1.9854, ..., 0.7856, -0.0352, -1.5967], ..., [-0.6577, 2.3828, 7.0742, ..., 2.1074, -1.9043, -0.1153], [ 0.9321, -3.4199, 0.9727, ..., 0.4680, -3.3691, -1.2725], [-0.6484, 0.9282, 0.6196, ..., 5.9570, -4.6875, -0.6816]], device='cuda:0', dtype=torch.float16)
Above is the tensor I am trying to copy out, and below is the result I'm getting with the TensorRT plugin I have created:
self.debug_buffer['output_hidden_states']
tensor([[[ 4.1797, -0.2051, -1.8369, ..., -2.8945,  1.3564,  0.3196],
         [ 2.1680, -2.5840, -2.0742, ..., -2.5312, -0.4636, -2.8867],
         [ 2.1934, -3.1797,  1.9854, ...,  0.7856, -0.0352, -1.5967],
         ...,
         [ 0.0000,  0.0000,  0.0000, ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000, ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000, ...,  0.0000,  0.0000,  0.0000]]],
       device='cuda:0', dtype=torch.float16)
This is bizarre; I'm hoping I have simply misconfigured something. In the plugin code I am simply doing a straight copy-out operation. The model I'm running is Meta Llama 3 Instruct 8B, but I believe this issue is independent of the model.
I am building the network with TensorRT-LLM.
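A rough sketch of the 1:1 copy I am attempting (a hypothetical helper, not the exact plugin code), with the byte count derived from the runtime tensor descriptor rather than hard-coded:

#include <NvInfer.h>
#include <cuda_fp16.h>
#include <cuda_runtime_api.h>

// Hypothetical helper illustrating the intended identity copy: the byte count
// is derived from the runtime shape in the PluginTensorDesc and the declared
// element type (2 bytes per element for kHALF).
static void copyTensor1to1(nvinfer1::PluginTensorDesc const &desc, void const *src, void *dst,
    cudaStream_t stream)
{
    int64_t numElements = 1;
    for (int32_t i = 0; i < desc.dims.nbDims; ++i)
    {
        numElements *= desc.dims.d[i];
    }
    size_t elementSize
        = (desc.type == nvinfer1::DataType::kHALF) ? sizeof(__half) : sizeof(int32_t);
    cudaMemcpyAsync(dst, src, numElements * elementSize, cudaMemcpyDeviceToDevice, stream);
}

If fewer bytes are copied than the output buffer actually holds (for example because the size was computed from a different shape), the tail of the output stays at its initialized value, which would look like the trailing zeros above.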
Environment
TensorRT Version: 10.3
NVIDIA GPU: 2x H100 NVL
NVIDIA Driver Version: 555.42.06
CUDA Version: 12.5
CUDNN Version:
Operating System:
Python Version (if applicable): 3.10
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Model link: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
Steps To Reproduce
1) Build Llama 3 8B with TensorRT-LLM using the official instructions.
2) Create a TensorRT-LLM plugin with the code above.
3) Copy out any float16 tensor.
4) Use self.debug_buffer in the Python runtime to log the float16 tensor and observe that it differs from the original.
Commands or scripts:
Have you tried the latest release?:
Yes
Can this model run on other frameworks? For example, run the ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Yes.