NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

How can I get a PyTorch tensor from GPU memory without copying? #62

Closed linssswww closed 5 years ago

linssswww commented 5 years ago

I want to speed up part of Faster R-CNN FPN: the feature map extractor. The feature maps are large. I get the output of TensorRT as a mem_alloc object, but I need a PyTorch tensor. When I try to convert the mem_alloc object to a PyTorch tensor, it spends too much time in memcpy from GPU to CPU. How can I convert a cuda.mem_alloc object to a PyTorch tensor without copying?

my code:

import pycuda.driver as cuda
import torch

binding = [int(d_input), int(d_output[0]), int(d_output[1]), int(d_output[2]), int(d_output[3])]
cuda.memcpy_htod_async(d_input, input_data_tensor.data.cpu().numpy().astype(NPDTYPE), stream)
context.execute(1, binding)
# these device-to-host copies are the slow part I want to avoid
cuda.memcpy_dtoh_async(output1, d_output[0], stream)
cuda.memcpy_dtoh_async(output2, d_output[1], stream)
cuda.memcpy_dtoh_async(output3, d_output[2], stream)
cuda.memcpy_dtoh_async(output4, d_output[3], stream)
stream.synchronize()

# and these copy the results back from the CPU to the GPU again
ou1 = torch.tensor(output1, device="cuda")
ou2 = torch.tensor(output2, device="cuda")
ou3 = torch.tensor(output3, device="cuda")
ou4 = torch.tensor(output4, device="cuda")

narendasan commented 5 years ago

You can potentially use PyTorch for your data management entirely. This is an example of using PyTorch Tensors for both the input and output buffers of the engine (as opposed to pycuda). https://github.com/NVIDIA-AI-IOT/torch2trt/blob/master/torch2trt/torch2trt.py#L206
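
Roughly, that pattern looks like the sketch below. This is a minimal illustration rather than code from this repo: the shapes, the binding order, and the assumption of exactly one input binding and one output binding are all made up for the example.

    import tensorrt as trt
    import torch

    # Assumed: `engine` is an already-deserialized trt.ICudaEngine with one input
    # binding (index 0) and one output binding (index 1); shapes are illustrative.
    context = engine.create_execution_context()

    input_tensor = torch.randn(1, 3, 224, 224, device="cuda").contiguous()
    output_shape = tuple(engine.get_binding_shape(1))          # pre-8.5 binding API
    output_tensor = torch.empty(output_shape, device="cuda")   # assumes an FP32 output

    # Pass the raw device pointers of the torch tensors as bindings, so TensorRT
    # reads its input from, and writes its output into, memory PyTorch already owns.
    bindings = [input_tensor.data_ptr(), output_tensor.data_ptr()]
    stream = torch.cuda.current_stream()
    context.execute_async_v2(bindings, stream_handle=stream.cuda_stream)
    stream.synchronize()

    # output_tensor is now an ordinary CUDA torch.Tensor; no device-to-host copy occurred.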

linssswww commented 5 years ago

@narendasan thanks for your help, my problem has been solved. But there is a new problem: after I sped up the feature map extractor, it is only about 10% faster. Is that normal?

narendasan commented 5 years ago

You might want to try using a reduced operating precision (FP16 or INT8) to further improve performance.
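
For reference, enabling FP16 at build time looks roughly like the sketch below, assuming a TensorRT version that exposes IBuilderConfig; INT8 additionally needs a calibrator or explicit dynamic ranges, which is omitted here, and `network` is assumed to be built elsewhere.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()
    # Only request FP16 if the GPU actually has fast FP16 support.
    if builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    # engine = builder.build_engine(network, config)  # `network` defined elsewhere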

zimenglan-sysu-512 commented 5 years ago

hi @narendasan and @linssswww, you can see jetbot/tensorrt_model.py. It is a simpler way to do it. Hope it can help you.

pageedward commented 4 years ago

hi @linssswww, why is it that when I convert an FPN from PyTorch to TensorRT, the TensorRT version is slower than the PyTorch one? The issue is in this link: https://github.com/NVIDIA/TensorRT/issues/458

prathik-naidu commented 4 years ago

@zimenglan-sysu-512 @linssswww I tried implementing inference with PyTorch tensors as bindings, however I'm running into an issue: https://github.com/NVIDIA/TensorRT/issues/303#issuecomment-652187126

Basically I'm getting ‘../rtSafe/safeContext.cpp (133) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)’ on a simple test case.

Any ideas why this might be happening? The error only happens when I use PyTorch tensor bindings (if I use cuda.mem_alloc, this issue doesn't happen). I'm trying to get GPU torch tensors as output from my TensorRT engine.

Thanks!

rmccorm4 commented 4 years ago

Hi @prathik-naidu ,

I tried implementing inference with PyTorch tensors as bindings, however I'm running into an issue: #303 (comment)

Basically I'm getting ‘../rtSafe/safeContext.cpp (133) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)’ on a simple test case.

I believe this is a PyTorch issue: https://github.com/pytorch/pytorch/issues/32983

prathik-naidu commented 4 years ago

Thanks @rmccorm4, I didn't see this issue before. Was there any solution to this? I'm not sure whether a difference in cuDNN version is the explanation (although for me this doesn't seem to be the case, since I'm on 7.6 according to my /usr/include/cudnn.h, and PyTorch + TensorRT both match that).

prathik-naidu commented 4 years ago

@rmccorm4 I was continuing to investigate this issue and figured out how to get this working (the CUDNN_STATUS_MAPPING_ERROR is resolved):

    self.engine = self._load_engine()
    self.context = self.engine.create_execution_context()
    inputs = [torch.ones((1, 3, 256, 416), device="cuda:0")]  # move this line BEFORE pycuda.autoinit

    import pycuda.autoinit
    outputs = [torch.zeros((1, 3, 8, 13), device="cuda:0"), torch.zeros((1, 3, 16, 26), device="cuda:0"),
               torch.zeros((1, 3, 32, 52), device="cuda:0"), torch.zeros((1, 6552, 6), device="cuda:0")]
    bindings = [_input.data_ptr() for _input in inputs] + [_output.data_ptr() for _output in outputs]

    self.context.execute_v2(bindings)

The key here, as shown in the code above, is to allocate the input tensors on the GPU BEFORE setting up a new CUDA context (import pycuda.autoinit) and allocating the output tensors. This works as desired and the original error isn't shown anymore, but I'm trying to understand what's going on behind the scenes. It seems like torch has its own context that becomes inconsistent between inputs and outputs in the original code? Not too sure about this, though.
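
One way to avoid juggling two contexts at all (a sketch I have not verified against this exact setup) is to skip pycuda.autoinit and instead attach pycuda to the primary context that PyTorch already uses, so both libraries run in the same context:

    import pycuda.driver as cuda
    import torch

    torch.zeros(1, device="cuda:0")   # force PyTorch to initialize CUDA on device 0
    cuda.init()
    ctx = cuda.Device(0).retain_primary_context()  # same primary context PyTorch uses
    ctx.push()
    # ... create the TensorRT execution context and run it with torch-tensor bindings ...
    ctx.pop()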

Joevaen commented 4 years ago

You can potentially use PyTorch for your data management entirely. This is an example of using PyTorch Tensors for both the input and output buffers of the engine (as opposed to pycuda). https://github.com/NVIDIA-AI-IOT/torch2trt/blob/master/torch2trt/torch2trt.py#L206

I have the same issue, could you please describe it in detail for me? I can't find the answer in the link. Thanks!

Joevaen commented 4 years ago

@narendasan thanks for your help, my problem has been solved. But there is a new problem: after I sped up the feature map extractor, it is only about 10% faster. Is that normal?

Could you please provide your solution? I'm facing the same problem.