NVIDIA / VideoProcessingFramework

Set of Python bindings to C++ libraries which provides full HW acceleration for video decoding, encoding and GPU-accelerated color space and pixel format conversions
Apache License 2.0

weird compatibility issue with pytorch > 1.6 #203

Closed lferraz closed 3 years ago

lferraz commented 3 years ago

Describe the bug

On my server everything works perfectly with the latest version of VPF (with Video Codec SDK 9 or 10) and pytorch 1.6 + cuda 10.2 (I also tested with several drivers).

However, I am now trying to update to pytorch 1.8.1 and there are problems when I use VPF. At some point I get the following error: RuntimeError: cusolver error: 7, when calling cusolverDnSgetrs( handle, CUBLAS_OP_N, n, nrhs, dA, lda, ipiv, ret, ldb, info). This error appears when torch.inverse(input) is called.

I tested my project without VPF and in that case everything works fine. The problem appears only when I use VPF: adding even the simplest VPF code (e.g. nvc.PyNvDecoder(data_source, 0) at the beginning of my script) triggers the error above.
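
For reference, the numeric code in that message maps to a named cusolverStatus_t value. A minimal sketch, assuming the enum values declared in cusolver_common.h; `decode_cusolver_error` is a hypothetical helper, not part of PyTorch or VPF:

```python
import re

# Values of cusolverStatus_t as declared in cusolver_common.h (first eight entries).
CUSOLVER_STATUS = {
    0: "CUSOLVER_STATUS_SUCCESS",
    1: "CUSOLVER_STATUS_NOT_INITIALIZED",
    2: "CUSOLVER_STATUS_ALLOC_FAILED",
    3: "CUSOLVER_STATUS_INVALID_VALUE",
    4: "CUSOLVER_STATUS_ARCH_MISMATCH",
    5: "CUSOLVER_STATUS_MAPPING_ERROR",
    6: "CUSOLVER_STATUS_EXECUTION_FAILED",
    7: "CUSOLVER_STATUS_INTERNAL_ERROR",
}


def decode_cusolver_error(message):
    """Return the status name for a 'cusolver error: N' message, or None if absent."""
    match = re.search(r"cusolver error: (\d+)", message)
    if match is None:
        return None
    return CUSOLVER_STATUS.get(int(match.group(1)), "unknown status")
```

For the message above this gives CUSOLVER_STATUS_INTERNAL_ERROR, which by itself says little about the cause.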

To Reproduce

I tried to reproduce this error in a small piece of code, but I could not.

Ideally this should fail, but it does not :'(

import torch
import PyNvCodec as nvc
a = nvc.PyNvDecoder('video.mp4', 0)
torch.randn((3,3), device=torch.device('cuda:0')).inverse()

Desktop (please complete the following information):

I tested all the possible combinations of cuda + pytorch + video SDK.

Additional context

It looks like the problem is in the cublas library. I tried compiling VPF with cuda 10 and using it from cuda 11, and that fails directly at import, but this problem with cublas looks quite hidden.

I feel it is quite complex to solve my problem but I'd like to get some feedback from you.

rarzumanyan commented 3 years ago

Hi @lferraz

Before we start digging too deep - have you rebuilt the VPF after the PyTorch update and did you check that latest PyTorch is used during the PytorchNvCodec target build?

lferraz commented 3 years ago

Hi @rarzumanyan, I am not using PytorchNvCodec; I use the import/export methods. And to trigger the error I do not even need to import/export anything, only to create an nvc.PyNvDecoder object. :S

I recompiled VPF inside the env I have for my project. Everything in the cmake report makes sense to me.

-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 7.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /code/optima-workspace/vision/.dev_env/envs/vision/bin/x86_64-conda-linux-gnu-cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The CUDA compiler identification is NVIDIA 11.3.58
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Searching for FFmpeg libs in /lib
-- Searching for FFmpeg headers in /include
-- Searching for Video Codec SDK headers in /code/Video_Codec_SDK_11.0.10/include folder
-- Searching for Video Codec SDK headers in /code/Video_Codec_SDK_11.0.10/Interface folder
-- Found PythonLibs:/code/optima-workspace/vision/.dev_env/envs/vision/lib/libpython3.7m.so (found suitable version "3.7.2", minimum required is "3.5") 
-- Found PythonInterp: /code/optima-workspace/vision/.dev_env/envs/vision/bin/python3.7 (found version "3.7.2") 
-- Found PythonLibs: /code/optima-workspace/vision/.dev_env/envs/vision/lib/libpython3.7m.so
-- pybind11 v2.3.dev0
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- LTO enabled
-- Configuring done
-- Generating done
-- Build files have been written to: /code/VideoProcessingFramework/build

lferraz commented 3 years ago

Hi @rarzumanyan. After 2 days isolating the problem I got a minimal example:

import torch
import PyNvCodec as nvc

if __name__ == '__main__':
    m = torch.eye(1, device=torch.device('cuda:0'))
    h = torch.stack([m, m, m])

    vpf = nvc.PyNvDecoder('VIDEOPATH.mp4', 0)

    a = m.inverse()
    c = h.inverse()
    b = m.inverse()

The error produced is:
RuntimeError: cusolver error: 7, when calling `cusolverDnSgetrs( handle, CUBLAS_OP_N, n, nrhs, dA, lda, ipiv, ret, ldb, info)`

If you comment out the vpf line there is no error.

I tested this script on 2 very similar machines at GCP with a V100 GPU.

lferraz commented 3 years ago

Let me add more info regarding the compilation of VPF:

CMAKE output:

-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The CUDA compiler identification is NVIDIA 11.3.58
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Searching for FFmpeg libs in /usr/lib/x86_64-linux-gnu/./lib
-- Searching for FFmpeg headers in /usr/lib/x86_64-linux-gnu/./include
-- Searching for Video Codec SDK headers in /home/luisferrazcolomina/code/Video_Codec_SDK_9.1.23/include folder
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.7m.so (found suitable version "3.7.5", minimum required is "3.5") 
-- Found PythonInterp: /home/luisferrazcolomina/code/optima-workspace/vision/.dev_env/envs/myenvX/bin/python3.7 (found version "3.7.2") 
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.7m.so
-- pybind11 v2.3.dev0
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- LTO enabled
-- Configuring done
-- Generating done
-- Build files have been written to: /home/luisferrazcolomina/code/VideoProcessingFramework/build

After the CMAKE I get: PyNvCodec.cpython-37m-x86_64-linux-gnu.so

rarzumanyan commented 3 years ago

Hi @rarzumanyan. After 2 days isolating the problem I got a minimal example:

* First create a conda env with the minimum required packages (you can try other python versions and also other pytorch versions if they are > 1.6). Note: this line installs cudatoolkit 10.2 (installing pytorch with pip does not install the cudatoolkit, but the error also appears).
  `conda create -n myenv python=3.7.2 pytorch=1.8.1 ipython`

* Run this script in e.g. `ipython`. You need to put valid paths for VPF_PATH and VIDEOPATH. I tested with several videos and it always fails, e.g. with the vp9 video we were using.

import torch
import sys
sys.path.append(VPF_PATH)
import PyNvCodec as nvc

if __name__ == '__main__':
    m = torch.eye(1, device=torch.device('cuda:0'))
    h = torch.stack([m, m, m])

    vpf = nvc.PyNvDecoder('VIDEOPATH.mp4', 0)

    a = m.inverse()
    c = h.inverse()
    b = m.inverse()

The error produced is: RuntimeError: cusolver error: 7, when calling `cusolverDnSgetrs( handle, CUBLAS_OP_N, n, nrhs, dA, lda, ipiv, ret, ldb, info)`

If you comment out the vpf line there is no error.

I tested this script on 2 very similar machines at GCP with a V100 GPU.

Hi @lferraz

Unfortunately I can't use Anaconda because their licensing has changed for corporate users some time ago. Allow me some time, I'll check up on my Ubuntu machine with vanilla Python 3.8. I'm currently merging feature branches into main, will take this up as soon as I'm done.

As far as I understand, this is a "P1" kind of issue which isn't a show stopper, and you can work around it for some time, right?

lferraz commented 3 years ago

Hi @rarzumanyan ,

I propose conda because it is the easiest way, but you can use any env. I also installed pytorch and cmake in an env created with python -m venv and I still have the same problem.

I found a possible issue: vpf is compiled with python 3.7m and I am running 3.7... I will check now whether that can be a problem.
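
One way to check this, as a sketch: compare the suffix of the built module file with the suffix the running interpreter expects for extension modules. `built_for_this_interpreter` is a hypothetical helper name; `sysconfig.get_config_var("EXT_SUFFIX")` is standard library.

```python
import sysconfig


def built_for_this_interpreter(module_filename):
    """True if the extension file name carries this interpreter's ABI suffix."""
    # Example of a suffix: ".cpython-37m-x86_64-linux-gnu.so"
    expected = sysconfig.get_config_var("EXT_SUFFIX")
    # Everything from the first dot onwards is the ABI tag plus file extension.
    _, dot, rest = module_filename.partition(".")
    return dot + rest == expected


if __name__ == "__main__":
    # File name taken from the build output earlier in this thread.
    print(built_for_this_interpreter("PyNvCodec.cpython-37m-x86_64-linux-gnu.so"))
```

This only compares names, so a match does not prove the module was linked against the same libpython.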

lferraz commented 3 years ago

Hi @rarzumanyan ,

Already tested: same issue with the m version of python. :( I do not know what else I can do on my side... If you have any idea, please let me know.

Anyway, thanks for your help :)

lferraz commented 3 years ago

Hi @rarzumanyan ,

I ran dlprof on the code and I found some differences in the libs loaded when it works with pytorch 1.6 and when it does not work with pytorch 1.8.1.

I extracted this list of diffs using Nsight.

These are the libs that are not used in the case where everything works fine:

* torch/lib/libc10.so
* target-linux-x64/libToolsInjectionCuda64.so
* lib/libcusolver.so.10.3.0.89

This one changes its version: libstdc++.so.6.0.26 vs. libstdc++.so.6.0.28

Running ldd on vpf and libtorch, I saw differences in several libs. The main one may be libcudart.so: vpf uses version 11 and pytorch uses 10.2.

Anyway, the error I am getting looks like it is related to libcusolver:

RuntimeError: cusolver error: 7, when calling `cusolverDnSgetrs( handle, CUBLAS_OP_N, n, nrhs, dA, lda, ipiv, ret, ldb, info)`
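
A rough way to see which CUDA libraries actually end up loaded in the same process, as opposed to what ldd reports per file. This is a sketch that assumes Linux (it relies on the /proc/self/maps format); `cuda_libs_in_maps` is a hypothetical helper and the name filter is only a guess at the relevant libraries.

```python
import re

# Matches paths whose file name starts with libcuda*, libcusolver* or libcublas*.
_CUDA_LIB = re.compile(r"(\S*/lib(?:cuda|cusolver|cublas)[^/\s]*\.so[^/\s]*)")


def cuda_libs_in_maps(maps_text):
    """Return the sorted, de-duplicated CUDA library paths found in maps text."""
    return sorted({m.group(1) for m in _CUDA_LIB.finditer(maps_text)})


# Typical use, after importing torch and PyNvCodec in the same process:
#     with open("/proc/self/maps") as f:
#         print("\n".join(cuda_libs_in_maps(f.read())))
```

Two different libcudart versions showing up in that list would confirm that both copies are mapped at once.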

rarzumanyan commented 3 years ago

Hi @lferraz

I’m now merging the feature branch which actually removes CUDA runtime api from PyNvCodec. Let us check again once it’s merged.

I'm planning to finish the merge tonight or tomorrow morning; I will update you in this thread.

rarzumanyan commented 3 years ago

Hi @lferraz

Please check out the latest master; it has changes merged from nvtx_support and should no longer use the CUDA runtime API in PyNvCodec.

lferraz commented 3 years ago

Hi @rarzumanyan,

unfortunately the issue is still there. I ran ldd and I can still see the libcudart and libcuda dependencies.

libcudart.so.10.2
libcuda.so.1

rarzumanyan commented 3 years ago

Hi @lferraz

I'm now investigating this issue but there's one blocker: you mention that CUDA 10.2 is required:

Note: with this line cudatoolkit 10.2 is installed

Which isn't enough to compile master ToT because of this function: https://github.com/NVIDIA/VideoProcessingFramework/blob/906b6dc43e6be99284c24e382cf5fc93196d99c7/PyNvCodec/TC/src/TasksColorCvt.cpp#L115-L116

which requires CUDA 11.0 at least. This addition is very important because it fixes BT.601 and BT.709 YUV -> RGB color conversion.

Without that, VPF can't do a proper color conversion to RGB, which is crucial for ML applications since the majority of NNs are trained on RGB datasets and inaccurate conversion to RGB badly hurts prediction accuracy.
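
To illustrate why the choice of matrix matters, here is a small sketch of limited-range YCbCr to RGB conversion using the BT.601 and BT.709 luma coefficients. It is plain Python for illustration only, not the code VPF runs; `yuv_to_rgb` and `LUMA_COEFFS` are hypothetical names.

```python
# Luma coefficients (Kr, Kb) defined by each standard.
LUMA_COEFFS = {"bt601": (0.299, 0.114), "bt709": (0.2126, 0.0722)}


def yuv_to_rgb(y, cb, cr, standard):
    """Convert one 8-bit limited-range YCbCr pixel to an 8-bit RGB tuple."""
    kr, kb = LUMA_COEFFS[standard]
    kg = 1.0 - kr - kb
    # Undo the limited-range offsets and scales (Y in 16..235, chroma in 16..240).
    yn = (y - 16) / 219.0
    pb = (cb - 128) / 224.0
    pr = (cr - 128) / 224.0
    # Invert the colour-difference definitions, then solve the luma equation for G.
    r = yn + 2.0 * (1.0 - kr) * pr
    b = yn + 2.0 * (1.0 - kb) * pb
    g = (yn - kr * r - kb * b) / kg
    # Scale to 8 bits and clip out-of-gamut values.
    return tuple(min(255, max(0, round(255 * c))) for c in (r, g, b))


if __name__ == "__main__":
    pixel = (81, 90, 240)  # a saturated red when encoded with BT.601
    print(yuv_to_rgb(*pixel, "bt601"), yuv_to_rgb(*pixel, "bt709"))
```

The same YCbCr triple decodes to visibly different RGB values under the two standards, while neutral pixels (Cb = Cr = 128) come out identical.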

Any chance you can upgrade to CUDA 11? If not I'll have to work around the color conversion thing first.

lferraz commented 3 years ago

Hi @rarzumanyan,

thanks for the update. I also tested with CUDA 11 and I got the same error :(

Luis

rarzumanyan commented 3 years ago

@lferraz

Ok, that means I go ahead with CUDA 11. Thanks for the update!

lferraz commented 3 years ago

@rarzumanyan, probably not useful to you, but I also added these 2 lines to my compilation script to avoid possible inconsistencies with python.

export PYTHON_LIB=$(python -c "import distutils.sysconfig as sysconfig; print(sysconfig.get_config_var('LIBDIR'))")
export PYTHON=$(which python)
cmake .. -DVIDEO_CODEC_SDK_DIR:PATH="$PATH_TO_SDK" -DGENERATE_PYTHON_BINDINGS:BOOL="1" -DCMAKE_INSTALL_PREFIX:PATH="$INSTALL_PREFIX" -DFFMPEG_DIR:PATH="$PATH_TO_FFMPEG" -DPYTHON_EXECUTABLE:PATH="$PYTHON" -DPYTHON_LIBRARY="$PYTHON_LIB"
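
The same two values can also be read with the standard `sysconfig` module instead of `distutils`, which has been removed from recent Python versions. A sketch only; `python_cmake_hints` is a hypothetical helper, and `sys.executable` is used in place of `which python`.

```python
import sys
import sysconfig


def python_cmake_hints():
    """Values for the PYTHON and PYTHON_LIB variables used by the script above."""
    return {
        "PYTHON": sys.executable,  # the interpreter that is actually running
        "PYTHON_LIB": sysconfig.get_config_var("LIBDIR"),  # may be None on some platforms
    }


if __name__ == "__main__":
    for name, value in python_cmake_hints().items():
        print(f"export {name}={value}")
```

Running it with the env's interpreter prints the two export lines for that env.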

rarzumanyan commented 3 years ago

@lferraz

I confirm that I can reproduce the issue on following config:

Will update you in this thread as soon as I find something.

rarzumanyan commented 3 years ago

@lferraz

Please check out issue_203 ToT. I've replaced cuCtxCreate() with cuDevicePrimaryCtxRetain() and now the following snippet no longer causes any errors:

import torch
import sys
import PyNvCodec as nvc

m = torch.eye(1, device=torch.device('cuda:0'))
h = torch.stack([m, m, m])
nvdec = nvc.PyNvDecoder('/home/roman/Videos/bbb_sunflower_1080p_30fps_normal.mp4', 0)

a = m.inverse()
c = h.inverse()
b = m.inverse()

quit()

I've also tested SampleDecode.py; it produces valid NV12 output, so I assume that VPF functionality isn't broken.

P. S. I didn't investigate how using the primary CUDA context influences performance; the issue_203 branch is kind of a hotfix.
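
For context on the change, a sketch using ctypes against the CUDA driver library, assuming Linux with libcuda.so.1 and at least one GPU (it returns None otherwise). `compare_contexts` is a hypothetical helper, not VPF code. It shows that cuDevicePrimaryCtxRetain() hands back the device's single primary context, which the CUDA runtime API also uses, while cuCtxCreate() builds a separate one.

```python
import ctypes


def compare_contexts(ordinal=0):
    """Return (primary_shared, created_distinct), or None without a usable driver/GPU."""
    try:
        cuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None
    dev = ctypes.c_int()
    if cuda.cuInit(0) != 0 or cuda.cuDeviceGet(ctypes.byref(dev), ordinal) != 0:
        return None

    first, second, own = ctypes.c_void_p(), ctypes.c_void_p(), ctypes.c_void_p()
    # Retaining the primary context twice yields the same handle both times.
    if cuda.cuDevicePrimaryCtxRetain(ctypes.byref(first), dev) != 0:
        return None
    if cuda.cuDevicePrimaryCtxRetain(ctypes.byref(second), dev) != 0:
        cuda.cuDevicePrimaryCtxRelease(dev)
        return None
    primary_shared = first.value == second.value

    created_distinct = None
    # cuCtxCreate builds a brand-new context and makes it current on this thread.
    if cuda.cuCtxCreate_v2(ctypes.byref(own), 0, dev) == 0:
        created_distinct = own.value != first.value
        cuda.cuCtxDestroy_v2(own)

    # Balance the two retains above.
    cuda.cuDevicePrimaryCtxRelease(dev)
    cuda.cuDevicePrimaryCtxRelease(dev)
    return primary_shared, created_distinct


if __name__ == "__main__":
    print(compare_contexts())
```

Whether the separate context is what upset cusolver here is an assumption; the sketch only demonstrates the difference between the two calls.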

lferraz commented 3 years ago

@rarzumanyan looks like it works!!! Tomorrow I will run a deeper validation. About the performance I have no idea... I do not know what the difference is between cuCtxCreate() and cuDevicePrimaryCtxRetain().

lferraz commented 3 years ago

@rarzumanyan, I tested it on my pipeline and it looks like everything works fine. I qualitatively compared the speed of VPF and it is similar (there are small speed differences, but I think that is because I tested on two different machines that are identical except for the disk: one uses an SSD and the other an HDD).

rarzumanyan commented 3 years ago

Thanks for the update @lferraz

Will merge issue_203 to master after some additional investigation on my side. Closing as solved.