artyom-beilis / pytorch_dlprim

DLPrimitives/OpenCL out of tree backend for pytorch
http://blog.dlprimitives.org/
MIT License

RuntimeError when running test.py #5

Closed cdasl closed 2 years ago

cdasl commented 2 years ago

After installing pytorch_dlprim, running python test.py fails with this error:

Accessing device #1:Tesla V100S-PCIE-32GB on NVIDIA CUDA
Traceback (most recent call last):
  File "test.py", line 16, in <module>
    t1=torch.ones((20,10),requires_grad=True,device=dev)
RuntimeError: clSetKernelArg

artyom-beilis commented 2 years ago

Do you have two GPUs?

test.py is only a trivial internal test, not a real one (there are better tests under the tests directory), but it is still strange that it fails.

What is the output of clinfo?

What is the output of:

 OPENCL_DEBUG_MODE=1 python test.py

And

 OPENCL_DEBUG_MODE=2 python test.py

artyom-beilis commented 2 years ago

Also, what is your operating system, and which versions of dlprimitives and pytorch_dlprim are you using? You can run git rev-parse HEAD in each repository to check.
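
The information requested above can be gathered in one pass. The following is a minimal Python sketch, not part of either project: the helper names git_revision, debug_env and run_test are made up for illustration, and it assumes git is on the PATH and that the script to run is the test.py from the report.

```python
import os
import subprocess
import sys


def git_revision(repo_dir):
    """Return the commit hash of repo_dir (the output of git rev-parse HEAD)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def debug_env(level, base_env=None):
    """Copy the environment and set OPENCL_DEBUG_MODE to the given level."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENCL_DEBUG_MODE"] = str(level)
    return env


def run_test(script, level):
    """Run script with the given debug level; return (exit code, combined output)."""
    proc = subprocess.run(
        [sys.executable, script],
        env=debug_env(level),
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    return proc.returncode, proc.stdout
```

Calling run_test("test.py", 1) and run_test("test.py", 2) from the directory that contains test.py captures both debug runs, and git_revision can be pointed at each checkout (the checkout paths are whatever they are on the reporter's machine).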

cdasl commented 2 years ago

The clinfo output is:

Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 1.2 CUDA 10.2.95
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics
  Platform Extensions function suffix             NV

  Platform Name                                   NVIDIA CUDA
Number of devices                                 2
  Device Name                                     Tesla V100S-PCIE-32GB
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  440.33.01
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 3b:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               80
  Max clock frequency                             1597MHz
  Compute Capability (NV)                         7.0
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              34089730048 (31.75GiB)
  Error Correction support                        Yes
  Max memory allocation                           8522432512 (7.937GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        2621440 (2.5MiB)
  Global Memory cache line size                   128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            268435456 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             32768x32768 pixels
    Max 3D image size                             16384x16384x16384 pixels
    Max number of read image args                 256
    Max number of write image args                32
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 No
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  7
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics

  Device Name                                     Tesla V100S-PCIE-32GB
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  440.33.01
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, d8:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               80
  Max clock frequency                             1597MHz
  Compute Capability (NV)                         7.0
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              34089730048 (31.75GiB)
  Error Correction support                        Yes
  Max memory allocation                           8522432512 (7.937GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        2621440 (2.5MiB)
  Global Memory cache line size                   128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            268435456 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             32768x32768 pixels
    Max 3D image size                             16384x16384x16384 pixels
    Max number of read image args                 256
    Max number of write image args                32
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 No
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  7
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [NV]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform
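
A dump this long is easier to check when reduced to one line per device. The sketch below is a hypothetical helper (list_devices is not a clinfo or dlprimitives function) that pulls the device names and PCI topology out of clinfo text in the format shown above. It assumes that each "Number of devices" line starts a new platform and that devices are numbered in the order clinfo prints them; whether that order matches the indices the backend uses is an assumption.

```python
def list_devices(clinfo_text):
    """Return one dict per device with a platform:device id, name and topology."""
    devices = []
    platform_index = -1
    device_index = 0
    for line in clinfo_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Number of devices"):
            # A new per-platform device list begins here.
            platform_index += 1
            device_index = 0
        elif stripped.startswith("Device Name") and platform_index >= 0:
            name = stripped[len("Device Name"):].strip()
            devices.append({
                "id": f"{platform_index}:{device_index}",
                "name": name,
                "topology": None,
            })
            device_index += 1
        elif stripped.startswith("Device Topology") and devices:
            # The line looks like "Device Topology (NV)   PCI-E, 3b:00.0".
            devices[-1]["topology"] = stripped.split(")", 1)[1].strip()
    return devices
```

For the output above this yields two entries with the same device name and different PCI addresses (3b:00.0 and d8:00.0), that is, two separate GPUs on one platform.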

cdasl commented 2 years ago

The output of OPENCL_DEBUG_MODE=1 python test.py is:

Accessing device #1:Tesla V100S-PCIE-32GB on NVIDIA CUDA
Exception from at::Tensor& ptdlprim::fill_(at::Tensor&, const c10::Scalar&)
Traceback (most recent call last):
  File "test.py", line 16, in <module>
    t1=torch.ones((20,10),requires_grad=True,device=dev)
RuntimeError: clSetKernelArg

The output of OPENCL_DEBUG_MODE=2 python test.py is:

in:  at::Tensor ptdlprim::allocate_empty(c10::IntArrayRef, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>)
Accessing device #1:Tesla V100S-PCIE-32GB on NVIDIA CUDA
in:  at::Tensor& ptdlprim::fill_(at::Tensor&, const c10::Scalar&)
Traceback (most recent call last):
  File "test.py", line 16, in <module>
    t1=torch.ones((20,10),requires_grad=True,device=dev)
RuntimeError: clSetKernelArg
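
The level 2 trace shows which backend operator was entered last before the exception. A small sketch for extracting that from a captured log follows. It relies only on the line shapes visible in this thread ("in:" prefixes and a final "RuntimeError:" line); summarize_failure is a hypothetical name, and other versions of the backend may print the trace differently.

```python
def summarize_failure(log_text):
    """Return (last operator entered, error message) from a debug trace."""
    last_op = None
    error = None
    for line in log_text.splitlines():
        if line.startswith("in:"):
            # Each "in:" line names an operator the backend has just entered.
            last_op = line[len("in:"):].strip()
        elif line.startswith("RuntimeError:"):
            error = line[len("RuntimeError:"):].strip()
    return last_op, error
```

Applied to the trace above, it returns the fill_ operator together with clSetKernelArg, which matches what the level 1 output reports.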

cdasl commented 2 years ago

OS is Ubuntu 18.04.1

pytorch_dlprim: 032a6933fed7e9e0cdcf0dbadf9c7f246804fa3b
dlprimitives: 2867462c531d80d6404e616c7290b80c086f4620

artyom-beilis commented 2 years ago

Can you please run these in pytorch_dlprim:

python tests/test_op.py --device=opencl:0
python tests/validate_network.py --model mnist_cnn --device=opencl:0

And these in dlprimitives/build directory:

./test_net 0:0 ../tests/test_net.json ../tests/test_weights.json 
./test_context 0:0

And

./test_context 0:1
./test_net 0:1 ../tests/test_net.json ../tests/test_weights.json

Note the different devices: 0:0 in the first pair of commands, 0:1 in the second.
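
The same dlprimitives checks can be repeated for every device with a short loop. This is a sketch only: device_checks and run_checks are hypothetical helpers, the commands are exactly the ones listed above, and treating a non-zero exit code as a failure is an assumption about how the test binaries report errors.

```python
import subprocess


def device_checks(device_ids):
    """Build the test_context and test_net command lines for each device id."""
    commands = []
    for dev in device_ids:
        commands.append(["./test_context", dev])
        commands.append(
            ["./test_net", dev, "../tests/test_net.json", "../tests/test_weights.json"]
        )
    return commands


def run_checks(commands, build_dir):
    """Run each command inside build_dir; return (command line, exit code) pairs."""
    results = []
    for cmd in commands:
        proc = subprocess.run(
            cmd, cwd=build_dir,
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
        )
        results.append((" ".join(cmd), proc.returncode))
    return results
```

Calling run_checks(device_checks(["0:0", "0:1"]), build_dir) with build_dir set to the dlprimitives build directory runs all four commands and shows at a glance whether only one of the two devices misbehaves.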

cdasl commented 2 years ago

python tests/test_op.py --device=opencl:0

/home/xxx/anaconda3/envs/pytorch_opencl/lib/python3.7/site-packages/torchvision/io/image.py:11: UserWarning: Failed to load image Python extension: libc10_cuda.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Mean 1d
Accessing device #0:Tesla V100S-PCIE-32GB on NVIDIA CUDA
Traceback (most recent call last):
  File "tests/test_op.py", line 165, in <module>
    test_all(r.device)
  File "tests/test_op.py", line 93, in test_all
    test_fwd_bwd([([2,3,4],-1)],lambda x:torch.mean(x,dim=0,keepdim=True),device)
  File "tests/test_op.py", line 49, in test_fwd_bwd
    x_dev = x_cpu.to(device)
RuntimeError: clEnqueueWriteBuffer

python tests/validate_network.py --model mnist_cnn --device=opencl:0

/home/xxx/anaconda3/envs/pytorch_opencl/lib/python3.7/site-packages/torchvision/io/image.py:11: UserWarning: Failed to load image Python extension: libc10_cuda.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Testing  mnist_cnn
Accessing device #0:Tesla V100S-PCIE-32GB on NVIDIA CUDA
Traceback (most recent call last):
  File "tests/validate_network.py", line 272, in <module>
    main(r)
  File "tests/validate_network.py", line 220, in main
    train_on_images(m,batch,args.device,args.eval,iter_size = args.iter_size,opt_steps = args.opt,fwd=args.fwd)
  File "tests/validate_network.py", line 95, in train_on_images
    data_dev = data.to(device)
RuntimeError: clEnqueueWriteBuffer

cdasl commented 2 years ago

./test_net 0:0 ../tests/test_net.json ../tests/test_weights.json

Testing for Tesla V100S-PCIE-32GB on NVIDIA CUDA
Checking Diffs
Testing bn2a.2
Testing bn2a.3
Testing cnv1.0
Testing cnv1.1
Testing cnv2.0
Testing cnv2.1
Testing cnv2a.0
Testing cnv2b.0
Testing cnv2b.1
Testing fc.0
Testing fc.1
Checking Param Updates
Testing bn2a.0
Testing bn2a.1
Testing bn2a.2
Testing bn2a.3
Testing cnv1.0
Testing cnv1.1
Testing cnv2.0
Testing cnv2.1
Testing cnv2a.0
Testing cnv2b.0
Testing cnv2b.1
Testing fc.0
Testing fc.1

./test_context 0:0

Basic context
Tesla V100S-PCIE-32GB on NVIDIA CUDA
Ok

./test_context 0:1

Basic context
Tesla V100S-PCIE-32GB on NVIDIA CUDA
Ok

./test_net 0:1 ../tests/test_net.json ../tests/test_weights.json

Testing for Tesla V100S-PCIE-32GB on NVIDIA CUDA
Checking Diffs
Testing bn2a.2
Testing bn2a.3
Testing cnv1.0
Testing cnv1.1
Testing cnv2.0
Testing cnv2.1
Testing cnv2a.0
Testing cnv2b.0
Testing cnv2b.1
Testing fc.0
Testing fc.1
Checking Param Updates
Testing bn2a.0
Testing bn2a.1
Testing bn2a.2
Testing bn2a.3
Testing cnv1.0
Testing cnv1.1
Testing cnv2.0
Testing cnv2.1
Testing cnv2a.0
Testing cnv2b.0
Testing cnv2b.1
Testing fc.0
Testing fc.1

cdasl commented 2 years ago

I found the problem. After setting USE_CL_HPP=OFF, everything runs normally. Thanks for your project, it is really awesome!!
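
For anyone reproducing the fix, here is a sketch of how such a setting is usually passed at configure time. It assumes USE_CL_HPP is a CMake cache option given as -DUSE_CL_HPP=OFF. The thread does not say which of the two builds (dlprimitives or pytorch_dlprim) it was set on, and cmake_configure_command is a hypothetical helper.

```python
def cmake_configure_command(source_dir, options):
    """Build a cmake configure command line with -DNAME=VALUE cache entries."""
    cmd = ["cmake"]
    for name, value in options.items():
        if isinstance(value, bool):
            # CMake boolean options are conventionally written ON or OFF.
            value = "ON" if value else "OFF"
        cmd.append(f"-D{name}={value}")
    cmd.append(source_dir)
    return cmd
```

With options {"USE_CL_HPP": False} and a source directory of "..", this produces the arguments cmake -DUSE_CL_HPP=OFF .., to be run from a build directory before rebuilding.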

artyom-beilis commented 2 years ago

Thanks, interesting. I expected it to work with both CL/cl.hpp and CL/cl2.hpp. I'll check it.