artyom-beilis / pytorch_dlprim

DLPrimitives/OpenCL out of tree backend for pytorch
http://blog.dlprimitives.org/
MIT License
284 stars 17 forks

When I test mnist.py after completing make, I get an error #58

Open ParadosBoy opened 8 months ago

ParadosBoy commented 8 months ago
Using device: ocl:0
Accessing device #0:PowerVR B-Series BXM-4-64 on PowerVR
Traceback (most recent call last):
  File "/home/sipeed/Desktop/pytorch_dlprim/mnist.py", line 163, in <module>
    main()
  File "/home/sipeed/Desktop/pytorch_dlprim/mnist.py", line 154, in main
    train(args, model, device, train_loader, optimizer, epoch)
  File "/home/sipeed/Desktop/pytorch_dlprim/mnist.py", line 56, in train
    loss.backward()
  File "/home/sipeed/Desktop/lzm/pytorch/lzm/lib/python3.11/site-packages/torch/_tensor.py", line 524, in backward
    torch.autograd.backward(
  File "/home/sipeed/Desktop/lzm/pytorch/lzm/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/sipeed/Desktop/lzm/pytorch/lzm/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Please register PrivateUse1HooksInterface by `RegisterPrivateUse1HooksInterface` first.

I'm not sure whether this is a GPU problem or a PyTorch version compatibility problem. This is my configuration information:

CMake Warning at /home/sipeed/Desktop/lzm/pytorch/lzm/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
  static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
  /home/sipeed/Desktop/lzm/pytorch/lzm/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
  CMakeLists.txt:4 (find_package)

=== Status ===
  OpenCL: include /usr/include
          lib     /usr/lib/riscv64-linux-gnu/libOpenCL.so
  Python: /home/sipeed/Desktop/lzm/pytorch/lzm/bin/python3
  BLAS: None
  HDF5: None
  Sqlite3: disabled
  Protobuf (onnx): disabled
  Python dlprim: disabled
-- Configuring done
-- Generating done
-- Build files have been written to: /home/sipeed/Desktop/pytorch_dlprim/build

Did I miss some required dependencies?

artyom-beilis commented 8 months ago

Ok.

Two points. First, let's check that dlprimitives itself works, since it hasn't been tested on this GPU type. Can you please build dlprimitives and run its tests and benchmarks?

Also please post clinfo output.

Once we're past that, we'll go back to PyTorch to figure out what the issue is. Which version of PyTorch are you using?
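One way to answer the version question is a short report script. The following is only a sketch: the helper names `package_version` and `environment_report` are made up for illustration, and it assumes PyTorch was installed under the distribution name `torch`. It reads installed package metadata through the standard library, so it does not need to import PyTorch itself.

```python
import platform
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a distribution, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages=("torch",)) -> dict:
    """Collect the interpreter version, CPU architecture and package versions."""
    report = {
        "python": platform.python_version(),
        "machine": platform.machine(),
    }
    for name in packages:
        report[name] = package_version(name)
    return report


if __name__ == "__main__":
    # Print one "key: value" line per entry so the result can be pasted into the issue.
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Run it with the same interpreter that runs mnist.py (here, the one inside the virtual environment), otherwise the reported versions may belong to a different installation.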

ParadosBoy commented 8 months ago
sipeed@lpi4a:~$ clinfo
Number of platforms                               1
  Platform Name                                   PowerVR
  Platform Vendor                                 Imagination Technologies
  Platform Version                                OpenCL 3.0
  Platform Profile                                EMBEDDED_PROFILE
  Platform Extensions                             cl_khr_icd cl_khr_fp16 cl_img_spirv cles_khr_int64 cl_img_yuv_image cl_khr_device_uuid cl_khr_depth_images cl_khr_mipmap_image cl_khr_priority_hints cl_img_generate_mipmap cl_khr_3d_image_writes cl_img_cached_allocations cl_khr_mipmap_image_writes cl_khr_create_command_queue cl_khr_suggested_local_work_size cl_img_mem_properties cl_img_mem_properties_relax_alloc_requirements cl_khr_extended_versioning cl_khr_image2d_from_buffer cl_khr_byte_addressable_store cl_khr_local_int32_base_atomics cl_khr_global_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_global_int32_extended_atomics cl_khr_spir cl_khr_il_program cl_khr_egl_image cl_arm_import_memory cl_arm_import_memory_dma_buf cl_img_protected_content cl_img_semaphore cl_img_external_semaphore cl_img_external_semaphore_sync_fd cl_khr_semaphore cl_khr_external_semaphore cl_khr_external_semaphore_sync_fd
  Platform Extensions with Version                cl_khr_icd                                                       0x400000 (1.0.0)
                                                  cl_khr_fp16                                                      0x400000 (1.0.0)
                                                  cl_img_spirv                                                     0x400000 (1.0.0)
                                                  cles_khr_int64                                                   0x400000 (1.0.0)
                                                  cl_img_yuv_image                                                 0x400000 (1.0.0)
                                                  cl_khr_device_uuid                                               0x400000 (1.0.0)
                                                  cl_khr_depth_images                                              0x400000 (1.0.0)
                                                  cl_khr_mipmap_image                                              0x400000 (1.0.0)
                                                  cl_khr_priority_hints                                            0x400000 (1.0.0)
                                                  cl_img_generate_mipmap                                           0x400000 (1.0.0)
                                                  cl_khr_3d_image_writes                                           0x400000 (1.0.0)
                                                  cl_img_cached_allocations                                        0x400000 (1.0.0)
                                                  cl_khr_mipmap_image_writes                                       0x400000 (1.0.0)
                                                  cl_khr_create_command_queue                                      0x400000 (1.0.0)
                                                  cl_khr_suggested_local_work_size                                 0x400000 (1.0.0)
                                                  cl_img_mem_properties                                            0x400000 (1.0.0)
                                                  cl_img_mem_properties_relax_alloc_requirements                   0x400000 (1.0.0)
                                                  cl_khr_extended_versioning                                       0x400000 (1.0.0)
                                                  cl_khr_image2d_from_buffer                                       0x400000 (1.0.0)
                                                  cl_khr_byte_addressable_store                                    0x400000 (1.0.0)
                                                  cl_khr_local_int32_base_atomics                                  0x400000 (1.0.0)
                                                  cl_khr_global_int32_base_atomics                                 0x400000 (1.0.0)
                                                  cl_khr_local_int32_extended_atomics                              0x400000 (1.0.0)
                                                  cl_khr_global_int32_extended_atomics                             0x400000 (1.0.0)
                                                  cl_khr_spir                                                      0x400000 (1.0.0)
                                                  cl_khr_il_program                                                0x400000 (1.0.0)
                                                  cl_khr_egl_image                                                 0x400000 (1.0.0)
                                                  cl_arm_import_memory                                             0x400000 (1.0.0)
                                                  cl_arm_import_memory_dma_buf                                     0x400000 (1.0.0)
                                                  cl_img_protected_content                                         0x400000 (1.0.0)
                                                  cl_img_semaphore                                                 0x400000 (1.0.0)
                                                  cl_img_external_semaphore                                        0x400000 (1.0.0)
                                                  cl_img_external_semaphore_sync_fd                                0x400000 (1.0.0)
                                                  cl_khr_semaphore                                                 0x400000 (1.0.0)
                                                  cl_khr_external_semaphore                                        0x400000 (1.0.0)
                                                  cl_khr_external_semaphore_sync_fd                                0x400000 (1.0.0)
  Platform Numeric Version                        0xc00000 (3.0.0)
  Platform Extensions function suffix             IMG
  Platform Host timer resolution                  0ns
  Platform Semaphore types                        Binary
  Platform External semaphore import types        <gatherPlatformInfo:12: get CL_PLATFORM_SEMAPHORE_IMPORT_HANDLE_TYPES_KHR size : error -30>
  Platform External semaphore export types        <gatherPlatformInfo:13: get CL_PLATFORM_SEMAPHORE_EXPORT_HANDLE_TYPES_KHR size : error -30>

  Platform Name                                   PowerVR
Number of devices                                 1
  Device Name                                     PowerVR B-Series BXM-4-64
  Device Vendor                                   Imagination Technologies
  Device Vendor ID                                0x1010
  Device Version                                  OpenCL 3.0
  Device UUID                                     33362035-3220-3130-3420-313832000000
  Driver UUID                                     36323130-3836-3600-0000-000000000000
  Valid Device LUID                               No
  Device LUID                                     0000-000000000000
  Device Node Mask                                0
  Device Numeric Version                          0xc00000 (3.0.0)
  Driver Version                                  1.17@6210866
  Device OpenCL C Version                         OpenCL C 1.2
  Device OpenCL C Numeric Version                 0x402000 (1.2.0)
  Device OpenCL C all versions                    OpenCL C                                                         0x400000 (1.0.0)
                                                  OpenCL C                                                         0x401000 (1.1.0)
                                                  OpenCL C                                                         0x402000 (1.2.0)
                                                  OpenCL C                                                         0xc00000 (3.0.0)
  Device OpenCL C features                        __opencl_c_int64                                                 0x400000 (1.0.0)
                                                  __opencl_c_pipes                                                 0xc00000 (3.0.0)
                                                  __opencl_c_images                                                0x400000 (1.0.0)
                                                  __opencl_c_subgroups                                             0xc00000 (3.0.0)
                                                  __opencl_c_3d_image_writes                                       0x400000 (1.0.0)
                                                  __opencl_c_read_write_images                                     0x400000 (1.0.0)
                                                  __opencl_c_generic_address_space                                 0xc00000 (3.0.0)
                                                  __opencl_c_program_scope_global_variables                        0xc00000 (3.0.0)
                                                  __opencl_c_work_group_collective_functions                       0xc00000 (3.0.0)
  Latest conformance test passed                  v2021-10-04-00
  Device Type                                     GPU
  Device Profile                                  EMBEDDED_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               1
  Max clock frequency                             792MHz
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             512x512x512
  Max work group size                             512
  Preferred work group size multiple (device)     32
  Preferred work group size multiple (kernel)     32
  Max sub-groups per work group                   512
  Preferred / native vector sizes
    char                                                16 / 1
    short                                                8 / 1
    int                                                  4 / 1
    long                                                 2 / 1
    half                                                 0 / 0        (cl_khr_fp16)
    float                                                4 / 1
    double                                               0 / 0        (n/a)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    64, Little-Endian
  Semaphore types                                 <printDeviceInfo:105: get number of CL_DEVICE_SEMAPHORE_TYPES_KHR : error -30>
  External semaphore import types                 <printDeviceInfo:106: get number of CL_DEVICE_SEMAPHORE_IMPORT_HANDLE_TYPES_KHR : error -30>
  External semaphore export types                 <printDeviceInfo:107: get number of CL_DEVICE_SEMAPHORE_EXPORT_HANDLE_TYPES_KHR : error -30>
  Global memory size                              16503758848 (15.37GiB)
  Error Correction support                        No
  Max memory allocation                           4125939712 (3.843GiB)
  Unified memory for Host and Device              Yes
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   No
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics
    SVM                                           0 bytes
    Global                                        0 bytes
    Local                                         0 bytes
  Atomic memory capabilities                      relaxed, work-group scope
  Atomic fence capabilities                       relaxed, acquire/release, work-group scope
  Max size for global variable                    16384 (16KiB)
  Preferred total size of global vars             0
  Global Memory cache type                        Read/Write
  Global Memory cache size                        16384 (16KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            16384 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   64 bytes
    Pitch alignment for 2D image buffers          64 pixels
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             16384x16384x2048 pixels
    Max number of read image args                 8
    Max number of write image args                64
    Max number of read/write image args           64
  Pipe support                                    Yes
  Max number of pipe args                         16
  Max active pipe reservations                    1
  Max pipe packet size                            1024
  Local memory type                               Local
  Local memory size                               4096 (4KiB)
  Max number of constant args                     256
  Max constant buffer size                        4125939712 (3.843GiB)
  Generic address space support                   Yes
  Max size of kernel argument                     1024
  Queue properties (on host)
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Device enqueue capabilities                     (n/a)
  Queue properties (on device)
    Out-of-order execution                        No
    Profiling                                     No
    Preferred size                                0
    Max size                                      0
  Max queues on device                            0
  Max events on device                            0
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1000ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
    Non-uniform work-groups                       Yes
    Work-group collective functions               Yes
    Sub-group independent forward progress        No
    IL version                                    SPIR-V_1.2
    ILs with version                              SPIR-V                                                           0x402000 (1.2.0)
    SPIR versions                                 1.2
  printf() buffer size                            65536 (64KiB)
  Built-in kernels                                (n/a)
  Built-in kernels with version                   (n/a)
  Device Extensions                               cl_khr_icd cl_khr_fp16 cl_img_spirv cles_khr_int64 cl_img_yuv_image cl_khr_device_uuid cl_khr_depth_images cl_khr_mipmap_image cl_khr_priority_hints cl_img_generate_mipmap cl_khr_3d_image_writes cl_img_cached_allocations cl_khr_mipmap_image_writes cl_khr_create_command_queue cl_khr_suggested_local_work_size cl_img_mem_properties cl_img_mem_properties_relax_alloc_requirements cl_khr_extended_versioning cl_khr_image2d_from_buffer cl_khr_byte_addressable_store cl_khr_local_int32_base_atomics cl_khr_global_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_global_int32_extended_atomics cl_khr_spir cl_khr_il_program cl_khr_egl_image cl_arm_import_memory cl_arm_import_memory_dma_buf cl_img_protected_content cl_img_semaphore cl_img_external_semaphore cl_img_external_semaphore_sync_fd cl_khr_semaphore cl_khr_external_semaphore cl_khr_external_semaphore_sync_fd
  Device Extensions with Version                  cl_khr_icd                                                       0x400000 (1.0.0)
                                                  cl_khr_fp16                                                      0x400000 (1.0.0)
                                                  cl_img_spirv                                                     0x400000 (1.0.0)
                                                  cles_khr_int64                                                   0x400000 (1.0.0)
                                                  cl_img_yuv_image                                                 0x400000 (1.0.0)
                                                  cl_khr_device_uuid                                               0x400000 (1.0.0)
                                                  cl_khr_depth_images                                              0x400000 (1.0.0)
                                                  cl_khr_mipmap_image                                              0x400000 (1.0.0)
                                                  cl_khr_priority_hints                                            0x400000 (1.0.0)
                                                  cl_img_generate_mipmap                                           0x400000 (1.0.0)
                                                  cl_khr_3d_image_writes                                           0x400000 (1.0.0)
                                                  cl_img_cached_allocations                                        0x400000 (1.0.0)
                                                  cl_khr_mipmap_image_writes                                       0x400000 (1.0.0)
                                                  cl_khr_create_command_queue                                      0x400000 (1.0.0)
                                                  cl_khr_suggested_local_work_size                                 0x400000 (1.0.0)
                                                  cl_img_mem_properties                                            0x400000 (1.0.0)
                                                  cl_img_mem_properties_relax_alloc_requirements                   0x400000 (1.0.0)
                                                  cl_khr_extended_versioning                                       0x400000 (1.0.0)
                                                  cl_khr_image2d_from_buffer                                       0x400000 (1.0.0)
                                                  cl_khr_byte_addressable_store                                    0x400000 (1.0.0)
                                                  cl_khr_local_int32_base_atomics                                  0x400000 (1.0.0)
                                                  cl_khr_global_int32_base_atomics                                 0x400000 (1.0.0)
                                                  cl_khr_local_int32_extended_atomics                              0x400000 (1.0.0)
                                                  cl_khr_global_int32_extended_atomics                             0x400000 (1.0.0)
                                                  cl_khr_spir                                                      0x400000 (1.0.0)
                                                  cl_khr_il_program                                                0x400000 (1.0.0)
                                                  cl_khr_egl_image                                                 0x400000 (1.0.0)
                                                  cl_arm_import_memory                                             0x400000 (1.0.0)
                                                  cl_arm_import_memory_dma_buf                                     0x400000 (1.0.0)
                                                  cl_img_protected_content                                         0x400000 (1.0.0)
                                                  cl_img_semaphore                                                 0x400000 (1.0.0)
                                                  cl_img_external_semaphore                                        0x400000 (1.0.0)
                                                  cl_img_external_semaphore_sync_fd                                0x400000 (1.0.0)
                                                  cl_khr_semaphore                                                 0x400000 (1.0.0)
                                                  cl_khr_external_semaphore                                        0x400000 (1.0.0)
                                                  cl_khr_external_semaphore_sync_fd                                0x400000 (1.0.0)

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  PowerVR
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [IMG]
  clCreateContext(NULL, ...) [default]            Success [IMG]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 PowerVR
    Device Name                                   PowerVR B-Series BXM-4-64
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 PowerVR
    Device Name                                   PowerVR B-Series BXM-4-64
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 PowerVR
    Device Name                                   PowerVR B-Series BXM-4-64

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.3.1
  ICD loader Profile                              OpenCL 3.0

The examples in dlprimitives don't seem to work either:

Traceback (most recent call last):
  File "/home/sipeed/Desktop/dlprimitives/examples/python/mnist/train_mnist.py", line 8, in <module>
    import dlprim as dp
ModuleNotFoundError: No module named 'dlprim'

Maybe my GPU is not suitable for this framework.

artyom-beilis commented 8 months ago

Don't run the Python example.

Run dlprim_flops instead, something like:

./dlprim_flops 0:0 1.0

Then run make test to see whether the tests pass (it may take some time).
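If it helps to capture the benchmark result in a compact form, the run can be wrapped in a small script. This is a rough sketch, not part of dlprimitives: `parse_peaks` and `run_flops` are hypothetical helper names, the binary path and the two arguments are copied verbatim from the command above, and the "Peak ..." line format is assumed from the summary printed by dlprim_flops in the output posted in this thread.

```python
import re
import subprocess


def parse_peaks(output: str) -> dict:
    """Extract the 'Peak ...' summary values from dlprim_flops text output."""
    peaks = {}
    for line in output.splitlines():
        # e.g. "Peak GFlops for float 46.4071"
        match = re.match(r"Peak GFlops for (\w+)\s+([\d.]+)", line)
        if match:
            peaks[f"gflops_{match.group(1)}"] = float(match.group(2))
            continue
        # e.g. "Peak memory 8.03852 GB/s"
        match = re.match(r"Peak memory\s+([\d.]+)\s+GB/s", line)
        if match:
            peaks["memory_gb_s"] = float(match.group(1))
    return peaks


def run_flops(binary="./dlprim_flops", args=("0:0", "1.0")) -> dict:
    """Run the benchmark binary and return the parsed peak values.

    Raises subprocess.CalledProcessError if the binary exits with an error.
    """
    result = subprocess.run(
        [binary, *args], capture_output=True, text=True, check=True
    )
    return parse_peaks(result.stdout)
```

Calling `run_flops()` from the dlprimitives build directory would return a dictionary with keys such as `gflops_float` and `memory_gb_s`; the full text output is still worth posting, because the GEMM and convolution sections are not parsed here.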

ParadosBoy commented 8 months ago
sipeed@lpi4a:~/Desktop/dlprimitives/build$ ./dlprim_flops 0:0 1.0
Testing on PowerVR B-Series BXM-4-64 on PowerVR
Testing memory speed
- Vector size 1
-- Warming
-- Running   2.78029 GB/s
- Vector size 2
-- Warming
-- Running   4.53877 GB/s
- Vector size 4
-- Warming
-- Running   8.03852 GB/s
- Vector size 8
-- Warming
-- Running   4.89552 GB/s
- Vector size 16
-- Warming
-- Running   5.01402 GB/s
Testing flops float
- Vector size 1
-- Warming
-- Running   46.4071 GFlops
- Vector size 2
-- Warming
-- Running   42.1022 GFlops
- Vector size 4
-- Warming
-- Running   37.9592 GFlops
- Vector size 8
-- Warming
-- Running   36.6869 GFlops
- Vector size 16
-- Warming
-- Running   35.4454 GFlops
Testing flops half
- Vector size 1
-- Warming
-- Running   46.3838 GFlops
- Vector size 2
-- Warming
-- Running   44.5784 GFlops
- Vector size 4
-- Warming
-- Running   37.1824 GFlops
- Vector size 8
-- Warming
-- Running   37.9947 GFlops
- Vector size 16
-- Warming
-- Running   35.8126 GFlops
Summray for PowerVR B-Series BXM-4-64 on PowerVR
Peak GFlops for float 46.4071
Peak GFlops for half 46.3838
Peak memory 8.03852 GB/s
GEMM
  NN  0:  512,  512,  512        9.3 GFlops (19.96%)      0.1 GB/s ( 2.17%) limited by gflops 19.96%
  NN  1: 1024, 1024, 1024        9.4 GFlops (20.18%)      0.1 GB/s ( 1.09%) limited by gflops 20.18%
  NN  2: 1025, 1025, 1025        8.4 GFlops (18.07%)      0.0 GB/s ( 0.98%) limited by gflops 18.07%
  NN  3: 2048, 2048, 2048        8.8 GFlops (18.96%)      0.0 GB/s ( 0.51%) limited by gflops 18.96%
  NN  4: 2049, 2049, 2049        8.4 GFlops (18.02%)      0.0 GB/s ( 0.49%) limited by gflops 18.02%
  NN  5:   64, 2048,   64        4.9 GFlops (10.59%)      0.3 GB/s ( 6.27%) limited by gflops 10.59%
  NN  6: 2048,   64, 2048        9.7 GFlops (20.91%)      0.3 GB/s ( 6.43%) limited by gflops 20.91%
  NN  7: 2048, 2048,   64        6.6 GFlops (14.22%)      0.2 GB/s ( 4.40%) limited by gflops 14.22%
  NN  8: 2048,   64,   64        5.2 GFlops (11.13%)      0.3 GB/s ( 6.59%) limited by gflops 11.13%
  NN  9:   64, 2048, 2048        8.7 GFlops (18.80%)      0.3 GB/s ( 5.78%) limited by gflops 18.80%
  NN 10:   64,   64, 2048        9.4 GFlops (20.27%)      0.6 GB/s (11.91%) limited by gflops 20.27%
  NT  0:  512,  512,  512       11.9 GFlops (25.59%)      0.1 GB/s ( 2.78%) limited by gflops 25.59%
  NT  1: 1024, 1024, 1024       12.5 GFlops (26.90%)      0.1 GB/s ( 1.46%) limited by gflops 26.90%
  NT  2: 1025, 1025, 1025       10.4 GFlops (22.44%)      0.1 GB/s ( 1.22%) limited by gflops 22.44%
  NT  3: 2048, 2048, 2048       12.0 GFlops (25.85%)      0.0 GB/s ( 0.70%) limited by gflops 25.85%
  NT  4: 2049, 2049, 2049       10.7 GFlops (23.08%)      0.0 GB/s ( 0.63%) limited by gflops 23.08%
  NT  5:   64, 2048,   64        5.7 GFlops (12.20%)      0.4 GB/s ( 7.22%) limited by gflops 12.20%
  NT  6: 2048,   64, 2048       11.9 GFlops (25.73%)      0.4 GB/s ( 7.91%) limited by gflops 25.73%
  NT  7: 2048, 2048,   64        8.1 GFlops (17.51%)      0.3 GB/s ( 5.42%) limited by gflops 17.51%
  NT  8: 2048,   64,   64        5.7 GFlops (12.35%)      0.4 GB/s ( 7.31%) limited by gflops 12.35%
  NT  9:   64, 2048, 2048       11.9 GFlops (25.66%)      0.4 GB/s ( 7.89%) limited by gflops 25.66%
  NT 10:   64,   64, 2048       11.3 GFlops (24.42%)      0.7 GB/s (14.35%) limited by gflops 24.42%
  TN  0:  512,  512,  512        7.2 GFlops (15.55%)      0.1 GB/s ( 1.69%) limited by gflops 15.55%
  TN  1: 1024, 1024, 1024        7.3 GFlops (15.75%)      0.0 GB/s ( 0.85%) limited by gflops 15.75%
  TN  2: 1025, 1025, 1025        6.7 GFlops (14.34%)      0.0 GB/s ( 0.78%) limited by gflops 14.34%
  TN  3: 2048, 2048, 2048        7.2 GFlops (15.53%)      0.0 GB/s ( 0.42%) limited by gflops 15.53%
  TN  4: 2049, 2049, 2049        7.0 GFlops (15.00%)      0.0 GB/s ( 0.41%) limited by gflops 15.00%
  TN  5:   64, 2048,   64        4.4 GFlops ( 9.48%)      0.3 GB/s ( 5.62%) limited by gflops  9.48%
  TN  6: 2048,   64, 2048        7.6 GFlops (16.34%)      0.3 GB/s ( 5.02%) limited by gflops 16.34%
  TN  7: 2048, 2048,   64        5.5 GFlops (11.85%)      0.2 GB/s ( 3.67%) limited by gflops 11.85%
  TN  8: 2048,   64,   64        4.5 GFlops ( 9.75%)      0.3 GB/s ( 5.78%) limited by gflops  9.75%
  TN  9:   64, 2048, 2048        7.4 GFlops (15.97%)      0.2 GB/s ( 4.91%) limited by gflops 15.97%
  TN 10:   64,   64, 2048        7.7 GFlops (16.51%)      0.5 GB/s ( 9.70%) limited by gflops 16.51%
  TT  0:  512,  512,  512        9.5 GFlops (20.51%)      0.1 GB/s ( 2.23%) limited by gflops 20.51%
  TT  1: 1024, 1024, 1024        9.6 GFlops (20.74%)      0.1 GB/s ( 1.13%) limited by gflops 20.74%
  TT  2: 1025, 1025, 1025        8.5 GFlops (18.31%)      0.0 GB/s ( 0.99%) limited by gflops 18.31%
  TT  3: 2048, 2048, 2048        8.9 GFlops (19.18%)      0.0 GB/s ( 0.52%) limited by gflops 19.18%
  TT  4: 2049, 2049, 2049        8.5 GFlops (18.27%)      0.0 GB/s ( 0.50%) limited by gflops 18.27%
  TT  5:   64, 2048,   64        5.1 GFlops (11.02%)      0.3 GB/s ( 6.52%) limited by gflops 11.02%
  TT  6: 2048,   64, 2048        9.0 GFlops (19.50%)      0.3 GB/s ( 5.99%) limited by gflops 19.50%
  TT  7: 2048, 2048,   64        6.6 GFlops (14.27%)      0.2 GB/s ( 4.42%) limited by gflops 14.27%
  TT  8: 2048,   64,   64        5.1 GFlops (10.95%)      0.3 GB/s ( 6.49%) limited by gflops 10.95%
  TT  9:   64, 2048, 2048        9.8 GFlops (21.12%)      0.3 GB/s ( 6.49%) limited by gflops 21.12%
  TT 10:   64,   64, 2048        9.5 GFlops (20.42%)      0.6 GB/s (12.00%) limited by gflops 20.42%
Convolution
   0     effnet  forward b=64 k=3  p=1 s=1 in=480  out=480  g=480 D=14        1.6 GFlops ( 3.41%)      0.7 GB/s (14.02%)
 limited by memory 14.02% algo=depthwise_separable
   0     effnet bwd-data b=64 k=3  p=1 s=1 in=480  out=480  g=480 D=14        0.0 GFlops ( 0.07%)      0.0 GB/s ( 0.29%)
 limited by memory  0.29% algo=depthwise_separable
   0     effnet bwd-filt b=64 k=3  p=1 s=1 in=480  out=480  g=480 D=14        0.3 GFlops ( 0.68%)      0.1 GB/s ( 2.79%)
 limited by memory  2.79% algo=depthwise_separable
   1    alexnet  forward b=64 k=11 p=2 s=4 in=3    out=64   g=1   D=224       4.4 GFlops ( 9.48%)      0.0 GB/s ( 0.86%)
 limited by gflops  9.48% algo=gemm
   1    alexnet bwd-data b=64 k=11 p=2 s=4 in=3    out=64   g=1   D=224       1.0 GFlops ( 2.15%)      0.0 GB/s ( 0.20%)
 limited by gflops  2.15% algo=gemm
   1    alexnet bwd-filt b=64 k=11 p=2 s=4 in=3    out=64   g=1   D=224       3.4 GFlops ( 7.29%)      0.0 GB/s ( 0.66%)
 limited by gflops  7.29% algo=gemm
   2    alexnet  forward b=64 k=5  p=2 s=1 in=96   out=192  g=2   D=27        3.6 GFlops ( 7.66%)      0.0 GB/s ( 0.18%)
 limited by gflops  7.66% algo=gemm
   2    alexnet bwd-data b=64 k=5  p=2 s=1 in=96   out=192  g=2   D=27        1.2 GFlops ( 2.57%)      0.0 GB/s ( 0.06%)
 limited by gflops  2.57% algo=gemm
   2    alexnet bwd-filt b=64 k=5  p=2 s=1 in=96   out=192  g=2   D=27        3.0 GFlops ( 6.41%)      0.0 GB/s ( 0.15%)
 limited by gflops  6.41% algo=gemm
   3    alexnet  forward b=64 k=5  p=2 s=1 in=64   out=192  g=1   D=27        5.1 GFlops (11.02%)      0.0 GB/s ( 0.17%)
 limited by gflops 11.02% algo=gemm
   3    alexnet bwd-data b=64 k=5  p=2 s=1 in=64   out=192  g=1   D=27        1.6 GFlops ( 3.43%)      0.0 GB/s ( 0.05%)
 limited by gflops  3.43% algo=gemm
   3    alexnet bwd-filt b=64 k=5  p=2 s=1 in=64   out=192  g=1   D=27        3.9 GFlops ( 8.45%)      0.0 GB/s ( 0.14%)
 limited by gflops  8.45% algo=gemm
   4    alexnet  forward b=64 k=3  p=1 s=1 in=384  out=256  g=1   D=13        5.1 GFlops (11.08%)      0.0 GB/s ( 0.17%)
 limited by gflops 11.08% algo=gemm
   4    alexnet bwd-data b=64 k=3  p=1 s=1 in=384  out=256  g=1   D=13        1.1 GFlops ( 2.29%)      0.0 GB/s ( 0.03%)
 limited by gflops  2.29% algo=gemm
   4    alexnet bwd-filt b=64 k=3  p=1 s=1 in=384  out=256  g=1   D=13        4.1 GFlops ( 8.74%)      0.0 GB/s ( 0.15%)
 limited by gflops  8.74% algo=gemm
   5     resnet  forward b=64 k=7  p=3 s=2 in=3    out=64   g=1   D=224       3.5 GFlops ( 7.63%)      0.1 GB/s ( 1.14%)
 limited by gflops  7.63% algo=gemm
   5     resnet bwd-data b=64 k=7  p=3 s=2 in=3    out=64   g=1   D=224       1.0 GFlops ( 2.17%)      0.0 GB/s ( 0.32%)
 limited by gflops  2.17% algo=gemm
   5     resnet bwd-filt b=64 k=7  p=3 s=2 in=3    out=64   g=1   D=224       3.0 GFlops ( 6.39%)      0.0 GB/s ( 0.96%)
 limited by gflops  6.39% algo=gemm
   6     resnet  forward b=64 k=1  p=0 s=1 in=64   out=256  g=1   D=56        5.7 GFlops (12.31%)      0.2 GB/s ( 4.45%)
 limited by gflops 12.31% algo=gemm
   6     resnet bwd-data b=64 k=1  p=0 s=1 in=64   out=256  g=1   D=56        5.6 GFlops (12.11%)      0.2 GB/s ( 4.38%)
 limited by gflops 12.11% algo=gemm
   6     resnet bwd-filt b=64 k=1  p=0 s=1 in=64   out=256  g=1   D=56        7.0 GFlops (15.11%)      0.3 GB/s ( 5.47%)
 limited by gflops 15.11% algo=gemm
   7     resnet  forward b=64 k=1  p=0 s=1 in=64   out=64   g=1   D=56        5.8 GFlops (12.55%)      0.4 GB/s ( 7.26%)
 limited by gflops 12.55% algo=gemm
   7     resnet bwd-data b=64 k=1  p=0 s=1 in=64   out=64   g=1   D=56        3.5 GFlops ( 7.51%)      0.2 GB/s ( 4.34%)
 limited by gflops  7.51% algo=gemm
   7     resnet bwd-filt b=64 k=1  p=0 s=1 in=64   out=64   g=1   D=56        7.0 GFlops (15.09%)      0.4 GB/s ( 8.73%)
 limited by gflops 15.09% algo=gemm
   8     resnet  forward b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=56        4.7 GFlops (10.15%)      0.0 GB/s ( 0.65%)
 limited by gflops 10.15% algo=gemm
   8     resnet bwd-data b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=56        1.0 GFlops ( 2.13%)      0.0 GB/s ( 0.14%)
 limited by gflops  2.13% algo=gemm
   8     resnet bwd-filt b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=56        3.9 GFlops ( 8.48%)      0.0 GB/s ( 0.55%)
 limited by gflops  8.48% algo=gemm
   9     resnet  forward b=64 k=1  p=0 s=2 in=1024 out=2048 g=1   D=14        7.8 GFlops (16.73%)      0.1 GB/s ( 1.01%)
 limited by gflops 16.73% algo=gemm
   9     resnet bwd-data b=64 k=1  p=0 s=2 in=1024 out=2048 g=1   D=14        4.3 GFlops ( 9.27%)      0.0 GB/s ( 0.56%)
 limited by gflops  9.27% algo=gemm
   9     resnet bwd-filt b=64 k=1  p=0 s=2 in=1024 out=2048 g=1   D=14        3.7 GFlops ( 8.01%)      0.0 GB/s ( 0.53%)
 limited by gflops  8.01% algo=gemm
  10     resnet  forward b=64 k=1  p=0 s=1 in=1024 out=256  g=1   D=14        8.3 GFlops (17.98%)      0.1 GB/s ( 1.65%)
 limited by gflops 17.98% algo=gemm
  10     resnet bwd-data b=64 k=1  p=0 s=1 in=1024 out=256  g=1   D=14        5.7 GFlops (12.29%)      0.1 GB/s ( 1.13%)
 limited by gflops 12.29% algo=gemm
  10     resnet bwd-filt b=64 k=1  p=0 s=1 in=1024 out=256  g=1   D=14        7.1 GFlops (15.31%)      0.1 GB/s ( 1.43%)
 limited by gflops 15.31% algo=gemm
  11     resnet  forward b=64 k=3  p=1 s=1 in=256  out=256  g=1   D=14        5.2 GFlops (11.11%)      0.0 GB/s ( 0.19%)
 limited by gflops 11.11% algo=gemm
  11     resnet bwd-data b=64 k=3  p=1 s=1 in=256  out=256  g=1   D=14        1.0 GFlops ( 2.15%)      0.0 GB/s ( 0.04%)
 limited by gflops  2.15% algo=gemm
  11     resnet bwd-filt b=64 k=3  p=1 s=1 in=256  out=256  g=1   D=14        4.0 GFlops ( 8.53%)      0.0 GB/s ( 0.16%)
 limited by gflops  8.53% algo=gemm
  12        vgg  forward b=64 k=3  p=1 s=1 in=3    out=64   g=1   D=224       2.1 GFlops ( 4.59%)      0.2 GB/s ( 3.29%)
 limited by gflops  4.59% algo=gemm
  12        vgg bwd-data b=64 k=3  p=1 s=1 in=3    out=64   g=1   D=224       1.0 GFlops ( 2.13%)      0.1 GB/s ( 1.53%)
 limited by gflops  2.13% algo=gemm
  12        vgg bwd-filt b=64 k=3  p=1 s=1 in=3    out=64   g=1   D=224       1.7 GFlops ( 3.60%)      0.1 GB/s ( 2.58%)
 limited by gflops  3.60% algo=gemm
  13        vgg  forward b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=224      13.2 GFlops (28.35%)      0.1 GB/s ( 1.82%)
 limited by gflops 28.35% algo=gemm
  13        vgg bwd-data b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=224      13.2 GFlops (28.35%)      0.1 GB/s ( 1.82%)
 limited by gflops 28.35% algo=gemm
  13        vgg bwd-filt b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=224      13.2 GFlops (28.35%)      0.1 GB/s ( 1.82%)
 limited by gflops 28.35% algo=gemm
  14        vgg  forward b=64 k=3  p=1 s=1 in=512  out=512  g=1   D=28       13.2 GFlops (28.35%)      0.0 GB/s ( 0.24%)
 limited by gflops 28.35% algo=gemm
  14        vgg bwd-data b=64 k=3  p=1 s=1 in=512  out=512  g=1   D=28       13.2 GFlops (28.35%)      0.0 GB/s ( 0.24%)
 limited by gflops 28.35% algo=gemm
  14        vgg bwd-filt b=64 k=3  p=1 s=1 in=512  out=512  g=1   D=28       13.2 GFlops (28.35%)      0.0 GB/s ( 0.25%)
 limited by gflops 28.35% algo=gemm
  15     mobile  forward b=64 k=3  p=1 s=2 in=3    out=32   g=1   D=224       1.2 GFlops ( 2.54%)      0.1 GB/s ( 2.39%)
 limited by gflops  2.54% algo=gemm
  15     mobile bwd-data b=64 k=3  p=1 s=2 in=3    out=32   g=1   D=224       0.3 GFlops ( 0.73%)      0.0 GB/s ( 0.69%)
 limited by gflops  0.73% algo=gemm
  15     mobile bwd-filt b=64 k=3  p=1 s=2 in=3    out=32   g=1   D=224       0.9 GFlops ( 1.88%)      0.1 GB/s ( 1.77%)
 limited by gflops  1.88% algo=gemm
  16     mobile  forward b=64 k=3  p=1 s=1 in=144  out=144  g=144 D=56        1.5 GFlops ( 3.34%)      0.7 GB/s (13.73%)
 limited by memory 13.73% algo=depthwise_separable
  16     mobile bwd-data b=64 k=3  p=1 s=1 in=144  out=144  g=144 D=56        0.1 GFlops ( 0.32%)      0.1 GB/s ( 1.30%)
 limited by memory  1.30% algo=depthwise_separable
  16     mobile bwd-filt b=64 k=3  p=1 s=1 in=144  out=144  g=144 D=56        0.2 GFlops ( 0.42%)      0.1 GB/s ( 1.74%)
 limited by memory  1.74% algo=depthwise_separable
  17     mobile  forward b=64 k=3  p=1 s=2 in=144  out=144  g=144 D=56        0.1 GFlops ( 0.16%)      0.1 GB/s ( 1.60%)
 limited by memory  1.60% algo=gemm
  17     mobile bwd-data b=64 k=3  p=1 s=2 in=144  out=144  g=144 D=56
       0.0 GFlops ( 0.08%)      0.0 GB/s ( 0.80%) limited by memory  0.80% algo=gemm
  17     mobile bwd-filt b=64 k=3  p=1 s=2 in=144  out=144  g=144 D=56

It seems to be stuck in this position.

artyom-beilis commented 8 months ago

46 GFlops is very low; this is a very weak device. What platform are you running on? Is it something ARM-based? Even an old Intel CPU has much better performance and more GFlops.

So it isn't clear what benefit you would get from using the GPU.
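
For reference, the figure of about 46 GFlops can be cross-checked against the benchmark output above. This is only a sketch: it assumes the percentage printed next to each GFlops value is the measured throughput divided by the device's estimated peak, and the helper name `implied_peak_gflops` is made up for illustration.

```python
def implied_peak_gflops(measured_gflops, percent_of_peak):
    """Back out the peak throughput implied by one benchmark row.

    Assumption: percent_of_peak == 100 * measured / peak.
    """
    return measured_gflops / (percent_of_peak / 100.0)


# (measured GFlops, percent of peak) copied from rows of the log above:
# "TT 0", "TT 1" and convolution 13 (vgg forward).
rows = [(9.5, 20.51), (9.6, 20.74), (13.2, 28.35)]

peaks = [implied_peak_gflops(g, p) for g, p in rows]
for (g, p), peak in zip(rows, peaks):
    print(f"{g:5.1f} GFlops at {p:5.2f}% -> implied peak {peak:.1f} GFlops")
```

Each row implies a peak of roughly 46 to 47 GFlops, which matches the number quoted above.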

ParadosBoy commented 8 months ago

Hmm... This is a GPU designed for the RISC-V architecture; I'm just testing compatibility.

Feefkroete commented 6 months ago

Actually, I'm getting exactly the same RuntimeError as in the initial post.

Using device: ocl:0
Accessing device #0:AMD Radeon RX 570 Series (radeonsi, polaris10, LLVM 17.0.6, DRM 3.57, 6.8.7-artix1-2) on Clover
Traceback (most recent call last):
  File "/home/Python/dlprim_test/pytorch_dlprim/mnist.py", line 162, in <module>
    main()
  File "/home/Python/dlprim_test/pytorch_dlprim/mnist.py", line 153, in main
    train(args, model, device, train_loader, optimizer, epoch)
  File "/home/Python/dlprim_test/pytorch_dlprim/mnist.py", line 55, in train
    loss.backward()
  File "/home/Python/dlprim_test/dlprim/lib/python3.12/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/home/Python/dlprim_test/dlprim/lib/python3.12/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/Python/dlprim_test/dlprim/lib/python3.12/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Please register PrivateUse1HooksInterface by `RegisterPrivateUse1HooksInterface` first.


Building pytorch-dlprim succeeded; however, compiling dlprimitives failed with exit code 2 after most of the files had been compiled. Unfortunately, I am not experienced enough in C++/Make/GCC (yet :P) to really figure out what is happening here: Make-output


But I was able to run the dlprim_flops test: dlprim_flops-output

xvim commented 4 months ago

I also get exactly the same RuntimeError.

root@kylin-desktop:/home/liucong/dlprimitives/build# make test
Running tests...
Test project /home/liucong/dlprimitives/build
      Start  1: test_test_case_abs
 1/33 Test  #1: test_test_case_abs ...............   Passed    1.23 sec
      Start  2: test_test_case_activation
 2/33 Test  #2: test_test_case_activation ........   Passed    4.38 sec
      Start  3: test_test_case_batchnorm
 3/33 Test  #3: test_test_case_batchnorm .........   Passed   15.26 sec
      Start  4: test_test_case_concat
 4/33 Test  #4: test_test_case_concat ............   Passed    0.55 sec
      Start  5: test_test_case_conv2d

^Cmake: *** [Makefile:71: test] Interrupt

It seems to be stuck in this position

# clinfo -l
Platform #0: AMD Accelerated Parallel Processing
 `-- Device #0: gfx1032
Platform #1: NVIDIA CUDA
 `-- Device #0: NVIDIA GeForce RTX 4060 Ti

(pytorch_cpu_env) root@kylin-desktop:/home/liucong/pytorch_dlprim# python mnist.py --device ocl:0
Using device: ocl:0
Accessing device #0:gfx1032 on AMD Accelerated Parallel Processing
Traceback (most recent call last):
  File "/home/liucong/pytorch_dlprim/mnist.py", line 162, in <module>
    main()
  File "/home/liucong/pytorch_dlprim/mnist.py", line 153, in main
    train(args, model, device, train_loader, optimizer, epoch)
  File "/home/liucong/pytorch_dlprim/mnist.py", line 55, in train
    loss.backward()
  File "/home/liucong/pytorch_cpu_env/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/home/liucong/pytorch_cpu_env/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/liucong/pytorch_cpu_env/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Please register PrivateUse1HooksInterface by `RegisterPrivateUse1HooksInterface` first.

(pytorch_cpu_env) root@kylin-desktop:/home/liucong/pytorch_dlprim# python mnist.py --device ocl:1
Using device: ocl:1
Accessing device #1:NVIDIA GeForce RTX 4060 Ti on NVIDIA CUDA
Traceback (most recent call last):
  File "/home/liucong/pytorch_dlprim/mnist.py", line 162, in <module>
    main()
  File "/home/liucong/pytorch_dlprim/mnist.py", line 153, in main
    train(args, model, device, train_loader, optimizer, epoch)
  File "/home/liucong/pytorch_dlprim/mnist.py", line 55, in train
    loss.backward()
  File "/home/liucong/pytorch_cpu_env/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/home/liucong/pytorch_cpu_env/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/liucong/pytorch_cpu_env/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Please register PrivateUse1HooksInterface by `RegisterPrivateUse1HooksInterface` first.

artyom-beilis commented 4 months ago

RuntimeError: Please register PrivateUse1HooksInterface by RegisterPrivateUse1HooksInterface first.

Related to this: https://github.com/artyom-beilis/pytorch_dlprim/issues/77

Some stuff changed in PyTorch 2.3; please use PyTorch 1.13 until I fix it.
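
Until that fix lands, a quick version check before running mnist.py may save some debugging time. The sketch below is not part of pytorch_dlprim: the helper names are hypothetical, and it assumes that every PyTorch release from 2.3 onward is affected, which is only what the comment above suggests.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from a string such as '2.3.0+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def may_hit_hooks_error(version):
    """Return True for versions assumed to raise the PrivateUse1HooksInterface error.

    Assumption: the breaking change arrived in 2.3 and applies to later releases too.
    """
    return parse_major_minor(version) >= (2, 3)


for v in ("1.13.1", "2.3.0+cu121"):
    print(v, "-> possibly affected" if may_hit_hooks_error(v) else "-> expected to work")
```

In a real environment the string to pass in would be `torch.__version__`.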