Open ParadosBoy opened 8 months ago
Ok.
Two points. First, let's check that dlprimitives themselves work - they weren't tested on this GPU type. Can you please build dlprimitives and run the tests and benchmarks?
Also please post clinfo output.
Once we pass that, we'll go back to pytorch to figure out what the issue is. Which version of pytorch do you use?
sipeed@lpi4a:~$ clinfo
Number of platforms 1
Platform Name PowerVR
Platform Vendor Imagination Technologies
Platform Version OpenCL 3.0
Platform Profile EMBEDDED_PROFILE
Platform Extensions cl_khr_icd cl_khr_fp16 cl_img_spirv cles_khr_int64 cl_img_yuv_image cl_khr_device_uuid cl_khr_depth_images cl_khr_mipmap_image cl_khr_priority_hints cl_img_generate_mipmap cl_khr_3d_image_writes cl_img_cached_allocations cl_khr_mipmap_image_writes cl_khr_create_command_queue cl_khr_suggested_local_work_size cl_img_mem_properties cl_img_mem_properties_relax_alloc_requirements cl_khr_extended_versioning cl_khr_image2d_from_buffer cl_khr_byte_addressable_store cl_khr_local_int32_base_atomics cl_khr_global_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_global_int32_extended_atomics cl_khr_spir cl_khr_il_program cl_khr_egl_image cl_arm_import_memory cl_arm_import_memory_dma_buf cl_img_protected_content cl_img_semaphore cl_img_external_semaphore cl_img_external_semaphore_sync_fd cl_khr_semaphore cl_khr_external_semaphore cl_khr_external_semaphore_sync_fd
Platform Extensions with Version cl_khr_icd 0x400000 (1.0.0)
cl_khr_fp16 0x400000 (1.0.0)
cl_img_spirv 0x400000 (1.0.0)
cles_khr_int64 0x400000 (1.0.0)
cl_img_yuv_image 0x400000 (1.0.0)
cl_khr_device_uuid 0x400000 (1.0.0)
cl_khr_depth_images 0x400000 (1.0.0)
cl_khr_mipmap_image 0x400000 (1.0.0)
cl_khr_priority_hints 0x400000 (1.0.0)
cl_img_generate_mipmap 0x400000 (1.0.0)
cl_khr_3d_image_writes 0x400000 (1.0.0)
cl_img_cached_allocations 0x400000 (1.0.0)
cl_khr_mipmap_image_writes 0x400000 (1.0.0)
cl_khr_create_command_queue 0x400000 (1.0.0)
cl_khr_suggested_local_work_size 0x400000 (1.0.0)
cl_img_mem_properties 0x400000 (1.0.0)
cl_img_mem_properties_relax_alloc_requirements 0x400000 (1.0.0)
cl_khr_extended_versioning 0x400000 (1.0.0)
cl_khr_image2d_from_buffer 0x400000 (1.0.0)
cl_khr_byte_addressable_store 0x400000 (1.0.0)
cl_khr_local_int32_base_atomics 0x400000 (1.0.0)
cl_khr_global_int32_base_atomics 0x400000 (1.0.0)
cl_khr_local_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_global_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_spir 0x400000 (1.0.0)
cl_khr_il_program 0x400000 (1.0.0)
cl_khr_egl_image 0x400000 (1.0.0)
cl_arm_import_memory 0x400000 (1.0.0)
cl_arm_import_memory_dma_buf 0x400000 (1.0.0)
cl_img_protected_content 0x400000 (1.0.0)
cl_img_semaphore 0x400000 (1.0.0)
cl_img_external_semaphore 0x400000 (1.0.0)
cl_img_external_semaphore_sync_fd 0x400000 (1.0.0)
cl_khr_semaphore 0x400000 (1.0.0)
cl_khr_external_semaphore 0x400000 (1.0.0)
cl_khr_external_semaphore_sync_fd 0x400000 (1.0.0)
Platform Numeric Version 0xc00000 (3.0.0)
Platform Extensions function suffix IMG
Platform Host timer resolution 0ns
Platform Semaphore types Binary
Platform External semaphore import types <gatherPlatformInfo:12: get CL_PLATFORM_SEMAPHORE_IMPORT_HANDLE_TYPES_KHR size : error -30>
Platform External semaphore export types <gatherPlatformInfo:13: get CL_PLATFORM_SEMAPHORE_EXPORT_HANDLE_TYPES_KHR size : error -30>
Platform Name PowerVR
Number of devices 1
Device Name PowerVR B-Series BXM-4-64
Device Vendor Imagination Technologies
Device Vendor ID 0x1010
Device Version OpenCL 3.0
Device UUID 33362035-3220-3130-3420-313832000000
Driver UUID 36323130-3836-3600-0000-000000000000
Valid Device LUID No
Device LUID 0000-000000000000
Device Node Mask 0
Device Numeric Version 0xc00000 (3.0.0)
Driver Version 1.17@6210866
Device OpenCL C Version OpenCL C 1.2
Device OpenCL C Numeric Version 0x402000 (1.2.0)
Device OpenCL C all versions OpenCL C 0x400000 (1.0.0)
OpenCL C 0x401000 (1.1.0)
OpenCL C 0x402000 (1.2.0)
OpenCL C 0xc00000 (3.0.0)
Device OpenCL C features __opencl_c_int64 0x400000 (1.0.0)
__opencl_c_pipes 0xc00000 (3.0.0)
__opencl_c_images 0x400000 (1.0.0)
__opencl_c_subgroups 0xc00000 (3.0.0)
__opencl_c_3d_image_writes 0x400000 (1.0.0)
__opencl_c_read_write_images 0x400000 (1.0.0)
__opencl_c_generic_address_space 0xc00000 (3.0.0)
__opencl_c_program_scope_global_variables 0xc00000 (3.0.0)
__opencl_c_work_group_collective_functions 0xc00000 (3.0.0)
Latest conformance test passed v2021-10-04-00
Device Type GPU
Device Profile EMBEDDED_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 1
Max clock frequency 792MHz
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 512x512x512
Max work group size 512
Preferred work group size multiple (device) 32
Preferred work group size multiple (kernel) 32
Max sub-groups per work group 512
Preferred / native vector sizes
char 16 / 1
short 8 / 1
int 4 / 1
long 2 / 1
half 0 / 0 (cl_khr_fp16)
float 4 / 1
double 0 / 0 (n/a)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 64, Little-Endian
Semaphore types <printDeviceInfo:105: get number of CL_DEVICE_SEMAPHORE_TYPES_KHR : error -30>
External semaphore import types <printDeviceInfo:106: get number of CL_DEVICE_SEMAPHORE_IMPORT_HANDLE_TYPES_KHR : error -30>
External semaphore export types <printDeviceInfo:107: get number of CL_DEVICE_SEMAPHORE_EXPORT_HANDLE_TYPES_KHR : error -30>
Global memory size 16503758848 (15.37GiB)
Error Correction support No
Max memory allocation 4125939712 (3.843GiB)
Unified memory for Host and Device Yes
Shared Virtual Memory (SVM) capabilities (core)
Coarse-grained buffer sharing Yes
Fine-grained buffer sharing No
Fine-grained system sharing No
Atomics No
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Preferred alignment for atomics
SVM 0 bytes
Global 0 bytes
Local 0 bytes
Atomic memory capabilities relaxed, work-group scope
Atomic fence capabilities relaxed, acquire/release, work-group scope
Max size for global variable 16384 (16KiB)
Preferred total size of global vars 0
Global Memory cache type Read/Write
Global Memory cache size 16384 (16KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 16384 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 64 bytes
Pitch alignment for 2D image buffers 64 pixels
Max 2D image size 16384x16384 pixels
Max 3D image size 16384x16384x2048 pixels
Max number of read image args 8
Max number of write image args 64
Max number of read/write image args 64
Pipe support Yes
Max number of pipe args 16
Max active pipe reservations 1
Max pipe packet size 1024
Local memory type Local
Local memory size 4096 (4KiB)
Max number of constant args 256
Max constant buffer size 4125939712 (3.843GiB)
Generic address space support Yes
Max size of kernel argument 1024
Queue properties (on host)
Out-of-order execution Yes
Profiling Yes
Device enqueue capabilities (n/a)
Queue properties (on device)
Out-of-order execution No
Profiling No
Preferred size 0
Max size 0
Max queues on device 0
Max events on device 0
Prefer user sync for interop Yes
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
Non-uniform work-groups Yes
Work-group collective functions Yes
Sub-group independent forward progress No
IL version SPIR-V_1.2
ILs with version SPIR-V 0x402000 (1.2.0)
SPIR versions 1.2
printf() buffer size 65536 (64KiB)
Built-in kernels (n/a)
Built-in kernels with version (n/a)
Device Extensions cl_khr_icd cl_khr_fp16 cl_img_spirv cles_khr_int64 cl_img_yuv_image cl_khr_device_uuid cl_khr_depth_images cl_khr_mipmap_image cl_khr_priority_hints cl_img_generate_mipmap cl_khr_3d_image_writes cl_img_cached_allocations cl_khr_mipmap_image_writes cl_khr_create_command_queue cl_khr_suggested_local_work_size cl_img_mem_properties cl_img_mem_properties_relax_alloc_requirements cl_khr_extended_versioning cl_khr_image2d_from_buffer cl_khr_byte_addressable_store cl_khr_local_int32_base_atomics cl_khr_global_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_global_int32_extended_atomics cl_khr_spir cl_khr_il_program cl_khr_egl_image cl_arm_import_memory cl_arm_import_memory_dma_buf cl_img_protected_content cl_img_semaphore cl_img_external_semaphore cl_img_external_semaphore_sync_fd cl_khr_semaphore cl_khr_external_semaphore cl_khr_external_semaphore_sync_fd
Device Extensions with Version cl_khr_icd 0x400000 (1.0.0)
cl_khr_fp16 0x400000 (1.0.0)
cl_img_spirv 0x400000 (1.0.0)
cles_khr_int64 0x400000 (1.0.0)
cl_img_yuv_image 0x400000 (1.0.0)
cl_khr_device_uuid 0x400000 (1.0.0)
cl_khr_depth_images 0x400000 (1.0.0)
cl_khr_mipmap_image 0x400000 (1.0.0)
cl_khr_priority_hints 0x400000 (1.0.0)
cl_img_generate_mipmap 0x400000 (1.0.0)
cl_khr_3d_image_writes 0x400000 (1.0.0)
cl_img_cached_allocations 0x400000 (1.0.0)
cl_khr_mipmap_image_writes 0x400000 (1.0.0)
cl_khr_create_command_queue 0x400000 (1.0.0)
cl_khr_suggested_local_work_size 0x400000 (1.0.0)
cl_img_mem_properties 0x400000 (1.0.0)
cl_img_mem_properties_relax_alloc_requirements 0x400000 (1.0.0)
cl_khr_extended_versioning 0x400000 (1.0.0)
cl_khr_image2d_from_buffer 0x400000 (1.0.0)
cl_khr_byte_addressable_store 0x400000 (1.0.0)
cl_khr_local_int32_base_atomics 0x400000 (1.0.0)
cl_khr_global_int32_base_atomics 0x400000 (1.0.0)
cl_khr_local_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_global_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_spir 0x400000 (1.0.0)
cl_khr_il_program 0x400000 (1.0.0)
cl_khr_egl_image 0x400000 (1.0.0)
cl_arm_import_memory 0x400000 (1.0.0)
cl_arm_import_memory_dma_buf 0x400000 (1.0.0)
cl_img_protected_content 0x400000 (1.0.0)
cl_img_semaphore 0x400000 (1.0.0)
cl_img_external_semaphore 0x400000 (1.0.0)
cl_img_external_semaphore_sync_fd 0x400000 (1.0.0)
cl_khr_semaphore 0x400000 (1.0.0)
cl_khr_external_semaphore 0x400000 (1.0.0)
cl_khr_external_semaphore_sync_fd 0x400000 (1.0.0)
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) PowerVR
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [IMG]
clCreateContext(NULL, ...) [default] Success [IMG]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name PowerVR
Device Name PowerVR B-Series BXM-4-64
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name PowerVR
Device Name PowerVR B-Series BXM-4-64
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) Invalid device type for platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name PowerVR
Device Name PowerVR B-Series BXM-4-64
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.3.1
ICD loader Profile OpenCL 3.0
The examples in dlprimitives also don't seem to work:
Traceback (most recent call last):
File "/home/sipeed/Desktop/dlprimitives/examples/python/mnist/train_mnist.py", line 8, in <module>
import dlprim as dp
ModuleNotFoundError: No module named 'dlprim'
Maybe my GPU is not suitable for this framework.
Don't run python.
Run dlprim_flops
Something like:
./dlprim_flops 0:0 1.0
And run `make test` to see if the tests pass (it may take some time).
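For readers following along, the steps above amount to a standard CMake build-and-benchmark flow. This is a sketch only - the repository URL and flags are assumptions; see the dlprimitives README for the authoritative procedure:

```shell
# Hypothetical build sketch (standard CMake flow; exact options may differ -
# consult the dlprimitives README).
git clone https://github.com/artyom-beilis/dlprimitives.git
cd dlprimitives
mkdir build && cd build
cmake ..
make -j"$(nproc)"

# dlprim_flops takes a platform:device pair (0:0 = first platform, first
# device) plus a numeric argument (1.0 in this thread).
./dlprim_flops 0:0 1.0

# Run the test suite (may take a while on a slow GPU).
make test
```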
sipeed@lpi4a:~/Desktop/dlprimitives/build$ ./dlprim_flops 0:0 1.0
Testing on PowerVR B-Series BXM-4-64 on PowerVR
Testing memory speed
- Vector size 1
-- Warming
-- Running 2.78029 GB/s
- Vector size 2
-- Warming
-- Running 4.53877 GB/s
- Vector size 4
-- Warming
-- Running 8.03852 GB/s
- Vector size 8
-- Warming
-- Running 4.89552 GB/s
- Vector size 16
-- Warming
-- Running 5.01402 GB/s
Testing flops float
- Vector size 1
-- Warming
-- Running 46.4071 GFlops
- Vector size 2
-- Warming
-- Running 42.1022 GFlops
- Vector size 4
-- Warming
-- Running 37.9592 GFlops
- Vector size 8
-- Warming
-- Running 36.6869 GFlops
- Vector size 16
-- Warming
-- Running 35.4454 GFlops
Testing flops half
- Vector size 1
-- Warming
-- Running 46.3838 GFlops
- Vector size 2
-- Warming
-- Running 44.5784 GFlops
- Vector size 4
-- Warming
-- Running 37.1824 GFlops
- Vector size 8
-- Warming
-- Running 37.9947 GFlops
- Vector size 16
-- Warming
-- Running 35.8126 GFlops
Summray for PowerVR B-Series BXM-4-64 on PowerVR
Peak GFlops for float 46.4071
Peak GFlops for half 46.3838
Peak memory 8.03852 GB/s
GEMM
NN 0: 512, 512, 512 9.3 GFlops (19.96%) 0.1 GB/s ( 2.17%) limited by gflops 19.96%
NN 1: 1024, 1024, 1024 9.4 GFlops (20.18%) 0.1 GB/s ( 1.09%) limited by gflops 20.18%
NN 2: 1025, 1025, 1025 8.4 GFlops (18.07%) 0.0 GB/s ( 0.98%) limited by gflops 18.07%
NN 3: 2048, 2048, 2048 8.8 GFlops (18.96%) 0.0 GB/s ( 0.51%) limited by gflops 18.96%
NN 4: 2049, 2049, 2049 8.4 GFlops (18.02%) 0.0 GB/s ( 0.49%) limited by gflops 18.02%
NN 5: 64, 2048, 64 4.9 GFlops (10.59%) 0.3 GB/s ( 6.27%) limited by gflops 10.59%
NN 6: 2048, 64, 2048 9.7 GFlops (20.91%) 0.3 GB/s ( 6.43%) limited by gflops 20.91%
NN 7: 2048, 2048, 64 6.6 GFlops (14.22%) 0.2 GB/s ( 4.40%) limited by gflops 14.22%
NN 8: 2048, 64, 64 5.2 GFlops (11.13%) 0.3 GB/s ( 6.59%) limited by gflops 11.13%
NN 9: 64, 2048, 2048 8.7 GFlops (18.80%) 0.3 GB/s ( 5.78%) limited by gflops 18.80%
NN 10: 64, 64, 2048 9.4 GFlops (20.27%) 0.6 GB/s (11.91%) limited by gflops 20.27%
NT 0: 512, 512, 512 11.9 GFlops (25.59%) 0.1 GB/s ( 2.78%) limited by gflops 25.59%
NT 1: 1024, 1024, 1024 12.5 GFlops (26.90%) 0.1 GB/s ( 1.46%) limited by gflops 26.90%
NT 2: 1025, 1025, 1025 10.4 GFlops (22.44%) 0.1 GB/s ( 1.22%) limited by gflops 22.44%
NT 3: 2048, 2048, 2048 12.0 GFlops (25.85%) 0.0 GB/s ( 0.70%) limited by gflops 25.85%
NT 4: 2049, 2049, 2049 10.7 GFlops (23.08%) 0.0 GB/s ( 0.63%) limited by gflops 23.08%
NT 5: 64, 2048, 64 5.7 GFlops (12.20%) 0.4 GB/s ( 7.22%) limited by gflops 12.20%
NT 6: 2048, 64, 2048 11.9 GFlops (25.73%) 0.4 GB/s ( 7.91%) limited by gflops 25.73%
NT 7: 2048, 2048, 64 8.1 GFlops (17.51%) 0.3 GB/s ( 5.42%) limited by gflops 17.51%
NT 8: 2048, 64, 64 5.7 GFlops (12.35%) 0.4 GB/s ( 7.31%) limited by gflops 12.35%
NT 9: 64, 2048, 2048 11.9 GFlops (25.66%) 0.4 GB/s ( 7.89%) limited by gflops 25.66%
NT 10: 64, 64, 2048 11.3 GFlops (24.42%) 0.7 GB/s (14.35%) limited by gflops 24.42%
TN 0: 512, 512, 512 7.2 GFlops (15.55%) 0.1 GB/s ( 1.69%) limited by gflops 15.55%
TN 1: 1024, 1024, 1024 7.3 GFlops (15.75%) 0.0 GB/s ( 0.85%) limited by gflops 15.75%
TN 2: 1025, 1025, 1025 6.7 GFlops (14.34%) 0.0 GB/s ( 0.78%) limited by gflops 14.34%
TN 3: 2048, 2048, 2048 7.2 GFlops (15.53%) 0.0 GB/s ( 0.42%) limited by gflops 15.53%
TN 4: 2049, 2049, 2049 7.0 GFlops (15.00%) 0.0 GB/s ( 0.41%) limited by gflops 15.00%
TN 5: 64, 2048, 64 4.4 GFlops ( 9.48%) 0.3 GB/s ( 5.62%) limited by gflops 9.48%
TN 6: 2048, 64, 2048 7.6 GFlops (16.34%) 0.3 GB/s ( 5.02%) limited by gflops 16.34%
TN 7: 2048, 2048, 64 5.5 GFlops (11.85%) 0.2 GB/s ( 3.67%) limited by gflops 11.85%
TN 8: 2048, 64, 64 4.5 GFlops ( 9.75%) 0.3 GB/s ( 5.78%) limited by gflops 9.75%
TN 9: 64, 2048, 2048 7.4 GFlops (15.97%) 0.2 GB/s ( 4.91%) limited by gflops 15.97%
TN 10: 64, 64, 2048 7.7 GFlops (16.51%) 0.5 GB/s ( 9.70%) limited by gflops 16.51%
TT 0: 512, 512, 512 9.5 GFlops (20.51%) 0.1 GB/s ( 2.23%) limited by gflops 20.51%
TT 1: 1024, 1024, 1024 9.6 GFlops (20.74%) 0.1 GB/s ( 1.13%) limited by gflops 20.74%
TT 2: 1025, 1025, 1025 8.5 GFlops (18.31%) 0.0 GB/s ( 0.99%) limited by gflops 18.31%
TT 3: 2048, 2048, 2048 8.9 GFlops (19.18%) 0.0 GB/s ( 0.52%) limited by gflops 19.18%
TT 4: 2049, 2049, 2049 8.5 GFlops (18.27%) 0.0 GB/s ( 0.50%) limited by gflops 18.27%
TT 5: 64, 2048, 64 5.1 GFlops (11.02%) 0.3 GB/s ( 6.52%) limited by gflops 11.02%
TT 6: 2048, 64, 2048 9.0 GFlops (19.50%) 0.3 GB/s ( 5.99%) limited by gflops 19.50%
TT 7: 2048, 2048, 64 6.6 GFlops (14.27%) 0.2 GB/s ( 4.42%) limited by gflops 14.27%
TT 8: 2048, 64, 64 5.1 GFlops (10.95%) 0.3 GB/s ( 6.49%) limited by gflops 10.95%
TT 9: 64, 2048, 2048 9.8 GFlops (21.12%) 0.3 GB/s ( 6.49%) limited by gflops 21.12%
TT 10: 64, 64, 2048 9.5 GFlops (20.42%) 0.6 GB/s (12.00%) limited by gflops 20.42%
Convolution
0 effnet forward b=64 k=3 p=1 s=1 in=480 out=480 g=480 D=14 1.6 GFlops ( 3.41%) 0.7 GB/s (14.02%)
limited by memory 14.02% algo=depthwise_separable
0 effnet bwd-data b=64 k=3 p=1 s=1 in=480 out=480 g=480 D=14 0.0 GFlops ( 0.07%) 0.0 GB/s ( 0.29%)
limited by memory 0.29% algo=depthwise_separable
0 effnet bwd-filt b=64 k=3 p=1 s=1 in=480 out=480 g=480 D=14 0.3 GFlops ( 0.68%) 0.1 GB/s ( 2.79%)
limited by memory 2.79% algo=depthwise_separable
1 alexnet forward b=64 k=11 p=2 s=4 in=3 out=64 g=1 D=224 4.4 GFlops ( 9.48%) 0.0 GB/s ( 0.86%)
limited by gflops 9.48% algo=gemm
1 alexnet bwd-data b=64 k=11 p=2 s=4 in=3 out=64 g=1 D=224 1.0 GFlops ( 2.15%) 0.0 GB/s ( 0.20%)
limited by gflops 2.15% algo=gemm
1 alexnet bwd-filt b=64 k=11 p=2 s=4 in=3 out=64 g=1 D=224 3.4 GFlops ( 7.29%) 0.0 GB/s ( 0.66%)
limited by gflops 7.29% algo=gemm
2 alexnet forward b=64 k=5 p=2 s=1 in=96 out=192 g=2 D=27 3.6 GFlops ( 7.66%) 0.0 GB/s ( 0.18%)
limited by gflops 7.66% algo=gemm
2 alexnet bwd-data b=64 k=5 p=2 s=1 in=96 out=192 g=2 D=27 1.2 GFlops ( 2.57%) 0.0 GB/s ( 0.06%)
limited by gflops 2.57% algo=gemm
2 alexnet bwd-filt b=64 k=5 p=2 s=1 in=96 out=192 g=2 D=27 3.0 GFlops ( 6.41%) 0.0 GB/s ( 0.15%)
limited by gflops 6.41% algo=gemm
3 alexnet forward b=64 k=5 p=2 s=1 in=64 out=192 g=1 D=27 5.1 GFlops (11.02%) 0.0 GB/s ( 0.17%)
limited by gflops 11.02% algo=gemm
3 alexnet bwd-data b=64 k=5 p=2 s=1 in=64 out=192 g=1 D=27 1.6 GFlops ( 3.43%) 0.0 GB/s ( 0.05%)
limited by gflops 3.43% algo=gemm
3 alexnet bwd-filt b=64 k=5 p=2 s=1 in=64 out=192 g=1 D=27 3.9 GFlops ( 8.45%) 0.0 GB/s ( 0.14%)
limited by gflops 8.45% algo=gemm
4 alexnet forward b=64 k=3 p=1 s=1 in=384 out=256 g=1 D=13 5.1 GFlops (11.08%) 0.0 GB/s ( 0.17%)
limited by gflops 11.08% algo=gemm
4 alexnet bwd-data b=64 k=3 p=1 s=1 in=384 out=256 g=1 D=13 1.1 GFlops ( 2.29%) 0.0 GB/s ( 0.03%)
limited by gflops 2.29% algo=gemm
4 alexnet bwd-filt b=64 k=3 p=1 s=1 in=384 out=256 g=1 D=13 4.1 GFlops ( 8.74%) 0.0 GB/s ( 0.15%)
limited by gflops 8.74% algo=gemm
5 resnet forward b=64 k=7 p=3 s=2 in=3 out=64 g=1 D=224 3.5 GFlops ( 7.63%) 0.1 GB/s ( 1.14%)
limited by gflops 7.63% algo=gemm
5 resnet bwd-data b=64 k=7 p=3 s=2 in=3 out=64 g=1 D=224 1.0 GFlops ( 2.17%) 0.0 GB/s ( 0.32%)
limited by gflops 2.17% algo=gemm
5 resnet bwd-filt b=64 k=7 p=3 s=2 in=3 out=64 g=1 D=224 3.0 GFlops ( 6.39%) 0.0 GB/s ( 0.96%)
limited by gflops 6.39% algo=gemm
6 resnet forward b=64 k=1 p=0 s=1 in=64 out=256 g=1 D=56 5.7 GFlops (12.31%) 0.2 GB/s ( 4.45%)
limited by gflops 12.31% algo=gemm
6 resnet bwd-data b=64 k=1 p=0 s=1 in=64 out=256 g=1 D=56 5.6 GFlops (12.11%) 0.2 GB/s ( 4.38%)
limited by gflops 12.11% algo=gemm
6 resnet bwd-filt b=64 k=1 p=0 s=1 in=64 out=256 g=1 D=56 7.0 GFlops (15.11%) 0.3 GB/s ( 5.47%)
limited by gflops 15.11% algo=gemm
7 resnet forward b=64 k=1 p=0 s=1 in=64 out=64 g=1 D=56 5.8 GFlops (12.55%) 0.4 GB/s ( 7.26%)
limited by gflops 12.55% algo=gemm
7 resnet bwd-data b=64 k=1 p=0 s=1 in=64 out=64 g=1 D=56 3.5 GFlops ( 7.51%) 0.2 GB/s ( 4.34%)
limited by gflops 7.51% algo=gemm
7 resnet bwd-filt b=64 k=1 p=0 s=1 in=64 out=64 g=1 D=56 7.0 GFlops (15.09%) 0.4 GB/s ( 8.73%)
limited by gflops 15.09% algo=gemm
8 resnet forward b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=56 4.7 GFlops (10.15%) 0.0 GB/s ( 0.65%)
limited by gflops 10.15% algo=gemm
8 resnet bwd-data b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=56 1.0 GFlops ( 2.13%) 0.0 GB/s ( 0.14%)
limited by gflops 2.13% algo=gemm
8 resnet bwd-filt b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=56 3.9 GFlops ( 8.48%) 0.0 GB/s ( 0.55%)
limited by gflops 8.48% algo=gemm
9 resnet forward b=64 k=1 p=0 s=2 in=1024 out=2048 g=1 D=14 7.8 GFlops (16.73%) 0.1 GB/s ( 1.01%)
limited by gflops 16.73% algo=gemm
9 resnet bwd-data b=64 k=1 p=0 s=2 in=1024 out=2048 g=1 D=14 4.3 GFlops ( 9.27%) 0.0 GB/s ( 0.56%)
limited by gflops 9.27% algo=gemm
9 resnet bwd-filt b=64 k=1 p=0 s=2 in=1024 out=2048 g=1 D=14 3.7 GFlops ( 8.01%) 0.0 GB/s ( 0.53%)
limited by gflops 8.01% algo=gemm
10 resnet forward b=64 k=1 p=0 s=1 in=1024 out=256 g=1 D=14 8.3 GFlops (17.98%) 0.1 GB/s ( 1.65%)
limited by gflops 17.98% algo=gemm
10 resnet bwd-data b=64 k=1 p=0 s=1 in=1024 out=256 g=1 D=14 5.7 GFlops (12.29%) 0.1 GB/s ( 1.13%)
limited by gflops 12.29% algo=gemm
10 resnet bwd-filt b=64 k=1 p=0 s=1 in=1024 out=256 g=1 D=14 7.1 GFlops (15.31%) 0.1 GB/s ( 1.43%)
limited by gflops 15.31% algo=gemm
11 resnet forward b=64 k=3 p=1 s=1 in=256 out=256 g=1 D=14 5.2 GFlops (11.11%) 0.0 GB/s ( 0.19%)
limited by gflops 11.11% algo=gemm
11 resnet bwd-data b=64 k=3 p=1 s=1 in=256 out=256 g=1 D=14 1.0 GFlops ( 2.15%) 0.0 GB/s ( 0.04%)
limited by gflops 2.15% algo=gemm
11 resnet bwd-filt b=64 k=3 p=1 s=1 in=256 out=256 g=1 D=14 4.0 GFlops ( 8.53%) 0.0 GB/s ( 0.16%)
limited by gflops 8.53% algo=gemm
12 vgg forward b=64 k=3 p=1 s=1 in=3 out=64 g=1 D=224 2.1 GFlops ( 4.59%) 0.2 GB/s ( 3.29%)
limited by gflops 4.59% algo=gemm
12 vgg bwd-data b=64 k=3 p=1 s=1 in=3 out=64 g=1 D=224 1.0 GFlops ( 2.13%) 0.1 GB/s ( 1.53%)
limited by gflops 2.13% algo=gemm
12 vgg bwd-filt b=64 k=3 p=1 s=1 in=3 out=64 g=1 D=224 1.7 GFlops ( 3.60%) 0.1 GB/s ( 2.58%)
limited by gflops 3.60% algo=gemm
13 vgg forward b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=224 13.2 GFlops (28.35%) 0.1 GB/s ( 1.82%)
limited by gflops 28.35% algo=gemm
13 vgg bwd-data b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=224 13.2 GFlops (28.35%) 0.1 GB/s ( 1.82%)
limited by gflops 28.35% algo=gemm
13 vgg bwd-filt b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=224 13.2 GFlops (28.35%) 0.1 GB/s ( 1.82%)
limited by gflops 28.35% algo=gemm
14 vgg forward b=64 k=3 p=1 s=1 in=512 out=512 g=1 D=28 13.2 GFlops (28.35%) 0.0 GB/s ( 0.24%)
limited by gflops 28.35% algo=gemm
14 vgg bwd-data b=64 k=3 p=1 s=1 in=512 out=512 g=1 D=28 13.2 GFlops (28.35%) 0.0 GB/s ( 0.24%)
limited by gflops 28.35% algo=gemm
14 vgg bwd-filt b=64 k=3 p=1 s=1 in=512 out=512 g=1 D=28 13.2 GFlops (28.35%) 0.0 GB/s ( 0.25%)
limited by gflops 28.35% algo=gemm
15 mobile forward b=64 k=3 p=1 s=2 in=3 out=32 g=1 D=224 1.2 GFlops ( 2.54%) 0.1 GB/s ( 2.39%)
limited by gflops 2.54% algo=gemm
15 mobile bwd-data b=64 k=3 p=1 s=2 in=3 out=32 g=1 D=224 0.3 GFlops ( 0.73%) 0.0 GB/s ( 0.69%)
limited by gflops 0.73% algo=gemm
15 mobile bwd-filt b=64 k=3 p=1 s=2 in=3 out=32 g=1 D=224 0.9 GFlops ( 1.88%) 0.1 GB/s ( 1.77%)
limited by gflops 1.88% algo=gemm
16 mobile forward b=64 k=3 p=1 s=1 in=144 out=144 g=144 D=56 1.5 GFlops ( 3.34%) 0.7 GB/s (13.73%)
limited by memory 13.73% algo=depthwise_separable
16 mobile bwd-data b=64 k=3 p=1 s=1 in=144 out=144 g=144 D=56 0.1 GFlops ( 0.32%) 0.1 GB/s ( 1.30%)
limited by memory 1.30% algo=depthwise_separable
16 mobile bwd-filt b=64 k=3 p=1 s=1 in=144 out=144 g=144 D=56 0.2 GFlops ( 0.42%) 0.1 GB/s ( 1.74%)
limited by memory 1.74% algo=depthwise_separable
17 mobile forward b=64 k=3 p=1 s=2 in=144 out=144 g=144 D=56 0.1 GFlops ( 0.16%) 0.1 GB/s ( 1.60%)
limited by memory 1.60% algo=gemm
17 mobile bwd-data b=64 k=3 p=1 s=2 in=144 out=144 g=144 D=56
0.0 GFlops ( 0.08%) 0.0 GB/s ( 0.80%) limited by memory 0.80% algo=gemm
17 mobile bwd-filt b=64 k=3 p=1 s=2 in=144 out=144 g=144 D=56
It seems to be stuck at this point.
46 GFlops is very low - it is a very weak GPU. What platform are you running on? Is it something ARM-based? Even an old Intel CPU has much better performance and more GFlops.
So it isn't clear what benefit you would get from using the GPU.
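As a rough sanity check on that number: the clinfo dump above reports a 792 MHz clock, and if the trailing "64" in the BXM-4-64 part name denotes FP32 FLOPs per clock (an assumption about Imagination's naming convention, not confirmed in this thread), the theoretical peak lands close to what dlprim_flops measured:

```python
# Back-of-envelope peak estimate for the PowerVR BXM-4-64.
# Assumptions (not confirmed in the thread): the trailing "64" in the part
# name is FP32 FLOPs per clock, with FMA already counted as 2 FLOPs.
clock_ghz = 0.792         # "Max clock frequency 792MHz" from clinfo above
flops_per_clock = 64      # assumed from the BXM-4-64 naming
peak_gflops = clock_ghz * flops_per_clock
print(f"theoretical peak ~ {peak_gflops:.1f} GFlops")  # ~50.7, vs 46.4 measured
```

Under that assumption the measured 46.4 GFlops is ~92% of peak, which would mean the kernels are running fine and the part is simply small - whether it can beat the host CPU is a separate question.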
emmm... This is a GPU used on a RISC-V platform; I'm just testing compatibility.
Actually, I'm getting the exact same RuntimeError as the initial post.
Using device: ocl:0
Accessing device #0:AMD Radeon RX 570 Series (radeonsi, polaris10, LLVM 17.0.6, DRM 3.57, 6.8.7-artix1-2) on Clover
Traceback (most recent call last):
File "/home/Python/dlprim_test/pytorch_dlprim/mnist.py", line 162, in <module>
main()
File "/home/Python/dlprim_test/pytorch_dlprim/mnist.py", line 153, in main
train(args, model, device, train_loader, optimizer, epoch)
File "/home/Python/dlprim_test/pytorch_dlprim/mnist.py", line 55, in train
loss.backward()
File "/home/Python/dlprim_test/dlprim/lib/python3.12/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/home/Python/dlprim_test/dlprim/lib/python3.12/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/home/Python/dlprim_test/dlprim/lib/python3.12/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Please register PrivateUse1HooksInterface by `RegisterPrivateUse1HooksInterface` first.
Building pytorch-dlprim ran successfully; however, the dlprimitives compilation failed with exit code 2 after most of the files had compiled. Unfortunately, I am not experienced enough in C++/Make/GCC (yet :P) to really figure out what is happening here: Make-output
But I was able to run the dlprim_flops test: dlprim_flops-output
I also get the exact same RuntimeError.
root@kylin-desktop:/home/liucong/dlprimitives/build# make test
Running tests...
Test project /home/liucong/dlprimitives/build
Start 1: test_test_case_abs
1/33 Test #1: test_test_case_abs ............... Passed 1.23 sec
Start 2: test_test_case_activation
2/33 Test #2: test_test_case_activation ........ Passed 4.38 sec
Start 3: test_test_case_batchnorm
3/33 Test #3: test_test_case_batchnorm ......... Passed 15.26 sec
Start 4: test_test_case_concat
4/33 Test #4: test_test_case_concat ............ Passed 0.55 sec
Start 5: test_test_case_conv2d
^Cmake: *** [Makefile:71: test] Interrupt
It seems to be stuck at this point.
# clinfo -l
Platform #0: AMD Accelerated Parallel Processing
`-- Device #0: gfx1032
Platform #1: NVIDIA CUDA
`-- Device #0: NVIDIA GeForce RTX 4060 Ti
(pytorch_cpu_env) root@kylin-desktop:/home/liucong/pytorch_dlprim# python mnist.py --device ocl:0
Using device: ocl:0
Accessing device #0:gfx1032 on AMD Accelerated Parallel Processing
Traceback (most recent call last):
File "/home/liucong/pytorch_dlprim/mnist.py", line 162, in <module>
main()
File "/home/liucong/pytorch_dlprim/mnist.py", line 153, in main
train(args, model, device, train_loader, optimizer, epoch)
File "/home/liucong/pytorch_dlprim/mnist.py", line 55, in train
loss.backward()
File "/home/liucong/pytorch_cpu_env/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/home/liucong/pytorch_cpu_env/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/home/liucong/pytorch_cpu_env/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Please register PrivateUse1HooksInterface by `RegisterPrivateUse1HooksInterface` first.
(pytorch_cpu_env) root@kylin-desktop:/home/liucong/pytorch_dlprim# python mnist.py --device ocl:1
Using device: ocl:1
Accessing device #1:NVIDIA GeForce RTX 4060 Ti on NVIDIA CUDA
Traceback (most recent call last):
File "/home/liucong/pytorch_dlprim/mnist.py", line 162, in <module>
main()
File "/home/liucong/pytorch_dlprim/mnist.py", line 153, in main
train(args, model, device, train_loader, optimizer, epoch)
File "/home/liucong/pytorch_dlprim/mnist.py", line 55, in train
loss.backward()
File "/home/liucong/pytorch_cpu_env/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/home/liucong/pytorch_cpu_env/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/home/liucong/pytorch_cpu_env/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Please register PrivateUse1HooksInterface by `RegisterPrivateUse1HooksInterface` first.
RuntimeError: Please register PrivateUse1HooksInterface by `RegisterPrivateUse1HooksInterface` first.
Related to this: https://github.com/artyom-beilis/pytorch_dlprim/issues/77
Some stuff changed in 2.3; please use pytorch 1.13 till I fix it.
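A minimal sketch of the version gate implied here - the 2.3 threshold is taken from the comment above, not independently verified, and in a real script you would pass in `torch.__version__` instead of a sample string:

```python
# Gate on the torch version that (per this thread) introduced the
# PrivateUse1HooksInterface registration requirement breaking pytorch_dlprim.
def needs_downgrade(version: str) -> bool:
    # Parse "major.minor" from a version string like "2.3.0" or "2.3.0+cu121";
    # pre-release suffixes in the minor component are not handled here.
    major, minor = (int(x) for x in version.split(".")[:2])
    return (major, minor) >= (2, 3)

print(needs_downgrade("2.3.0"))   # True  -> expect the RuntimeError above
print(needs_downgrade("1.13.1"))  # False -> version the maintainer suggests
```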
I'm not sure if it's a GPU problem or a pytorch version compatibility problem. This is my configuration information:
Did I miss some required dependencies?