Closed - masc-it closed this issue 3 years ago
So, build and install are successful, but tests fail:
./dlprim_benchmark 0:0 ../docs/nets_for_benchmark/resnet18-b16.js
Output:
Using: AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0) on Clover
Error:Failed to build program source sgemm with parameters -DTILE_SIZE_M=64 -DTILE_SIZE_N=64 -DBLOCK_SIZE_M=4 -DBLOCK_SIZE_N=4 -DTILE_SIZE_K=16 -DTILE_OFFSET=0 -DBIAS=0 -DATRANS=0 -DBTRANS=1 -DIM2COL_OCHAN=12544 -DCONVGEMM=1 -DKERN_H=7 -DKERN_W=7 -DDILATE_H=1 -DDILATE_W=1 -DPAD_H=3 -DPAD_W=3 -DSTRIDE_H=2 -DSTRIDE_W=2 -DGROUPS=1 -DCHANNELS_IN=3 -DSRC_COLS=224 -DSRC_ROWS=224 -DIMG_COLS=112 -DIMG_ROWS=112 -DREDUCE_K=1 -DACTIVATION=0 log:
For device: AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0)
fatal error: cannot open file '/usr/lib/clc/gfx1010-amdgcn-mesa-mesa3d.bc': No such file or directory
Some driver-related libraries seem to be missing, but I have already installed mesa...
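For reference, a quick way to check whether the libclc bitcode that Clover is asking for is present at all (the package names are only a guess and may differ per distro/version):
ls /usr/lib/clc/ 2>/dev/null | grep gfx1010   # is any gfx1010 bitcode installed?
dpkg -S gfx1010-amdgcn-mesa-mesa3d.bc         # which package (if any) owns the file
apt-cache search libclc                       # candidate packages shipping libclc bitcode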
Hi,
what is the output of clinfo --list?
(if not installed: apt install clinfo)
What drivers are you using? You have 3 options:
- amdgpu - with clover mesa (not 100% sure it supports RDNA - clover does not see my 6600xt)
- rocm
- amdgpu-pro (it is based on rocm as far as I know)
What do you use?
I mean, I see you use Clover - this may be the issue, because it clearly fails at the OpenCL-driver level. I'd suggest trying either rocm or amdgpu-pro.
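As a side note, assuming a standard ICD loader setup, you can also see which OpenCL implementations are registered on the system with:
ls /etc/OpenCL/vendors/         # one .icd file per registered OpenCL implementation
cat /etc/OpenCL/vendors/*.icd   # each file simply names the driver library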
Hi,
what is the output of
clinfo --list
(if not installed: apt install clinfo)
What drivers are you using? You have 3 options:
- amdgpu - with clover mesa (not 100% sure it supports RDNA - clover does not see my 6600xt)
- rocm
- amdgpu-pro (it is based on rocm as far as I know)
What do you use?
Platform #0: Clover
`-- Device #0: AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0)
Platform #1: AMD Accelerated Parallel Processing
Is it possible to install only opencl (rocm) without the rocm full-stack? I had bad issues with it in the past.
I think you should be OK with the AMDGPU-Pro versions.
I see you do have AMD Accelerated Parallel Processing, but it is not clear which one. AMDGPU-Pro comes with several versions:
ii opencl-orca-amdgpu-pro-icd:amd64 21.30-1286092 amd64 non-free AMD OpenCL ICD Loaders
ii opencl-rocr-amdgpu-pro:amd64 21.30-1286092 amd64 ROCr OpenCL Runtime
The first one is for older cards AFAIK (it runs my rx560m but not the 6600xt); the second one, opencl-rocr-amdgpu-pro, runs the 6600xt.
What OpenCL drivers have you installed?
Is it possible to install only opencl (rocm) without the rocm full-stack? I had bad issues with it in the past.
I don't really know; I had some issues with the 6600xt and Vulkan support, but no issues with the rx560.
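To see what is actually installed, something along these lines should work (exact package names vary between driver releases):
dpkg -l | grep -E -i 'opencl|rocm|amdgpu'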
This is the full clinfo output:
Here too you can notice the fatal error: cannot open file '/usr/lib/clc/gfx1010-amdgcn-mesa-mesa3d.bc': No such file or directory - the same one I got when running the benchmarks.
Number of platforms 2
Platform Name Clover
Platform Vendor Mesa
Platform Version OpenCL 1.1 Mesa 21.2.1
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd
Platform Extensions function suffix MESA
Platform Name AMD Accelerated Parallel Processing
Platform Vendor Advanced Micro Devices, Inc.
Platform Version OpenCL 2.0 AMD-APP (3246.0)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_amd_event_callback
Platform Extensions function suffix AMD
Platform Name Clover
Number of devices 1
Device Name AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0)
Device Vendor AMD
Device Vendor ID 0x1002
Device Version OpenCL 1.1 Mesa 21.2.1
Driver Version 21.2.1
Device OpenCL C Version OpenCL C 1.1
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Max compute units 36
Max clock frequency 1780MHz
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
=== CL_PROGRAM_BUILD_LOG ===
fatal error: cannot open file '/usr/lib/clc/gfx1010-amdgcn-mesa-mesa3d.bc': No such file or directory
Preferred work group size multiple <getWGsizes:1200: create kernel : error -46>
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 2 / 2
half 0 / 0 (n/a)
float 4 / 4
double 2 / 2 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 6442450944 (6GiB)
Error Correction support No
Max memory allocation 5153960755 (4.8GiB)
Unified memory for Host and Device No
Minimum alignment for any data type 128 bytes
Alignment of base address 32768 bits (4096 bytes)
Global Memory cache type None
Image support No
Local memory type Local
Local memory size 32768 (32KiB)
Max number of constant args 16
Max constant buffer size 67108864 (64MiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Profiling timer resolution 0ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Device Extensions cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64 cl_khr_extended_versioning
Platform Name AMD Accelerated Parallel Processing
Number of devices 1
Device Name gfx1010:xnack+
Device Vendor Advanced Micro Devices, Inc.
Device Vendor ID 0x1002
Device Version OpenCL 2.0
Driver Version 3246.0 (HSA1.1,LC)
Device OpenCL C Version OpenCL C 2.0
Device Type GPU
Device Board Name (AMD) Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
Device Topology (AMD) PCI-E, 0a:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 18
SIMD per compute unit (AMD) 4
SIMD width (AMD) 32
SIMD instruction width (AMD) 1
Max clock frequency 1780MHz
Graphics IP (AMD) 10.1
Device Partition (core)
Max number of sub-devices 18
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 256
Preferred work group size (AMD) 256
Max work group size (AMD) 1024
Preferred work group size multiple 32
Wavefront width (AMD) 32
Preferred / native vector sizes
char 4 / 4
short 2 / 2
int 1 / 1
long 1 / 1
half 1 / 1 (cl_khr_fp16)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs No
Round to nearest No
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 6425673728 (5.984GiB)
Global free memory (AMD) 6275072 (5.984GiB)
Global memory channels (AMD) 6
Global memory banks per channel (AMD) 4
Global memory bank width (AMD) 256 bytes
Error Correction support No
Max memory allocation 5461822664 (5.087GiB)
Unified memory for Host and Device No
Shared Virtual Memory (SVM) capabilities (core)
Coarse-grained buffer sharing Yes
Fine-grained buffer sharing Yes
Fine-grained system sharing No
Atomics No
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Preferred alignment for atomics
SVM 0 bytes
Global 0 bytes
Local 0 bytes
Max size for global variable 5461822664 (5.087GiB)
Preferred total size of global vars 6425673728 (5.984GiB)
Global Memory cache type Read/Write
Global Memory cache size 16384 (16KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 29471
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 8192 images
Base address alignment for 2D image buffers 256 bytes
Pitch alignment for 2D image buffers 256 pixels
Max 2D image size 16384x16384 pixels
Max 3D image size 16384x16384x8192 pixels
Max number of read image args 128
Max number of write image args 8
Max number of read/write image args 64
Max number of pipe args 16
Max active pipe reservations 16
Max pipe packet size 1166855368 (1.087GiB)
Local memory type Local
Local memory size 65536 (64KiB)
Local memory syze per CU (AMD) 65536 (64KiB)
Local memory banks (AMD) 32
Max number of constant args 8
Max constant buffer size 5461822664 (5.087GiB)
Preferred constant buffer size (AMD) 16384 (16KiB)
Max size of kernel argument 1024
Queue properties (on host)
Out-of-order execution No
Profiling Yes
Queue properties (on device)
Out-of-order execution Yes
Profiling Yes
Preferred size 262144 (256KiB)
Max size 8388608 (8MiB)
Max queues on device 1
Max events on device 1024
Prefer user sync for interop Yes
Number of P2P devices (AMD) 0
P2P devices (AMD) <printDeviceInfo:147: get number of CL_DEVICE_P2P_DEVICES_AMD : error -30>
Profiling timer resolution 1ns
Profiling timer offset since Epoch (AMD) 0ns (Thu Jan 1 01:00:00 1970)
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Thread trace supported (AMD) No
Number of async queues (AMD) 8
Max real-time compute queues (AMD) 8
Max real-time compute units (AMD) 18
printf() buffer size 4194304 (4MiB)
Built-in kernels (n/a)
Device Extensions cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [MESA]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Clover
Device Name AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0)
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Clover
Device Name AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0)
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Clover
Device Name AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0)
Now run the benchmark on platform 1 device 0:
./dlprim_benchmark 1:0 ../docs/nets_for_benchmark/resnet18-b16.js
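For reference, the first argument is platform:device, using the clinfo --list numbering shown earlier:
clinfo --list
# Platform #0: Clover, Device #0 -> 0:0 (the run that failed on the missing libclc bitcode)
# Platform #1: AMD Accelerated Parallel Processing, Device #0: gfx1010:xnack+ -> 1:0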
Great, it seems to be working.
Just a few notes from my machine:
Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5 25.046
Step -4 22.358
Step -3 22.521
Step -2 22.393
Step -1 22.439
Step 0 22.407
Step 1 22.475
Step 2 22.391
Step 3 22.311
Step 4 22.256
Step 5 22.360
Step 6 22.405
Step 7 22.298
Step 8 22.450
Step 9 22.476
Step 10 22.367
Step 11 22.324
Step 12 22.383
Step 13 22.364
Step 14 22.261
Step 15 22.370
Step 16 22.477
Step 17 22.300
Step 18 22.372
Step 19 22.394
Time per sample: 1.398 ms
TOT time per batch: 22.372 ms
resnet50-b16 benchmark:
./dlprim_benchmark 1:0 ../docs/nets_for_benchmark/resnet50-b16.js
Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5 106.212
Step -4 105.018
Step -3 104.995
Step -2 105.014
Step -1 104.974
Step 0 105.007
Step 1 105.024
Step 2 105.011
Step 3 104.891
Step 4 104.869
Step 5 104.952
Step 6 104.965
Step 7 105.018
Step 8 105.013
Step 9 104.935
Step 10 105.007
Step 11 104.943
Step 12 104.939
Step 13 105.009
Step 14 104.806
Step 15 104.904
Step 16 104.921
Step 17 104.814
Step 18 105.209
Step 19 104.840
Time per sample: 6.560 ms
TOT time per batch: 104.954 ms
./dlprim_benchmark 1:0 ../docs/nets_for_benchmark/vgg16-b16.js
Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5 86.850
Step -4 85.730
Step -3 85.740
Step -2 86.006
Step -1 85.971
Step 0 86.006
Step 1 86.106
Step 2 86.018
Step 3 85.990
Step 4 86.043
Step 5 85.939
Step 6 86.034
Step 7 86.304
Step 8 86.118
Step 9 86.011
Step 10 85.971
Step 11 86.081
Step 12 86.041
Step 13 86.051
Step 14 86.255
Step 15 86.141
Step 16 86.113
Step 17 85.986
Step 18 85.970
Step 19 86.082
Time per sample: 5.379 ms
TOT time per batch: 86.063 ms
Cool, now a few comments:
If you want Python support you need to install boost python3 and boost numpy3 and rebuild, so you can train from Python.
Also, please run ./dlprim_bench 1:0 4
to see how well it is optimized for RDNA 1.
You can also add the -b flag to test training times.
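On Ubuntu the Python prerequisites would be roughly the following (package names are an assumption and may differ between releases); after installing, re-run cmake and make so the pydlprim module gets built:
sudo apt install libboost-python-dev libboost-numpy-dev python3-numpy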
Will do soon. I have updated the previous comment with the other benchmarks.
Cool, now a few comments:
If you want Python support you need to install boost python3 and boost numpy3 and rebuild, so you can train from Python.
Also, please run ./dlprim_bench 1:0 4
to see how well it is optimized for RDNA 1.
./dlprim_benchmark 1:0 4
Error:Failed to load json from 4, syntax error at line 1
./dlprim_benchmark -b 1:0 ../docs/nets_for_benchmark/vgg16-b16.json
Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5 353.770 86.045 267.725
Step -4 352.110 85.369 266.741
Step -3 352.927 85.225 267.702
Step -2 352.906 85.415 267.491
Step -1 350.841 85.374 265.468
Step 0 351.326 85.341 265.985
Step 1 352.796 85.370 267.426
Step 2 352.049 85.484 266.565
Step 3 352.823 85.602 267.221
Step 4 352.146 85.302 266.844
Step 5 351.188 85.428 265.760
Step 6 351.982 85.418 266.564
Step 7 352.618 85.845 266.773
Step 8 352.330 85.577 266.753
Step 9 352.272 85.588 266.684
Step 10 351.079 85.550 265.529
Step 11 352.096 85.725 266.371
Step 12 351.580 85.841 265.740
Step 13 352.896 85.731 267.165
Step 14 353.416 85.628 267.787
Step 15 353.156 85.588 267.568
Step 16 353.639 85.559 268.080
Step 17 352.277 85.589 266.688
Step 18 354.547 85.590 268.956
Step 19 353.859 85.491 268.367
Time per sample: 22.031 ms
FWD time per batch: 85.562 ms
BWD time per batch: 266.941 ms
TOT time per batch: 352.504 ms
./dlprim_benchmark -b 1:0 ../docs/nets_for_benchmark/resnet50-b16.json
Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5 379.391 125.199 254.192
Step -4 378.747 124.355 254.392
Step -3 377.416 124.103 253.313
Step -2 376.796 124.041 252.754
Step -1 378.070 124.030 254.040
Step 0 378.771 124.463 254.308
Step 1 378.742 124.271 254.471
Step 2 378.577 124.467 254.110
Step 3 377.015 124.178 252.837
Step 4 378.414 123.964 254.450
Step 5 378.971 124.217 254.754
Step 6 378.131 124.310 253.821
Step 7 379.306 124.186 255.121
Step 8 378.495 124.412 254.084
Step 9 378.294 124.175 254.119
Step 10 378.118 124.102 254.016
Step 11 379.100 124.544 254.555
Step 12 379.154 124.618 254.537
Step 13 378.063 124.044 254.019
Step 14 378.734 124.403 254.331
Step 15 378.873 124.323 254.550
Step 16 378.347 124.337 254.010
Step 17 379.239 124.589 254.650
Step 18 377.577 124.037 253.540
Step 19 378.980 124.305 254.675
Time per sample: 23.659 ms
FWD time per batch: 124.297 ms
BWD time per batch: 254.248 ms
TOT time per batch: 378.545 ms
Batch size 32: ./dlprim_benchmark -b 1:0 ../docs/nets_for_benchmark/resnet50-b32.json
Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (32,3,224,224)
Outputs
- loss: (32,1000)
Step -5 681.751 228.653 453.098
Step -4 679.952 226.915 453.037
Step -3 680.563 226.570 453.993
Step -2 678.810 226.432 452.379
Step -1 679.840 226.961 452.879
Step 0 682.178 226.847 455.331
Step 1 684.913 228.449 456.463
Step 2 680.624 226.812 453.811
Step 3 677.997 226.559 451.438
Step 4 679.704 226.519 453.185
Step 5 679.970 227.248 452.722
Step 6 680.107 226.908 453.199
Step 7 682.091 227.355 454.736
Step 8 681.425 226.812 454.613
Step 9 682.112 227.335 454.777
Step 10 680.621 227.558 453.062
Step 11 681.201 227.417 453.785
Step 12 681.783 227.400 454.384
Step 13 683.397 228.715 454.682
Step 14 681.637 227.780 453.857
Step 15 681.690 227.235 454.455
Step 16 685.070 228.631 456.439
Step 17 681.405 227.206 454.199
Step 18 679.156 226.926 452.230
Step 19 681.386 227.041 454.345
Time per sample: 21.294 ms
FWD time per batch: 227.338 ms
BWD time per batch: 454.086 ms
TOT time per batch: 681.423 ms
./dlprim_benchmark -b 1:0 ../docs/nets_for_benchmark/resnet18-b16.json
Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5 83.440 26.896 56.544
Step -4 81.696 25.063 56.633
Step -3 81.949 25.267 56.682
Step -2 81.829 25.062 56.767
Step -1 81.736 25.103 56.633
Step 0 81.865 25.095 56.770
Step 1 81.740 25.068 56.671
Step 2 81.858 25.112 56.745
Step 3 81.969 25.150 56.819
Step 4 81.859 25.122 56.736
Step 5 81.890 25.081 56.809
Step 6 82.018 25.195 56.823
Step 7 81.895 25.089 56.806
Step 8 81.932 25.187 56.745
Step 9 81.817 25.098 56.719
Step 10 81.853 25.117 56.736
Step 11 81.861 25.147 56.715
Step 12 82.163 25.303 56.860
Step 13 81.848 25.108 56.741
Step 14 81.933 25.162 56.771
Step 15 81.901 25.153 56.748
Step 16 81.852 25.158 56.693
Step 17 81.819 25.082 56.737
Step 18 82.170 25.309 56.861
Step 19 81.808 25.138 56.670
Time per sample: 5.119 ms
FWD time per batch: 25.144 ms
BWD time per batch: 56.759 ms
TOT time per batch: 81.902 ms
./dlprim_benchmark -b 1:0 ../docs/nets_for_benchmark/mobilenet_v2-b16.json
Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5 131.939 38.944 92.995
Step -4 129.773 37.242 92.531
Step -3 129.645 37.257 92.389
Step -2 129.004 37.209 91.794
Step -1 129.638 37.212 92.426
Step 0 129.607 37.224 92.383
Step 1 129.527 37.204 92.323
Step 2 129.705 37.244 92.461
Step 3 129.596 37.211 92.385
Step 4 129.936 37.212 92.724
Step 5 129.977 37.213 92.764
Step 6 129.850 37.282 92.568
Step 7 129.749 37.216 92.533
Step 8 130.371 37.281 93.090
Step 9 129.819 37.139 92.680
Step 10 129.452 37.140 92.312
Step 11 130.007 37.252 92.755
Step 12 129.380 37.209 92.171
Step 13 129.932 37.180 92.752
Step 14 129.602 37.217 92.385
Step 15 129.497 37.066 92.431
Step 16 129.894 37.230 92.664
Step 17 129.765 37.156 92.609
Step 18 129.481 37.159 92.322
Step 19 129.886 37.231 92.656
Time per sample: 8.109 ms
FWD time per batch: 37.203 ms
BWD time per batch: 92.548 ms
TOT time per batch: 129.752 ms
Batch size 32: ./dlprim_benchmark -b 1:0 ../docs/nets_for_benchmark/mobilenet_v2-b32.json
Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (32,3,224,224)
Outputs
- loss: (32,1000)
Step -5 243.190 70.505 172.684
Step -4 241.889 69.351 172.537
Step -3 242.756 69.472 173.284
Step -2 241.985 69.558 172.426
Step -1 242.055 69.562 172.493
Step 0 242.479 69.397 173.082
Step 1 242.188 69.285 172.903
Step 2 242.308 69.254 173.053
Step 3 242.153 69.188 172.965
Step 4 242.305 69.187 173.118
Step 5 242.162 69.397 172.765
Step 6 241.815 69.275 172.540
Step 7 241.813 69.420 172.393
Step 8 242.101 69.418 172.683
Step 9 242.001 69.285 172.716
Step 10 242.652 69.407 173.244
Step 11 241.795 69.295 172.499
Step 12 242.038 69.334 172.704
Step 13 242.114 69.348 172.766
Step 14 242.507 69.528 172.980
Step 15 241.982 69.531 172.451
Step 16 242.181 69.355 172.827
Step 17 241.866 69.325 172.541
Step 18 242.328 69.446 172.882
Step 19 242.225 69.539 172.686
Time per sample: 7.567 ms
FWD time per batch: 69.361 ms
BWD time per batch: 172.790 ms
TOT time per batch: 242.151 ms
Cool, now a few comments: If you want Python support you need to install boost python3 and boost numpy3 and rebuild, so you can train from Python. Also, please run ./dlprim_bench 1:0 4 to see how well it is optimized for RDNA 1.
./dlprim_benchmark 1:0 4 Error:Failed to load json from 4, syntax error at line 1
My bad, I meant: ./dlprim_flops 1:0 4
BTW, I put the documentation online... not complete yet, but already useful.
Cool, now a few comments: If you want Python support you need to install boost python3 and boost numpy3 and rebuild, so you can train from Python. Also, please run ./dlprim_bench 1:0 4 to see how well it is optimized for RDNA 1.
./dlprim_benchmark 1:0 4 Error:Failed to load json from 4, syntax error at line 1
My bad, I meant:
./dlprim_flops 1:0 4
Testing on gfx1010:xnack+ on AMD Accelerated Parallel Processing
Testing memory speed
- Vector size 1
-- Warming
-- Running 279.363 GB/s
- Vector size 2
-- Warming
-- Running 293.186 GB/s
- Vector size 4
-- Warming
-- Running 291.768 GB/s
- Vector size 8
-- Warming
-- Running 295.808 GB/s
- Vector size 16
-- Warming
-- Running 289.269 GB/s
Testing flops float
- Vector size 1
-- Warming
-- Running 7300.63 GFlops
- Vector size 2
-- Warming
-- Running 7417.01 GFlops
- Vector size 4
-- Warming
-- Running 7376.02 GFlops
- Vector size 8
-- Warming
-- Running 7317.28 GFlops
- Vector size 16
-- Warming
-- Running 7292.22 GFlops
Testing flops half
- Vector size 1
-- Warming
-- Running 7354.08 GFlops
- Vector size 2
-- Warming
-- Running 14427.6 GFlops
- Vector size 4
-- Warming
-- Running 14234.9 GFlops
- Vector size 8
-- Warming
-- Running 14159.9 GFlops
- Vector size 16
-- Warming
-- Running 14278.7 GFlops
Summray for gfx1010:xnack+ on AMD Accelerated Parallel Processing
Peak GFlops for float 7417.01
Peak GFlops for half 14427.6
Peak memory 295.808 GB/s
GEMM
NN 0: 512, 512, 512 1974.2 GFlops (26.62%) 23.2 GB/s ( 8.01%) limited by gflops 26.62%
NN 1: 1024, 1024, 1024 3306.0 GFlops (44.57%) 19.4 GB/s ( 6.70%) limited by gflops 44.57%
NN 2: 1025, 1025, 1025 2739.2 GFlops (36.93%) 16.0 GB/s ( 5.55%) limited by gflops 36.93%
NN 3: 2048, 2048, 2048 3749.6 GFlops (50.55%) 11.0 GB/s ( 3.80%) limited by gflops 50.55%
NN 4: 2049, 2049, 2049 3379.1 GFlops (45.56%) 9.9 GB/s ( 3.42%) limited by gflops 45.56%
NN 5: 64, 2048, 64 896.4 GFlops (12.09%) 57.4 GB/s (19.83%) limited by memory 19.83%
NN 6: 2048, 64, 2048 1591.9 GFlops (21.46%) 52.9 GB/s (18.28%) limited by gflops 21.46%
NN 7: 2048, 2048, 64 1851.3 GFlops (24.96%) 62.0 GB/s (21.42%) limited by gflops 24.96%
NN 8: 2048, 64, 64 864.7 GFlops (11.66%) 55.3 GB/s (19.12%) limited by memory 19.12%
NN 9: 64, 2048, 2048 2236.7 GFlops (30.16%) 74.3 GB/s (25.68%) limited by gflops 30.16%
NN 10: 64, 64, 2048 483.4 GFlops ( 6.52%) 30.7 GB/s (10.61%) limited by memory 10.61%
NT 0: 512, 512, 512 1353.0 GFlops (18.24%) 15.9 GB/s ( 5.49%) limited by gflops 18.24%
NT 1: 1024, 1024, 1024 2149.1 GFlops (28.97%) 12.6 GB/s ( 4.36%) limited by gflops 28.97%
NT 2: 1025, 1025, 1025 2443.7 GFlops (32.95%) 14.3 GB/s ( 4.95%) limited by gflops 32.95%
NT 3: 2048, 2048, 2048 3312.4 GFlops (44.66%) 9.7 GB/s ( 3.36%) limited by gflops 44.66%
NT 4: 2049, 2049, 2049 3380.5 GFlops (45.58%) 9.9 GB/s ( 3.42%) limited by gflops 45.58%
NT 5: 64, 2048, 64 885.1 GFlops (11.93%) 56.6 GB/s (19.58%) limited by memory 19.58%
NT 6: 2048, 64, 2048 1431.0 GFlops (19.29%) 47.5 GB/s (16.43%) limited by gflops 19.29%
NT 7: 2048, 2048, 64 1831.0 GFlops (24.69%) 61.3 GB/s (21.18%) limited by gflops 24.69%
NT 8: 2048, 64, 64 873.2 GFlops (11.77%) 55.9 GB/s (19.31%) limited by memory 19.31%
NT 9: 64, 2048, 2048 1672.4 GFlops (22.55%) 55.5 GB/s (19.20%) limited by gflops 22.55%
NT 10: 64, 64, 2048 467.7 GFlops ( 6.31%) 29.7 GB/s (10.27%) limited by memory 10.27%
TN 0: 512, 512, 512 2108.0 GFlops (28.42%) 24.7 GB/s ( 8.55%) limited by gflops 28.42%
TN 1: 1024, 1024, 1024 3394.5 GFlops (45.77%) 19.9 GB/s ( 6.88%) limited by gflops 45.77%
TN 2: 1025, 1025, 1025 2666.2 GFlops (35.95%) 15.6 GB/s ( 5.40%) limited by gflops 35.95%
TN 3: 2048, 2048, 2048 3821.7 GFlops (51.53%) 11.2 GB/s ( 3.87%) limited by gflops 51.53%
TN 4: 2049, 2049, 2049 3381.2 GFlops (45.59%) 9.9 GB/s ( 3.42%) limited by gflops 45.59%
TN 5: 64, 2048, 64 846.8 GFlops (11.42%) 54.2 GB/s (18.73%) limited by memory 18.73%
TN 6: 2048, 64, 2048 2182.0 GFlops (29.42%) 72.5 GB/s (25.05%) limited by gflops 29.42%
TN 7: 2048, 2048, 64 1886.7 GFlops (25.44%) 63.1 GB/s (21.83%) limited by gflops 25.44%
TN 8: 2048, 64, 64 893.8 GFlops (12.05%) 57.2 GB/s (19.77%) limited by memory 19.77%
TN 9: 64, 2048, 2048 2274.6 GFlops (30.67%) 75.5 GB/s (26.11%) limited by gflops 30.67%
TN 10: 64, 64, 2048 468.2 GFlops ( 6.31%) 29.7 GB/s (10.28%) limited by memory 10.28%
TT 0: 512, 512, 512 1764.9 GFlops (23.79%) 20.7 GB/s ( 7.16%) limited by gflops 23.79%
TT 1: 1024, 1024, 1024 3318.9 GFlops (44.75%) 19.5 GB/s ( 6.73%) limited by gflops 44.75%
TT 2: 1025, 1025, 1025 2583.8 GFlops (34.84%) 15.1 GB/s ( 5.23%) limited by gflops 34.84%
TT 3: 2048, 2048, 2048 3683.3 GFlops (49.66%) 10.8 GB/s ( 3.73%) limited by gflops 49.66%
TT 4: 2049, 2049, 2049 3377.4 GFlops (45.54%) 9.9 GB/s ( 3.42%) limited by gflops 45.54%
TT 5: 64, 2048, 64 891.8 GFlops (12.02%) 57.1 GB/s (19.72%) limited by memory 19.72%
TT 6: 2048, 64, 2048 2177.6 GFlops (29.36%) 72.3 GB/s (25.00%) limited by gflops 29.36%
TT 7: 2048, 2048, 64 1877.0 GFlops (25.31%) 62.8 GB/s (21.71%) limited by gflops 25.31%
TT 8: 2048, 64, 64 892.6 GFlops (12.04%) 57.1 GB/s (19.74%) limited by memory 19.74%
TT 9: 64, 2048, 2048 1823.2 GFlops (24.58%) 60.6 GB/s (20.93%) limited by gflops 24.58%
TT 10: 64, 64, 2048 474.2 GFlops ( 6.39%) 30.1 GB/s (10.41%) limited by memory 10.41%
Convolution
0 alexnet forward b=64 k=11 p=2 s=4 in=3 out=64 g=1 D=224 2553.4 GFlops (34.43%) 25.0 GB/s ( 8.65%) limited by gflops 34.43% algo=gemm
0 alexnet bwd-data b=64 k=11 p=2 s=4 in=3 out=64 g=1 D=224 936.3 GFlops (12.62%) 9.2 GB/s ( 3.17%) limited by gflops 12.62% algo=gemm
0 alexnet bwd-filt b=64 k=11 p=2 s=4 in=3 out=64 g=1 D=224 1645.9 GFlops (22.19%) 16.2 GB/s ( 5.58%) limited by gflops 22.19% algo=gemm
1 alexnet forward b=64 k=5 p=2 s=1 in=96 out=192 g=2 D=27 2043.7 GFlops (27.55%) 5.2 GB/s ( 1.80%) limited by gflops 27.55% algo=gemm
1 alexnet bwd-data b=64 k=5 p=2 s=1 in=96 out=192 g=2 D=27 935.0 GFlops (12.61%) 2.4 GB/s ( 0.82%) limited by gflops 12.61% algo=gemm
1 alexnet bwd-filt b=64 k=5 p=2 s=1 in=96 out=192 g=2 D=27 1475.1 GFlops (19.89%) 3.8 GB/s ( 1.32%) limited by gflops 19.89% algo=gemm
2 alexnet forward b=64 k=5 p=2 s=1 in=64 out=192 g=1 D=27 2707.0 GFlops (36.50%) 4.6 GB/s ( 1.60%) limited by gflops 36.50% algo=gemm
2 alexnet bwd-data b=64 k=5 p=2 s=1 in=64 out=192 g=1 D=27 1004.3 GFlops (13.54%) 1.7 GB/s ( 0.59%) limited by gflops 13.54% algo=gemm
2 alexnet bwd-filt b=64 k=5 p=2 s=1 in=64 out=192 g=1 D=27 1931.3 GFlops (26.04%) 3.4 GB/s ( 1.17%) limited by gflops 26.04% algo=gemm
3 alexnet forward b=64 k=3 p=1 s=1 in=384 out=256 g=1 D=13 5807.3 GFlops (78.30%) 9.5 GB/s ( 3.28%) limited by gflops 78.30% algo=winograd
3 alexnet bwd-data b=64 k=3 p=1 s=1 in=384 out=256 g=1 D=13 5359.8 GFlops (72.26%) 8.7 GB/s ( 3.02%) limited by gflops 72.26% algo=winograd
3 alexnet bwd-filt b=64 k=3 p=1 s=1 in=384 out=256 g=1 D=13 5112.9 GFlops (68.94%) 9.3 GB/s ( 3.21%) limited by gflops 68.94% algo=winograd
4 resnet forward b=64 k=7 p=3 s=2 in=3 out=64 g=1 D=224 2263.1 GFlops (30.51%) 36.6 GB/s (12.64%) limited by gflops 30.51% algo=gemm
4 resnet bwd-data b=64 k=7 p=3 s=2 in=3 out=64 g=1 D=224 1183.1 GFlops (15.95%) 19.1 GB/s ( 6.61%) limited by gflops 15.95% algo=gemm
4 resnet bwd-filt b=64 k=7 p=3 s=2 in=3 out=64 g=1 D=224 900.3 GFlops (12.14%) 14.5 GB/s ( 5.03%) limited by gflops 12.14% algo=gemm
5 resnet forward b=64 k=1 p=0 s=1 in=64 out=256 g=1 D=56 863.3 GFlops (11.64%) 33.7 GB/s (11.66%) limited by memory 11.66% algo=gemm
5 resnet bwd-data b=64 k=1 p=0 s=1 in=64 out=256 g=1 D=56 2492.5 GFlops (33.61%) 97.4 GB/s (33.67%) limited by memory 33.67% algo=gemm
5 resnet bwd-filt b=64 k=1 p=0 s=1 in=64 out=256 g=1 D=56 1976.0 GFlops (26.64%) 77.2 GB/s (26.70%) limited by memory 26.70% algo=gemm
6 resnet forward b=64 k=1 p=0 s=1 in=64 out=64 g=1 D=56 1991.7 GFlops (26.85%) 124.5 GB/s (43.04%) limited by memory 43.04% algo=gemm
6 resnet bwd-data b=64 k=1 p=0 s=1 in=64 out=64 g=1 D=56 1752.7 GFlops (23.63%) 109.6 GB/s (37.87%) limited by memory 37.87% algo=gemm
6 resnet bwd-filt b=64 k=1 p=0 s=1 in=64 out=64 g=1 D=56 683.0 GFlops ( 9.21%) 42.7 GB/s (14.76%) limited by memory 14.76% algo=gemm
7 resnet forward b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=56 5338.1 GFlops (71.97%) 37.1 GB/s (12.83%) limited by gflops 71.97% algo=winograd
7 resnet bwd-data b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=56 2750.2 GFlops (37.08%) 19.1 GB/s ( 6.61%) limited by gflops 37.08% algo=winograd
7 resnet bwd-filt b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=56 4882.9 GFlops (65.83%) 34.0 GB/s (11.76%) limited by gflops 65.83% algo=winograd
8 resnet forward b=64 k=1 p=0 s=2 in=1024 out=2048 g=1 D=14 935.6 GFlops (12.61%) 6.1 GB/s ( 2.10%) limited by gflops 12.61% algo=gemm
8 resnet bwd-data b=64 k=1 p=0 s=2 in=1024 out=2048 g=1 D=14 925.6 GFlops (12.48%) 6.0 GB/s ( 2.08%) limited by gflops 12.48% algo=gemm
8 resnet bwd-filt b=64 k=1 p=0 s=2 in=1024 out=2048 g=1 D=14 914.9 GFlops (12.33%) 6.5 GB/s ( 2.26%) limited by gflops 12.33% algo=gemm
9 resnet forward b=64 k=1 p=0 s=1 in=1024 out=256 g=1 D=14 876.9 GFlops (11.82%) 8.7 GB/s ( 3.01%) limited by gflops 11.82% algo=gemm
9 resnet bwd-data b=64 k=1 p=0 s=1 in=1024 out=256 g=1 D=14 961.2 GFlops (12.96%) 9.5 GB/s ( 3.30%) limited by gflops 12.96% algo=gemm
9 resnet bwd-filt b=64 k=1 p=0 s=1 in=1024 out=256 g=1 D=14 701.1 GFlops ( 9.45%) 7.1 GB/s ( 2.44%) limited by gflops 9.45% algo=gemm
10 resnet forward b=64 k=3 p=1 s=1 in=256 out=256 g=1 D=14 6244.5 GFlops (84.19%) 11.8 GB/s ( 4.09%) limited by gflops 84.19% algo=winograd
10 resnet bwd-data b=64 k=3 p=1 s=1 in=256 out=256 g=1 D=14 5557.8 GFlops (74.93%) 10.5 GB/s ( 3.64%) limited by gflops 74.93% algo=winograd
10 resnet bwd-filt b=64 k=3 p=1 s=1 in=256 out=256 g=1 D=14 4982.5 GFlops (67.18%) 10.2 GB/s ( 3.54%) limited by gflops 67.18% algo=winograd
11 vgg forward b=64 k=3 p=1 s=1 in=3 out=64 g=1 D=224 965.6 GFlops (13.02%) 74.9 GB/s (25.89%) limited by memory 25.89% algo=gemm
11 vgg bwd-data b=64 k=3 p=1 s=1 in=3 out=64 g=1 D=224 466.0 GFlops ( 6.28%) 36.1 GB/s (12.49%) limited by memory 12.49% algo=gemm
11 vgg bwd-filt b=64 k=3 p=1 s=1 in=3 out=64 g=1 D=224 324.0 GFlops ( 4.37%) 25.1 GB/s ( 8.69%) limited by memory 8.69% algo=winograd
12 vgg forward b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=224 5419.6 GFlops (73.07%) 37.6 GB/s (13.01%) limited by gflops 73.07% algo=winograd
12 vgg bwd-data b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=224 2456.4 GFlops (33.12%) 17.1 GB/s ( 5.90%) limited by gflops 33.12% algo=winograd
12 vgg bwd-filt b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=224 3424.1 GFlops (46.17%) 23.8 GB/s ( 8.22%) limited by gflops 46.17% algo=winograd
13 vgg forward b=64 k=3 p=1 s=1 in=512 out=512 g=1 D=28 6617.7 GFlops (89.22%) 6.0 GB/s ( 2.08%) limited by gflops 89.22% algo=winograd
13 vgg bwd-data b=64 k=3 p=1 s=1 in=512 out=512 g=1 D=28 6938.4 GFlops (93.55%) 6.3 GB/s ( 2.18%) limited by gflops 93.55% algo=winograd
13 vgg bwd-filt b=64 k=3 p=1 s=1 in=512 out=512 g=1 D=28 5956.2 GFlops (80.30%) 5.6 GB/s ( 1.95%) limited by gflops 80.30% algo=winograd
14 mobile forward b=64 k=3 p=1 s=2 in=3 out=32 g=1 D=224 1184.4 GFlops (15.97%) 120.6 GB/s (41.70%) limited by memory 41.70% algo=gemm
14 mobile bwd-data b=64 k=3 p=1 s=2 in=3 out=32 g=1 D=224 371.9 GFlops ( 5.01%) 37.9 GB/s (13.09%) limited by memory 13.09% algo=gemm
14 mobile bwd-filt b=64 k=3 p=1 s=2 in=3 out=32 g=1 D=224 172.9 GFlops ( 2.33%) 17.6 GB/s ( 6.09%) limited by memory 6.09% algo=gemm
15 mobile forward b=64 k=3 p=1 s=1 in=144 out=144 g=144 D=56 281.3 GFlops ( 3.79%) 125.0 GB/s (43.22%) limited by memory 43.22% algo=depthwise_separable
15 mobile bwd-data b=64 k=3 p=1 s=1 in=144 out=144 g=144 D=56 50.3 GFlops ( 0.68%) 22.4 GB/s ( 7.73%) limited by memory 7.73% algo=depthwise_separable
15 mobile bwd-filt b=64 k=3 p=1 s=1 in=144 out=144 g=144 D=56 77.0 GFlops ( 1.04%) 34.2 GB/s (11.83%) limited by memory 11.83% algo=depthwise_separable
16 mobile forward b=64 k=3 p=1 s=2 in=144 out=144 g=144 D=56 13.9 GFlops ( 0.19%) 15.4 GB/s ( 5.33%) limited by memory 5.33% algo=gemm
16 mobile bwd-data b=64 k=3 p=1 s=2 in=144 out=144 g=144 D=56 12.5 GFlops ( 0.17%) 13.9 GB/s ( 4.82%) limited by memory 4.82% algo=gemm
16 mobile bwd-filt b=64 k=3 p=1 s=2 in=144 out=144 g=144 D=56 24.9 GFlops ( 0.34%) 27.6 GB/s ( 9.55%) limited by memory 9.55% algo=gemm
17 mobile forward b=64 k=1 p=0 s=1 in=144 out=24 g=1 D=56 884.8 GFlops (11.93%) 86.0 GB/s (29.74%) limited by memory 29.74% algo=gemm
17 mobile bwd-data b=64 k=1 p=0 s=1 in=144 out=24 g=1 D=56 594.4 GFlops ( 8.01%) 57.8 GB/s (19.98%) limited by memory 19.98% algo=gemm
17 mobile bwd-filt b=64 k=1 p=0 s=1 in=144 out=24 g=1 D=56 861.2 GFlops (11.61%) 83.7 GB/s (28.95%) limited by memory 28.95% algo=gemm
18 mobile forward b=64 k=1 p=0 s=1 in=24 out=144 g=1 D=56 969.1 GFlops (13.07%) 94.2 GB/s (32.57%) limited by memory 32.57% algo=gemm
18 mobile bwd-data b=64 k=1 p=0 s=1 in=24 out=144 g=1 D=56 833.7 GFlops (11.24%) 81.1 GB/s (28.02%) limited by memory 28.02% algo=gemm
18 mobile bwd-filt b=64 k=1 p=0 s=1 in=24 out=144 g=1 D=56 867.8 GFlops (11.70%) 84.4 GB/s (29.17%) limited by memory 29.17% algo=gemm
19 mobile forward b=64 k=1 p=0 s=1 in=960 out=160 g=1 D=7 2012.2 GFlops (27.13%) 30.6 GB/s (10.59%) limited by gflops 27.13% algo=gemm
19 mobile bwd-data b=64 k=1 p=0 s=1 in=960 out=160 g=1 D=7 933.9 GFlops (12.59%) 14.2 GB/s ( 4.91%) limited by gflops 12.59% algo=gemm
19 mobile bwd-filt b=64 k=1 p=0 s=1 in=960 out=160 g=1 D=7 1677.5 GFlops (22.62%) 26.6 GB/s ( 9.20%) limited by gflops 22.62% algo=gemm
20 mobile forward b=64 k=1 p=0 s=1 in=960 out=320 g=1 D=7 747.6 GFlops (10.08%) 6.7 GB/s ( 2.32%) limited by gflops 10.08% algo=gemm
20 mobile bwd-data b=64 k=1 p=0 s=1 in=960 out=320 g=1 D=7 952.2 GFlops (12.84%) 8.5 GB/s ( 2.95%) limited by gflops 12.84% algo=gemm
20 mobile bwd-filt b=64 k=1 p=0 s=1 in=960 out=320 g=1 D=7 601.9 GFlops ( 8.11%) 5.8 GB/s ( 2.00%) limited by gflops 8.11% algo=gemm
21 mobile forward b=64 k=3 p=1 s=1 in=960 out=960 g=960 D=7 258.7 GFlops ( 3.49%) 115.1 GB/s (39.81%) limited by memory 39.81% algo=depthwise_separable
21 mobile bwd-data b=64 k=3 p=1 s=1 in=960 out=960 g=960 D=7 46.2 GFlops ( 0.62%) 20.5 GB/s ( 7.10%) limited by memory 7.10% algo=depthwise_separable
21 mobile bwd-filt b=64 k=3 p=1 s=1 in=960 out=960 g=960 D=7 43.5 GFlops ( 0.59%) 19.4 GB/s ( 6.70%) limited by memory 6.70% algo=depthwise_separable
22 scale forward b=64 k=1 p=0 s=1 in=256 out=256 g=256 D=56 47.5 GFlops ( 0.64%) 190.0 GB/s (65.70%) limited by memory 65.70% algo=depthwise_separable
22 scale bwd-data b=64 k=1 p=0 s=1 in=256 out=256 g=256 D=56 25.1 GFlops ( 0.34%) 100.5 GB/s (34.75%) limited by memory 34.75% algo=depthwise_separable
22 scale bwd-filt b=64 k=1 p=0 s=1 in=256 out=256 g=256 D=56 46.9 GFlops ( 0.63%) 187.6 GB/s (64.86%) limited by memory 64.86% algo=depthwise_separable
23 scale forward b=64 k=1 p=0 s=1 in=1024 out=1024 g=1024 D=7 42.0 GFlops ( 0.57%) 168.2 GB/s (58.14%) limited by memory 58.14% algo=depthwise_separable
23 scale bwd-data b=64 k=1 p=0 s=1 in=1024 out=1024 g=1024 D=7 24.3 GFlops ( 0.33%) 97.2 GB/s (33.60%) limited by memory 33.60% algo=depthwise_separable
23 scale bwd-filt b=64 k=1 p=0 s=1 in=1024 out=1024 g=1024 D=7 8.8 GFlops ( 0.12%) 35.3 GB/s (12.19%) limited by memory 12.19% algo=depthwise_separable
Cool, now a few comments:
If you want Python support you need to install boost python3 and boost numpy3 and rebuild, so you can train from Python.
Also, please run ./dlprim_bench 1:0 4
to see how well it is optimized for RDNA 1.
In order to make it recognize boost python and numpy, I had to slightly edit CMakeLists.txt (lines 64-71):
find_package(PythonLibs 3)
find_package(Boost COMPONENTS python numpy)
if(PYTHONLIBS_FOUND AND Boost_NUMPY_FOUND AND Boost_PYTHON_FOUND)
set(BUILD_PYDLPRIM TRUE)
else()
set(BUILD_PYDLPRIM FALSE)
endif()
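As a generic CMake tip (not specific to this project), when the find_package() logic changes it can help to clear the cached results before reconfiguring:
cd build
rm -f CMakeCache.txt && rm -rf CMakeFiles/
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo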
Build output seems just fine now:
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo
-- HDF5: Using hdf5 compiler wrapper to determine C configuration
=== Status ===
OpenCL: include /usr/include
lib /usr/lib/x86_64-linux-gnu/libOpenCL.so
Python: /usr/bin/python3
BLAS: include /usr/include/x86_64-linux-gnu
lib /usr/lib/x86_64-linux-gnu/libopenblas.so
HDF5: include /usr/include/hdf5/serial
lib /usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5.so;/usr/lib/x86_64-linux-gnu/libpthread.so;/usr/lib/x86_64-linux-gnu/libsz.so;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so hdf5_cpp
Python dlprim: enabled
Python: lib /usr/lib/x86_64-linux-gnu/libpython3.8.so
include /usr/include/python3.8
Boost: include /usr/include
boost_numpy3
boost_python3
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/88D86BFED86BE940/Projects/opensource/dlprimitives/build
find_package(Boost COMPONENTS python numpy)
The problem with that is that it finds boost python and numpy for Python 2 instead of the Python 3 that I expect at runtime.
See:
$ dpkg -L libboost-numpy1.65-dev | grep .so
/usr/lib/x86_64-linux-gnu/libboost_numpy-py27.so
/usr/lib/x86_64-linux-gnu/libboost_numpy.so
/usr/lib/x86_64-linux-gnu/libboost_numpy3-py36.so
/usr/lib/x86_64-linux-gnu/libboost_numpy3.so
Please check your installation of boost python/numpy and numpy in general under python3/pip3
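For example (paths assume Debian/Ubuntu multiarch; adjust as needed):
ls /usr/lib/x86_64-linux-gnu/libboost_python*.so /usr/lib/x86_64-linux-gnu/libboost_numpy*.so
python3 -c 'import numpy; print(numpy.__version__)'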
My output is:
/usr/lib/x86_64-linux-gnu/libboost_numpy38.so
I don't think it is loading python2 deps, at least from the build output:
Boost: include /usr/include
boost_numpy3
boost_python3
Furthermore, I do not have any boost python2 related libs installed.
dpkg -L libboost-python1.71-dev | grep .so
/usr/lib/x86_64-linux-gnu/libboost_python38.so
And does it find boost_numpy3? What is the content of the relevant parts of the cache file: grep Boost CMakeCache.txt
Can you please check the latest changeset 3544774f06cde2c?
I added several search strategies for the appropriate boost python/numpy.
Can you please check the latest changeset 3544774? I added several search strategies for the appropriate boost python/numpy.
Have you checked if this change solved the problem?
And does it find boost_numpy3? What is the content of the relevant parts of the cache file: grep Boost CMakeCache.txt
grep Boost CMakeCache.txt
//The directory containing a CMake configuration file for Boost.
Boost_DIR:PATH=/usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0
Boost_INCLUDE_DIR:PATH=/usr/include
Boost_NUMPY_LIBRARY_RELEASE:STRING=/usr/lib/x86_64-linux-gnu/libboost_numpy38.so.1.71.0
Boost_PYTHON_LIBRARY_RELEASE:STRING=/usr/lib/x86_64-linux-gnu/libboost_python38.so.1.71.0
//ADVANCED property for variable: Boost_DIR
Boost_DIR-ADVANCED:INTERNAL=1
//Details about finding Boost
FIND_PACKAGE_MESSAGE_DETAILS_Boost:INTERNAL=[/usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake][cfound components: python numpy ][v1.71.0()]
Can you please check the latest changeset 3544774? I added several search strategies for the appropriate boost python/numpy.
Have you checked if this change solved the problem?
Trying it right now, will let you know soon.
Can you please check the latest changeset 3544774? I added several search strategies for the appropriate boost python/numpy.
Have you checked if this change solved the problem?
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo
-- HDF5: Using hdf5 compiler wrapper to determine C configuration
-- Could NOT find Boost: missing: python3 numpy3 (found /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found version "1.71.0"))
=== Status ===
OpenCL: include /usr/include
lib /usr/lib/x86_64-linux-gnu/libOpenCL.so
Python: /usr/bin/python3
BLAS: include /usr/include/x86_64-linux-gnu
lib /usr/lib/x86_64-linux-gnu/libopenblas.so
HDF5: include /usr/include/hdf5/serial
lib /usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5.so;/usr/lib/x86_64-linux-gnu/libpthread.so;/usr/lib/x86_64-linux-gnu/libsz.so;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so hdf5_cpp
Python dlprim: enabled
Python version: 38
Python: lib /usr/lib/x86_64-linux-gnu/libpython3.8.so
include /usr/include/python3.8
Boost: include /usr/include
boost_numpy Boost::numpy
boost_python Boost::python
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/88D86BFED86BE940/Projects/opensource/dlprimitives/build
Despite the fact that it says Could NOT find Boost: missing: python3 numpy3, it successfully builds the Python interface (when I run make):
[ 79%] Building CXX object CMakeFiles/pydlprim.dir/python/python_interface.cpp.o
[ 81%] Building CXX object CMakeFiles/test_net.dir/tests/test_net.cpp.o
[ 83%] Linking CXX executable test_random
[ 83%] Built target test_random
[ 84%] Linking CXX executable mnist
[ 86%] Linking CXX executable dlprim_flops
[ 86%] Built target dlprim_flops
[ 86%] Built target mnist
[ 88%] Linking CXX executable image_predict
[ 90%] Linking CXX executable train_mnist
[ 90%] Built target train_mnist
[ 90%] Built target image_predict
[ 92%] Linking CXX executable test_net
[ 94%] Linking CXX executable dlprim_benchmark
[ 94%] Built target test_net
[ 94%] Built target dlprim_benchmark
[ 96%] Linking CXX executable test_from_template
[ 98%] Linking CXX executable test_json
[ 98%] Built target test_from_template
[ 98%] Built target test_json
[100%] Linking CXX shared library python/dlprim/_pydlprim.so
[100%] Built target pydlprim
That is because I search for both boost_python3 and boost_python3x, so one of those lookups is expected to fail.
Closing, as it looks like the issue is resolved.
I am trying to follow the steps in BUILD.md, running Ubuntu 20.04, and I've installed the following deps (using apt) so far (might help someone else in the future):
Of course I have python3 preinstalled.
Now, when I issue
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo
in the build folder, I just get this output, but no build is run whatsoever. As you can see, it says it cannot find boost (python and numpy), but if I run
sudo apt install libboost-numpy-dev
or the python counterpart, it says I already have them installed. Any tips?
EDIT
Nevermind, I just had to run
sudo make install
afterwards. (IMHO BUILD.md should be updated.)
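For reference, the usual out-of-tree CMake sequence (as I understand BUILD.md; paths are just an example) is:
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo   # configure only, does not compile anything
make -j$(nproc)                              # actually build
sudo make install                            # install the library and tools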