artyom-beilis / dlprimitives

Deep Learning Primitives and Mini-Framework for OpenCL
http://blog.dlprimitives.org/
MIT License

Building dlprimitives #3

Closed masc-it closed 3 years ago

masc-it commented 3 years ago

I am trying to follow the steps in BUILD.md, running Ubuntu 20.04 and I've installed the following deps (using apt) thus far (might help someone else in the future):

Of course I have python3 preinstalled.

Now, when I issue cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo in the build folder, I just have this output, but no build is run whatsoever.

cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- HDF5: Using hdf5 compiler wrapper to determine C configuration
-- Found HDF5: /usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5.so;/usr/lib/x86_64-linux-gnu/libpthread.so;/usr/lib/x86_64-linux-gnu/libsz.so;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so (found version "1.10.4")  
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.8.so (found suitable version "3.8.10", minimum required is "3") 
-- Could NOT find Boost: missing: python3 numpy3 (found /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found version "1.71.0"))
=== Status ===
  OpenCL: include /usr/include
          lib     /usr/lib/x86_64-linux-gnu/libOpenCL.so
  Python: /usr/bin/python3
  BLAS: include /usr/include/x86_64-linux-gnu
        lib /usr/lib/x86_64-linux-gnu/libopenblas.so
  HDF5: include /usr/include/hdf5/serial
        lib  /usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5.so;/usr/lib/x86_64-linux-gnu/libpthread.so;/usr/lib/x86_64-linux-gnu/libsz.so;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so hdf5_cpp
  Python dlprim: disabled
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/88D86BFED86BE940/Projects/opensource/dlprimitives/build

As you can see it says it cannot find boost (python and numpy) but if I run sudo apt install libboost-numpy-dev or the python counterpart, it says I already have them installed.

Any tips?

EDIT

Never mind, I just had to run sudo make install afterwards. (IMHO BUILD.md should be updated.)

masc-it commented 3 years ago

So, build and install are successful, but tests fail:

./dlprim_benchmark 0:0 ../docs/nets_for_benchmark/resnet18-b16.js

Output:

Using: AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0) on Clover
Error:Failed to build program source sgemm with parameters -DTILE_SIZE_M=64 -DTILE_SIZE_N=64 -DBLOCK_SIZE_M=4 -DBLOCK_SIZE_N=4 -DTILE_SIZE_K=16 -DTILE_OFFSET=0 -DBIAS=0 -DATRANS=0 -DBTRANS=1 -DIM2COL_OCHAN=12544 -DCONVGEMM=1 -DKERN_H=7 -DKERN_W=7 -DDILATE_H=1 -DDILATE_W=1 -DPAD_H=3 -DPAD_W=3 -DSTRIDE_H=2 -DSTRIDE_W=2 -DGROUPS=1 -DCHANNELS_IN=3 -DSRC_COLS=224 -DSRC_ROWS=224 -DIMG_COLS=112 -DIMG_ROWS=112 -DREDUCE_K=1 -DACTIVATION=0 log:
For device: AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0)
fatal error: cannot open file '/usr/lib/clc/gfx1010-amdgcn-mesa-mesa3d.bc': No such file or directory

Some driver-related libraries seem to be missing, but I have already installed mesa...
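One way to narrow this down, sketched in Python: the error names a libclc bitcode file for gfx1010 under /usr/lib/clc, so listing which .bc targets are actually present shows whether the installed packages ship one for this GPU. The helper name and its default directory are illustrative assumptions taken from the error message; they are not part of dlprimitives or Mesa.

```python
from pathlib import Path


def available_clc_targets(clc_dir="/usr/lib/clc"):
    """Return the sorted names of .bc files in clc_dir (empty list if the directory is absent)."""
    directory = Path(clc_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("*.bc"))


# File name copied from the build log above.
wanted = "gfx1010-amdgcn-mesa-mesa3d.bc"
targets = available_clc_targets()
print(f"{wanted} present: {wanted in targets}")
print(f"{len(targets)} bitcode targets found")
```

If the file is reported as absent, the Clover stack on this machine has no bitcode library for that GPU target, which matches the build failure above.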

artyom-beilis commented 3 years ago

Hi,

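What is the output of clinfo --list? (If it is not installed: apt install clinfo.)

Which drivers are you using? You have 3 options:

  • amdgpu - with clover mesa (not 100% sure it supports RDNA - clover does not see my 6600xt)
  • rocm
  • amdgpu-pro (it is based on rocm as far as I know)

What do you use?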

artyom-beilis commented 3 years ago

I mean, I see you are using Clover. This may be the issue, because it clearly fails at the OpenCL driver level. I'd suggest trying either rocm or amdgpu-pro.

masc-it commented 3 years ago

Hi,

what is output of clinfo --list (if not installed - apt install clinfo)

What drivers are you using you have 3 options:

  • amdgpu - with clover mesa (not 100% sure it supports RDNA - clover does not see my 6600xt)
  • rocm
  • amdgpu-pro (it is based on rocm as far as I know)

What do you use?

Platform #0: Clover
 `-- Device #0: AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0)
Platform #1: AMD Accelerated Parallel Processing

Is it possible to install only opencl (rocm) without the rocm full-stack? I had bad issues with it in the past.

artyom-beilis commented 3 years ago

I think you should be OK with AMDGPU-pro versions.

I see you do have AMD Accelerated Parallel Processing, but it is not clear which one. AMDGPU-Pro comes in several versions:

ii  opencl-orca-amdgpu-pro-icd:amd64                            21.30-1286092                                       amd64        non-free AMD OpenCL ICD Loaders
ii  opencl-rocr-amdgpu-pro:amd64                                21.30-1286092                                       amd64        ROCr OpenCL Runtime

The first one is for older cards AFAIK (it runs my rx560m but not the 6600xt); the second one, opencl-rocr-amdgpu-pro, runs the 6600xt.

What opencl drivers have you installed?

Is it possible to install only opencl (rocm) without the rocm full-stack? I had bad issues with it in the past.

I don't really know, I had some issues with 6600xt with vulkan support but no issues with rx560.

masc-it commented 3 years ago

This is the full clinfo output:

Here too you can see the fatal error: cannot open file '/usr/lib/clc/gfx1010-amdgcn-mesa-mesa3d.bc': No such file or directory, the same one I got when running the benchmark.

Number of platforms                               2
  Platform Name                                   Clover
  Platform Vendor                                 Mesa
  Platform Version                                OpenCL 1.1 Mesa 21.2.1
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd
  Platform Extensions function suffix             MESA

  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.0 AMD-APP (3246.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback 
  Platform Extensions function suffix             AMD

  Platform Name                                   Clover
Number of devices                                 1
  Device Name                                     AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0)
  Device Vendor                                   AMD
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.1 Mesa 21.2.1
  Driver Version                                  21.2.1
  Device OpenCL C Version                         OpenCL C 1.1 
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Max compute units                               36
  Max clock frequency                             1780MHz
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
=== CL_PROGRAM_BUILD_LOG ===
fatal error: cannot open file '/usr/lib/clc/gfx1010-amdgcn-mesa-mesa3d.bc': No such file or directory
  Preferred work group size multiple              <getWGsizes:1200: create kernel : error -46>
  Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 2 / 2       
    half                                                 0 / 0        (n/a)
    float                                                4 / 4       
    double                                               2 / 2        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              6442450944 (6GiB)
  Error Correction support                        No
  Max memory allocation                           5153960755 (4.8GiB)
  Unified memory for Host and Device              No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       32768 bits (4096 bytes)
  Global Memory cache type                        None
  Image support                                   No
  Local memory type                               Local
  Local memory size                               32768 (32KiB)
  Max number of constant args                     16
  Max constant buffer size                        67108864 (64MiB)
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Profiling timer resolution                      0ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64 cl_khr_extended_versioning

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 1
  Device Name                                     gfx1010:xnack+
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 2.0 
  Driver Version                                  3246.0 (HSA1.1,LC)
  Device OpenCL C Version                         OpenCL C 2.0 
  Device Type                                     GPU
  Device Board Name (AMD)                         Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
  Device Topology (AMD)                           PCI-E, 0a:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               18
  SIMD per compute unit (AMD)                     4
  SIMD width (AMD)                                32
  SIMD instruction width (AMD)                    1
  Max clock frequency                             1780MHz
  Graphics IP (AMD)                               10.1
  Device Partition                                (core)
    Max number of sub-devices                     18
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             256
  Preferred work group size (AMD)                 256
  Max work group size (AMD)                       1024
  Preferred work group size multiple              32
  Wavefront width (AMD)                           32
  Preferred / native vector sizes                 
    char                                                 4 / 4       
    short                                                2 / 2       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 1 / 1        (cl_khr_fp16)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             No
    Round to nearest                              No
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              6425673728 (5.984GiB)
  Global free memory (AMD)                        6275072 (5.984GiB)
  Global memory channels (AMD)                    6
  Global memory banks per channel (AMD)           4
  Global memory bank width (AMD)                  256 bytes
  Error Correction support                        No
  Max memory allocation                           5461822664 (5.087GiB)
  Unified memory for Host and Device              No
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   Yes
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics                 
    SVM                                           0 bytes
    Global                                        0 bytes
    Local                                         0 bytes
  Max size for global variable                    5461822664 (5.087GiB)
  Preferred total size of global vars             6425673728 (5.984GiB)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        16384 (16KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             29471
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 8192 images
    Base address alignment for 2D image buffers   256 bytes
    Pitch alignment for 2D image buffers          256 pixels
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             16384x16384x8192 pixels
    Max number of read image args                 128
    Max number of write image args                8
    Max number of read/write image args           64
  Max number of pipe args                         16
  Max active pipe reservations                    16
  Max pipe packet size                            1166855368 (1.087GiB)
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Local memory syze per CU (AMD)                  65536 (64KiB)
  Local memory banks (AMD)                        32
  Max number of constant args                     8
  Max constant buffer size                        5461822664 (5.087GiB)
  Preferred constant buffer size (AMD)            16384 (16KiB)
  Max size of kernel argument                     1024
  Queue properties (on host)                      
    Out-of-order execution                        No
    Profiling                                     Yes
  Queue properties (on device)                    
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                262144 (256KiB)
    Max size                                      8388608 (8MiB)
  Max queues on device                            1
  Max events on device                            1024
  Prefer user sync for interop                    Yes
  Number of P2P devices (AMD)                     0
  P2P devices (AMD)                               <printDeviceInfo:147: get number of CL_DEVICE_P2P_DEVICES_AMD : error -30>
  Profiling timer resolution                      1ns
  Profiling timer offset since Epoch (AMD)        0ns (Thu Jan  1 01:00:00 1970)
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Thread trace supported (AMD)                  No
    Number of async queues (AMD)                  8
    Max real-time compute queues (AMD)            8
    Max real-time compute units (AMD)             18
  printf() buffer size                            4194304 (4MiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [MESA]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Clover
    Device Name                                   AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0)
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Clover
    Device Name                                   AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0)
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Clover
    Device Name                                   AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0)
artyom-beilis commented 3 years ago

Now run the benchmark on platform 1, device 0:

./dlprim_benchmark 1:0 ../docs/nets_for_benchmark/resnet18-b16.js
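
For readers following along: the first argument is a platform:device pair, so 1:0 selects device 0 of platform 1. A minimal Python sketch of that mapping, using the platform and device names from the clinfo output above. resolve_selector is a hypothetical helper for illustration, not dlprimitives' own parsing code.

```python
# Platforms and their devices, in the order clinfo reported them in this thread.
PLATFORMS = [
    ("Clover",
     ["AMD Radeon RX 5600 XT (NAVI10, DRM 3.40.0, 5.11.0-34-generic, LLVM 12.0.0)"]),
    ("AMD Accelerated Parallel Processing",
     ["gfx1010:xnack+"]),
]


def resolve_selector(selector, platforms=PLATFORMS):
    """Map a 'P:D' string to (platform name, device name)."""
    platform_text, device_text = selector.split(":")
    platform_index, device_index = int(platform_text), int(device_text)
    platform_name, devices = platforms[platform_index]
    return platform_name, devices[device_index]


for selector in ("0:0", "1:0"):
    platform, device = resolve_selector(selector)
    print(f"{selector} -> {device} on {platform}")
```

Selecting 0:0 lands on the Clover device that failed to build the kernels, while 1:0 lands on the AMD platform.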
masc-it commented 3 years ago

Great, it seems to be working.

Just a few notes from my machine:

Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5     25.046
Step -4     22.358
Step -3     22.521
Step -2     22.393
Step -1     22.439
Step  0     22.407
Step  1     22.475
Step  2     22.391
Step  3     22.311
Step  4     22.256
Step  5     22.360
Step  6     22.405
Step  7     22.298
Step  8     22.450
Step  9     22.476
Step 10     22.367
Step 11     22.324
Step 12     22.383
Step 13     22.364
Step 14     22.261
Step 15     22.370
Step 16     22.477
Step 17     22.300
Step 18     22.372
Step 19     22.394
Time per sample: 1.398 ms
TOT time per batch:  22.372 ms
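
The two summary lines appear to be related by the batch size: the reported time per sample is consistent with the total time per batch divided by the 16 samples in the batch. A small Python sketch of that arithmetic, plus the implied throughput; the function names are illustrative and the relationship is inferred from the printed numbers, not taken from the dlprimitives source.

```python
def per_sample_ms(batch_ms, batch_size):
    """Average time per sample in milliseconds."""
    return batch_ms / batch_size


def images_per_second(batch_ms, batch_size):
    """Throughput implied by one batch taking batch_ms milliseconds."""
    return batch_size * 1000.0 / batch_ms


# Values copied from the resnet18 log above: batch of 16, 22.372 ms per batch.
batch_ms, batch_size = 22.372, 16
print(f"{per_sample_ms(batch_ms, batch_size):.3f} ms per sample")
print(f"{images_per_second(batch_ms, batch_size):.1f} images/s")
```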

resnet50-b16 benchmark:

./dlprim_benchmark 1:0 ../docs/nets_for_benchmark/resnet50-b16.js
Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5    106.212
Step -4    105.018
Step -3    104.995
Step -2    105.014
Step -1    104.974
Step  0    105.007
Step  1    105.024
Step  2    105.011
Step  3    104.891
Step  4    104.869
Step  5    104.952
Step  6    104.965
Step  7    105.018
Step  8    105.013
Step  9    104.935
Step 10    105.007
Step 11    104.943
Step 12    104.939
Step 13    105.009
Step 14    104.806
Step 15    104.904
Step 16    104.921
Step 17    104.814
Step 18    105.209
Step 19    104.840
Time per sample: 6.560 ms
TOT time per batch:  104.954 ms

vgg16-b16 benchmark:

./dlprim_benchmark 1:0 ../docs/nets_for_benchmark/vgg16-b16.js
Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5     86.850
Step -4     85.730
Step -3     85.740
Step -2     86.006
Step -1     85.971
Step  0     86.006
Step  1     86.106
Step  2     86.018
Step  3     85.990
Step  4     86.043
Step  5     85.939
Step  6     86.034
Step  7     86.304
Step  8     86.118
Step  9     86.011
Step 10     85.971
Step 11     86.081
Step 12     86.041
Step 13     86.051
Step 14     86.255
Step 15     86.141
Step 16     86.113
Step 17     85.986
Step 18     85.970
Step 19     86.082
Time per sample: 5.379 ms
TOT time per batch:  86.063 ms
artyom-beilis commented 3 years ago

Cool, now a few comments.

If you want Python support, you need to install Boost python3 and Boost numpy3 and rebuild, so you can train from Python.

Also please run ./dlprim_bench 1:0 4

to see how well it is optimized for RDNA 1.

artyom-beilis commented 3 years ago

You can also add the -b flag to measure training times.

masc-it commented 3 years ago

Will do soon. I have updated the previous comment with other benchmarks.

masc-it commented 3 years ago

Cool now few comments

If you want python support you need to install boost python3 and boost numpy3 and rebuild it so you can train from python

Also please run ./dlprim_bench 1:0 4

To see how well is it optimized for RDNA 1.

./dlprim_benchmark 1:0 4

Error:Failed to load json from 4, syntax error at line 1
masc-it commented 3 years ago

./dlprim_benchmark -b 1:0 ../docs/nets_for_benchmark/vgg16-b16.json

Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5    353.770    86.045   267.725
Step -4    352.110    85.369   266.741
Step -3    352.927    85.225   267.702
Step -2    352.906    85.415   267.491
Step -1    350.841    85.374   265.468
Step  0    351.326    85.341   265.985
Step  1    352.796    85.370   267.426
Step  2    352.049    85.484   266.565
Step  3    352.823    85.602   267.221
Step  4    352.146    85.302   266.844
Step  5    351.188    85.428   265.760
Step  6    351.982    85.418   266.564
Step  7    352.618    85.845   266.773
Step  8    352.330    85.577   266.753
Step  9    352.272    85.588   266.684
Step 10    351.079    85.550   265.529
Step 11    352.096    85.725   266.371
Step 12    351.580    85.841   265.740
Step 13    352.896    85.731   267.165
Step 14    353.416    85.628   267.787
Step 15    353.156    85.588   267.568
Step 16    353.639    85.559   268.080
Step 17    352.277    85.589   266.688
Step 18    354.547    85.590   268.956
Step 19    353.859    85.491   268.367
Time per sample: 22.031 ms
FWD time per batch:  85.562 ms
BWD time per batch:  266.941 ms
TOT time per batch:  352.504 ms

./dlprim_benchmark -b 1:0 ../docs/nets_for_benchmark/resnet50-b16.json

Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5    379.391   125.199   254.192
Step -4    378.747   124.355   254.392
Step -3    377.416   124.103   253.313
Step -2    376.796   124.041   252.754
Step -1    378.070   124.030   254.040
Step  0    378.771   124.463   254.308
Step  1    378.742   124.271   254.471
Step  2    378.577   124.467   254.110
Step  3    377.015   124.178   252.837
Step  4    378.414   123.964   254.450
Step  5    378.971   124.217   254.754
Step  6    378.131   124.310   253.821
Step  7    379.306   124.186   255.121
Step  8    378.495   124.412   254.084
Step  9    378.294   124.175   254.119
Step 10    378.118   124.102   254.016
Step 11    379.100   124.544   254.555
Step 12    379.154   124.618   254.537
Step 13    378.063   124.044   254.019
Step 14    378.734   124.403   254.331
Step 15    378.873   124.323   254.550
Step 16    378.347   124.337   254.010
Step 17    379.239   124.589   254.650
Step 18    377.577   124.037   253.540
Step 19    378.980   124.305   254.675
Time per sample: 23.659 ms
FWD time per batch:  124.297 ms
BWD time per batch:  254.248 ms
TOT time per batch:  378.545 ms

32 batch, ./dlprim_benchmark -b 1:0 ../docs/nets_for_benchmark/resnet50-b32.json

Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (32,3,224,224)
Outputs
- loss: (32,1000)
Step -5    681.751   228.653   453.098
Step -4    679.952   226.915   453.037
Step -3    680.563   226.570   453.993
Step -2    678.810   226.432   452.379
Step -1    679.840   226.961   452.879
Step  0    682.178   226.847   455.331
Step  1    684.913   228.449   456.463
Step  2    680.624   226.812   453.811
Step  3    677.997   226.559   451.438
Step  4    679.704   226.519   453.185
Step  5    679.970   227.248   452.722
Step  6    680.107   226.908   453.199
Step  7    682.091   227.355   454.736
Step  8    681.425   226.812   454.613
Step  9    682.112   227.335   454.777
Step 10    680.621   227.558   453.062
Step 11    681.201   227.417   453.785
Step 12    681.783   227.400   454.384
Step 13    683.397   228.715   454.682
Step 14    681.637   227.780   453.857
Step 15    681.690   227.235   454.455
Step 16    685.070   228.631   456.439
Step 17    681.405   227.206   454.199
Step 18    679.156   226.926   452.230
Step 19    681.386   227.041   454.345
Time per sample: 21.294 ms
FWD time per batch:  227.338 ms
BWD time per batch:  454.086 ms
TOT time per batch:  681.423 ms

./dlprim_benchmark -b 1:0 ../docs/nets_for_benchmark/resnet18-b16.json

Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5     83.440    26.896    56.544
Step -4     81.696    25.063    56.633
Step -3     81.949    25.267    56.682
Step -2     81.829    25.062    56.767
Step -1     81.736    25.103    56.633
Step  0     81.865    25.095    56.770
Step  1     81.740    25.068    56.671
Step  2     81.858    25.112    56.745
Step  3     81.969    25.150    56.819
Step  4     81.859    25.122    56.736
Step  5     81.890    25.081    56.809
Step  6     82.018    25.195    56.823
Step  7     81.895    25.089    56.806
Step  8     81.932    25.187    56.745
Step  9     81.817    25.098    56.719
Step 10     81.853    25.117    56.736
Step 11     81.861    25.147    56.715
Step 12     82.163    25.303    56.860
Step 13     81.848    25.108    56.741
Step 14     81.933    25.162    56.771
Step 15     81.901    25.153    56.748
Step 16     81.852    25.158    56.693
Step 17     81.819    25.082    56.737
Step 18     82.170    25.309    56.861
Step 19     81.808    25.138    56.670
Time per sample: 5.119 ms
FWD time per batch:  25.144 ms
BWD time per batch:  56.759 ms
TOT time per batch:  81.902 ms

./dlprim_benchmark -b 1:0 ../docs/nets_for_benchmark/mobilenet_v2-b16.json

Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (16,3,224,224)
Outputs
- loss: (16,1000)
Step -5    131.939    38.944    92.995
Step -4    129.773    37.242    92.531
Step -3    129.645    37.257    92.389
Step -2    129.004    37.209    91.794
Step -1    129.638    37.212    92.426
Step  0    129.607    37.224    92.383
Step  1    129.527    37.204    92.323
Step  2    129.705    37.244    92.461
Step  3    129.596    37.211    92.385
Step  4    129.936    37.212    92.724
Step  5    129.977    37.213    92.764
Step  6    129.850    37.282    92.568
Step  7    129.749    37.216    92.533
Step  8    130.371    37.281    93.090
Step  9    129.819    37.139    92.680
Step 10    129.452    37.140    92.312
Step 11    130.007    37.252    92.755
Step 12    129.380    37.209    92.171
Step 13    129.932    37.180    92.752
Step 14    129.602    37.217    92.385
Step 15    129.497    37.066    92.431
Step 16    129.894    37.230    92.664
Step 17    129.765    37.156    92.609
Step 18    129.481    37.159    92.322
Step 19    129.886    37.231    92.656
Time per sample: 8.109 ms
FWD time per batch:  37.203 ms
BWD time per batch:  92.548 ms
TOT time per batch:  129.752 ms

32 batch size, ./dlprim_benchmark -b 1:0 ../docs/nets_for_benchmark/mobilenet_v2-b32.json

Using: gfx1010:xnack+ on AMD Accelerated Parallel Processing
Inputs
- data: (32,3,224,224)
Outputs
- loss: (32,1000)
Step -5    243.190    70.505   172.684
Step -4    241.889    69.351   172.537
Step -3    242.756    69.472   173.284
Step -2    241.985    69.558   172.426
Step -1    242.055    69.562   172.493
Step  0    242.479    69.397   173.082
Step  1    242.188    69.285   172.903
Step  2    242.308    69.254   173.053
Step  3    242.153    69.188   172.965
Step  4    242.305    69.187   173.118
Step  5    242.162    69.397   172.765
Step  6    241.815    69.275   172.540
Step  7    241.813    69.420   172.393
Step  8    242.101    69.418   172.683
Step  9    242.001    69.285   172.716
Step 10    242.652    69.407   173.244
Step 11    241.795    69.295   172.499
Step 12    242.038    69.334   172.704
Step 13    242.114    69.348   172.766
Step 14    242.507    69.528   172.980
Step 15    241.982    69.531   172.451
Step 16    242.181    69.355   172.827
Step 17    241.866    69.325   172.541
Step 18    242.328    69.446   172.882
Step 19    242.225    69.539   172.686
Time per sample: 7.567 ms
FWD time per batch:  69.361 ms
BWD time per batch:  172.790 ms
TOT time per batch:  242.151 ms
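
Judging by the summary lines, the three columns in the -b logs are total, forward and backward time per step, with the total being the sum of the other two. A Python sketch that checks this on the reported averages and derives the backward/forward ratio and the per-sample time for each run; the numbers are copied from the logs above, while the column interpretation and the helper name are assumptions.

```python
# (network, batch size, FWD ms, BWD ms, TOT ms) from the summary lines above.
RUNS = [
    ("vgg16", 16, 85.562, 266.941, 352.504),
    ("resnet50", 16, 124.297, 254.248, 378.545),
    ("resnet50", 32, 227.338, 454.086, 681.423),
    ("resnet18", 16, 25.144, 56.759, 81.902),
    ("mobilenet_v2", 16, 37.203, 92.548, 129.752),
    ("mobilenet_v2", 32, 69.361, 172.790, 242.151),
]


def summarize(run):
    """Derive simple ratios from one benchmark summary."""
    name, batch, fwd, bwd, tot = run
    return {
        "name": name,
        "batch": batch,
        # Allow for rounding in the printed values.
        "sum_matches": abs((fwd + bwd) - tot) < 0.01,
        "bwd_over_fwd": bwd / fwd,
        "ms_per_sample": tot / batch,
    }


for run in RUNS:
    s = summarize(run)
    print(f"{s['name']:>12} b{s['batch']:<2} "
          f"BWD/FWD={s['bwd_over_fwd']:.2f} "
          f"{s['ms_per_sample']:.3f} ms/sample sum_ok={s['sum_matches']}")
```

Comparing the ms/sample values for the same network at batch 16 and batch 32 shows how much the larger batch helps on this card.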
artyom-beilis commented 3 years ago

Cool now few comments

If you want python support you need to install boost python3 and boost numpy3 and rebuild it so you can train from python

Also please run ./dlprim_bench 1:0 4

To see how well is it optimized for RDNA 1.

./dlprim_benchmark 1:0 4

Error:Failed to load json from 4, syntax error at line 1

My bad, I meant: ./dlprim_flops 1:0 4

artyom-beilis commented 3 years ago

BTW, I put the documentation online. It is not complete yet, but it is already useful:

http://dlprimitives.org/docs/

masc-it commented 3 years ago

Testing on gfx1010:xnack+ on AMD Accelerated Parallel Processing
Testing memory speed
- Vector size 1
-- Warming 
-- Running   279.363 GB/s
- Vector size 2
-- Warming 
-- Running   293.186 GB/s
- Vector size 4
-- Warming 
-- Running   291.768 GB/s
- Vector size 8
-- Warming 
-- Running   295.808 GB/s
- Vector size 16
-- Warming 
-- Running   289.269 GB/s
Testing flops float
- Vector size 1
-- Warming 
-- Running   7300.63 GFlops
- Vector size 2
-- Warming 
-- Running   7417.01 GFlops
- Vector size 4
-- Warming 
-- Running   7376.02 GFlops
- Vector size 8
-- Warming 
-- Running   7317.28 GFlops
- Vector size 16
-- Warming 
-- Running   7292.22 GFlops
Testing flops half
- Vector size 1
-- Warming 
-- Running   7354.08 GFlops
- Vector size 2
-- Warming 
-- Running   14427.6 GFlops
- Vector size 4
-- Warming 
-- Running   14234.9 GFlops
- Vector size 8
-- Warming 
-- Running   14159.9 GFlops
- Vector size 16
-- Warming 
-- Running   14278.7 GFlops
Summray for gfx1010:xnack+ on AMD Accelerated Parallel Processing
Peak GFlops for float 7417.01
Peak GFlops for half 14427.6
Peak memory 295.808 GB/s
GEMM
  NN  0:  512,  512,  512     1974.2 GFlops (26.62%)     23.2 GB/s ( 8.01%) limited by gflops 26.62%
  NN  1: 1024, 1024, 1024     3306.0 GFlops (44.57%)     19.4 GB/s ( 6.70%) limited by gflops 44.57%
  NN  2: 1025, 1025, 1025     2739.2 GFlops (36.93%)     16.0 GB/s ( 5.55%) limited by gflops 36.93%
  NN  3: 2048, 2048, 2048     3749.6 GFlops (50.55%)     11.0 GB/s ( 3.80%) limited by gflops 50.55%
  NN  4: 2049, 2049, 2049     3379.1 GFlops (45.56%)      9.9 GB/s ( 3.42%) limited by gflops 45.56%
  NN  5:   64, 2048,   64      896.4 GFlops (12.09%)     57.4 GB/s (19.83%) limited by memory 19.83%
  NN  6: 2048,   64, 2048     1591.9 GFlops (21.46%)     52.9 GB/s (18.28%) limited by gflops 21.46%
  NN  7: 2048, 2048,   64     1851.3 GFlops (24.96%)     62.0 GB/s (21.42%) limited by gflops 24.96%
  NN  8: 2048,   64,   64      864.7 GFlops (11.66%)     55.3 GB/s (19.12%) limited by memory 19.12%
  NN  9:   64, 2048, 2048     2236.7 GFlops (30.16%)     74.3 GB/s (25.68%) limited by gflops 30.16%
  NN 10:   64,   64, 2048      483.4 GFlops ( 6.52%)     30.7 GB/s (10.61%) limited by memory 10.61%
  NT  0:  512,  512,  512     1353.0 GFlops (18.24%)     15.9 GB/s ( 5.49%) limited by gflops 18.24%
  NT  1: 1024, 1024, 1024     2149.1 GFlops (28.97%)     12.6 GB/s ( 4.36%) limited by gflops 28.97%
  NT  2: 1025, 1025, 1025     2443.7 GFlops (32.95%)     14.3 GB/s ( 4.95%) limited by gflops 32.95%
  NT  3: 2048, 2048, 2048     3312.4 GFlops (44.66%)      9.7 GB/s ( 3.36%) limited by gflops 44.66%
  NT  4: 2049, 2049, 2049     3380.5 GFlops (45.58%)      9.9 GB/s ( 3.42%) limited by gflops 45.58%
  NT  5:   64, 2048,   64      885.1 GFlops (11.93%)     56.6 GB/s (19.58%) limited by memory 19.58%
  NT  6: 2048,   64, 2048     1431.0 GFlops (19.29%)     47.5 GB/s (16.43%) limited by gflops 19.29%
  NT  7: 2048, 2048,   64     1831.0 GFlops (24.69%)     61.3 GB/s (21.18%) limited by gflops 24.69%
  NT  8: 2048,   64,   64      873.2 GFlops (11.77%)     55.9 GB/s (19.31%) limited by memory 19.31%
  NT  9:   64, 2048, 2048     1672.4 GFlops (22.55%)     55.5 GB/s (19.20%) limited by gflops 22.55%
  NT 10:   64,   64, 2048      467.7 GFlops ( 6.31%)     29.7 GB/s (10.27%) limited by memory 10.27%
  TN  0:  512,  512,  512     2108.0 GFlops (28.42%)     24.7 GB/s ( 8.55%) limited by gflops 28.42%
  TN  1: 1024, 1024, 1024     3394.5 GFlops (45.77%)     19.9 GB/s ( 6.88%) limited by gflops 45.77%
  TN  2: 1025, 1025, 1025     2666.2 GFlops (35.95%)     15.6 GB/s ( 5.40%) limited by gflops 35.95%
  TN  3: 2048, 2048, 2048     3821.7 GFlops (51.53%)     11.2 GB/s ( 3.87%) limited by gflops 51.53%
  TN  4: 2049, 2049, 2049     3381.2 GFlops (45.59%)      9.9 GB/s ( 3.42%) limited by gflops 45.59%
  TN  5:   64, 2048,   64      846.8 GFlops (11.42%)     54.2 GB/s (18.73%) limited by memory 18.73%
  TN  6: 2048,   64, 2048     2182.0 GFlops (29.42%)     72.5 GB/s (25.05%) limited by gflops 29.42%
  TN  7: 2048, 2048,   64     1886.7 GFlops (25.44%)     63.1 GB/s (21.83%) limited by gflops 25.44%
  TN  8: 2048,   64,   64      893.8 GFlops (12.05%)     57.2 GB/s (19.77%) limited by memory 19.77%
  TN  9:   64, 2048, 2048     2274.6 GFlops (30.67%)     75.5 GB/s (26.11%) limited by gflops 30.67%
  TN 10:   64,   64, 2048      468.2 GFlops ( 6.31%)     29.7 GB/s (10.28%) limited by memory 10.28%
  TT  0:  512,  512,  512     1764.9 GFlops (23.79%)     20.7 GB/s ( 7.16%) limited by gflops 23.79%
  TT  1: 1024, 1024, 1024     3318.9 GFlops (44.75%)     19.5 GB/s ( 6.73%) limited by gflops 44.75%
  TT  2: 1025, 1025, 1025     2583.8 GFlops (34.84%)     15.1 GB/s ( 5.23%) limited by gflops 34.84%
  TT  3: 2048, 2048, 2048     3683.3 GFlops (49.66%)     10.8 GB/s ( 3.73%) limited by gflops 49.66%
  TT  4: 2049, 2049, 2049     3377.4 GFlops (45.54%)      9.9 GB/s ( 3.42%) limited by gflops 45.54%
  TT  5:   64, 2048,   64      891.8 GFlops (12.02%)     57.1 GB/s (19.72%) limited by memory 19.72%
  TT  6: 2048,   64, 2048     2177.6 GFlops (29.36%)     72.3 GB/s (25.00%) limited by gflops 29.36%
  TT  7: 2048, 2048,   64     1877.0 GFlops (25.31%)     62.8 GB/s (21.71%) limited by gflops 25.31%
  TT  8: 2048,   64,   64      892.6 GFlops (12.04%)     57.1 GB/s (19.74%) limited by memory 19.74%
  TT  9:   64, 2048, 2048     1823.2 GFlops (24.58%)     60.6 GB/s (20.93%) limited by gflops 24.58%
  TT 10:   64,   64, 2048      474.2 GFlops ( 6.39%)     30.1 GB/s (10.41%) limited by memory 10.41%
Convolution
   0    alexnet  forward b=64 k=11 p=2 s=4 in=3    out=64   g=1   D=224    2553.4 GFlops (34.43%)     25.0 GB/s ( 8.65%) limited by gflops 34.43% algo=gemm
   0    alexnet bwd-data b=64 k=11 p=2 s=4 in=3    out=64   g=1   D=224     936.3 GFlops (12.62%)      9.2 GB/s ( 3.17%) limited by gflops 12.62% algo=gemm
   0    alexnet bwd-filt b=64 k=11 p=2 s=4 in=3    out=64   g=1   D=224    1645.9 GFlops (22.19%)     16.2 GB/s ( 5.58%) limited by gflops 22.19% algo=gemm
   1    alexnet  forward b=64 k=5  p=2 s=1 in=96   out=192  g=2   D=27     2043.7 GFlops (27.55%)      5.2 GB/s ( 1.80%) limited by gflops 27.55% algo=gemm
   1    alexnet bwd-data b=64 k=5  p=2 s=1 in=96   out=192  g=2   D=27      935.0 GFlops (12.61%)      2.4 GB/s ( 0.82%) limited by gflops 12.61% algo=gemm
   1    alexnet bwd-filt b=64 k=5  p=2 s=1 in=96   out=192  g=2   D=27     1475.1 GFlops (19.89%)      3.8 GB/s ( 1.32%) limited by gflops 19.89% algo=gemm
   2    alexnet  forward b=64 k=5  p=2 s=1 in=64   out=192  g=1   D=27     2707.0 GFlops (36.50%)      4.6 GB/s ( 1.60%) limited by gflops 36.50% algo=gemm
   2    alexnet bwd-data b=64 k=5  p=2 s=1 in=64   out=192  g=1   D=27     1004.3 GFlops (13.54%)      1.7 GB/s ( 0.59%) limited by gflops 13.54% algo=gemm
   2    alexnet bwd-filt b=64 k=5  p=2 s=1 in=64   out=192  g=1   D=27     1931.3 GFlops (26.04%)      3.4 GB/s ( 1.17%) limited by gflops 26.04% algo=gemm
   3    alexnet  forward b=64 k=3  p=1 s=1 in=384  out=256  g=1   D=13     5807.3 GFlops (78.30%)      9.5 GB/s ( 3.28%) limited by gflops 78.30% algo=winograd
   3    alexnet bwd-data b=64 k=3  p=1 s=1 in=384  out=256  g=1   D=13     5359.8 GFlops (72.26%)      8.7 GB/s ( 3.02%) limited by gflops 72.26% algo=winograd
   3    alexnet bwd-filt b=64 k=3  p=1 s=1 in=384  out=256  g=1   D=13     5112.9 GFlops (68.94%)      9.3 GB/s ( 3.21%) limited by gflops 68.94% algo=winograd
   4     resnet  forward b=64 k=7  p=3 s=2 in=3    out=64   g=1   D=224    2263.1 GFlops (30.51%)     36.6 GB/s (12.64%) limited by gflops 30.51% algo=gemm
   4     resnet bwd-data b=64 k=7  p=3 s=2 in=3    out=64   g=1   D=224    1183.1 GFlops (15.95%)     19.1 GB/s ( 6.61%) limited by gflops 15.95% algo=gemm
   4     resnet bwd-filt b=64 k=7  p=3 s=2 in=3    out=64   g=1   D=224     900.3 GFlops (12.14%)     14.5 GB/s ( 5.03%) limited by gflops 12.14% algo=gemm
   5     resnet  forward b=64 k=1  p=0 s=1 in=64   out=256  g=1   D=56      863.3 GFlops (11.64%)     33.7 GB/s (11.66%) limited by memory 11.66% algo=gemm
   5     resnet bwd-data b=64 k=1  p=0 s=1 in=64   out=256  g=1   D=56     2492.5 GFlops (33.61%)     97.4 GB/s (33.67%) limited by memory 33.67% algo=gemm
   5     resnet bwd-filt b=64 k=1  p=0 s=1 in=64   out=256  g=1   D=56     1976.0 GFlops (26.64%)     77.2 GB/s (26.70%) limited by memory 26.70% algo=gemm
   6     resnet  forward b=64 k=1  p=0 s=1 in=64   out=64   g=1   D=56     1991.7 GFlops (26.85%)    124.5 GB/s (43.04%) limited by memory 43.04% algo=gemm
   6     resnet bwd-data b=64 k=1  p=0 s=1 in=64   out=64   g=1   D=56     1752.7 GFlops (23.63%)    109.6 GB/s (37.87%) limited by memory 37.87% algo=gemm
   6     resnet bwd-filt b=64 k=1  p=0 s=1 in=64   out=64   g=1   D=56      683.0 GFlops ( 9.21%)     42.7 GB/s (14.76%) limited by memory 14.76% algo=gemm
   7     resnet  forward b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=56     5338.1 GFlops (71.97%)     37.1 GB/s (12.83%) limited by gflops 71.97% algo=winograd
   7     resnet bwd-data b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=56     2750.2 GFlops (37.08%)     19.1 GB/s ( 6.61%) limited by gflops 37.08% algo=winograd
   7     resnet bwd-filt b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=56     4882.9 GFlops (65.83%)     34.0 GB/s (11.76%) limited by gflops 65.83% algo=winograd
   8     resnet  forward b=64 k=1  p=0 s=2 in=1024 out=2048 g=1   D=14      935.6 GFlops (12.61%)      6.1 GB/s ( 2.10%) limited by gflops 12.61% algo=gemm
   8     resnet bwd-data b=64 k=1  p=0 s=2 in=1024 out=2048 g=1   D=14      925.6 GFlops (12.48%)      6.0 GB/s ( 2.08%) limited by gflops 12.48% algo=gemm
   8     resnet bwd-filt b=64 k=1  p=0 s=2 in=1024 out=2048 g=1   D=14      914.9 GFlops (12.33%)      6.5 GB/s ( 2.26%) limited by gflops 12.33% algo=gemm
   9     resnet  forward b=64 k=1  p=0 s=1 in=1024 out=256  g=1   D=14      876.9 GFlops (11.82%)      8.7 GB/s ( 3.01%) limited by gflops 11.82% algo=gemm
   9     resnet bwd-data b=64 k=1  p=0 s=1 in=1024 out=256  g=1   D=14      961.2 GFlops (12.96%)      9.5 GB/s ( 3.30%) limited by gflops 12.96% algo=gemm
   9     resnet bwd-filt b=64 k=1  p=0 s=1 in=1024 out=256  g=1   D=14      701.1 GFlops ( 9.45%)      7.1 GB/s ( 2.44%) limited by gflops  9.45% algo=gemm
  10     resnet  forward b=64 k=3  p=1 s=1 in=256  out=256  g=1   D=14     6244.5 GFlops (84.19%)     11.8 GB/s ( 4.09%) limited by gflops 84.19% algo=winograd
  10     resnet bwd-data b=64 k=3  p=1 s=1 in=256  out=256  g=1   D=14     5557.8 GFlops (74.93%)     10.5 GB/s ( 3.64%) limited by gflops 74.93% algo=winograd
  10     resnet bwd-filt b=64 k=3  p=1 s=1 in=256  out=256  g=1   D=14     4982.5 GFlops (67.18%)     10.2 GB/s ( 3.54%) limited by gflops 67.18% algo=winograd
  11        vgg  forward b=64 k=3  p=1 s=1 in=3    out=64   g=1   D=224     965.6 GFlops (13.02%)     74.9 GB/s (25.89%) limited by memory 25.89% algo=gemm
  11        vgg bwd-data b=64 k=3  p=1 s=1 in=3    out=64   g=1   D=224     466.0 GFlops ( 6.28%)     36.1 GB/s (12.49%) limited by memory 12.49% algo=gemm
  11        vgg bwd-filt b=64 k=3  p=1 s=1 in=3    out=64   g=1   D=224     324.0 GFlops ( 4.37%)     25.1 GB/s ( 8.69%) limited by memory  8.69% algo=winograd
  12        vgg  forward b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=224    5419.6 GFlops (73.07%)     37.6 GB/s (13.01%) limited by gflops 73.07% algo=winograd
  12        vgg bwd-data b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=224    2456.4 GFlops (33.12%)     17.1 GB/s ( 5.90%) limited by gflops 33.12% algo=winograd
  12        vgg bwd-filt b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=224    3424.1 GFlops (46.17%)     23.8 GB/s ( 8.22%) limited by gflops 46.17% algo=winograd
  13        vgg  forward b=64 k=3  p=1 s=1 in=512  out=512  g=1   D=28     6617.7 GFlops (89.22%)      6.0 GB/s ( 2.08%) limited by gflops 89.22% algo=winograd
  13        vgg bwd-data b=64 k=3  p=1 s=1 in=512  out=512  g=1   D=28     6938.4 GFlops (93.55%)      6.3 GB/s ( 2.18%) limited by gflops 93.55% algo=winograd
  13        vgg bwd-filt b=64 k=3  p=1 s=1 in=512  out=512  g=1   D=28     5956.2 GFlops (80.30%)      5.6 GB/s ( 1.95%) limited by gflops 80.30% algo=winograd
  14     mobile  forward b=64 k=3  p=1 s=2 in=3    out=32   g=1   D=224    1184.4 GFlops (15.97%)    120.6 GB/s (41.70%) limited by memory 41.70% algo=gemm
  14     mobile bwd-data b=64 k=3  p=1 s=2 in=3    out=32   g=1   D=224     371.9 GFlops ( 5.01%)     37.9 GB/s (13.09%) limited by memory 13.09% algo=gemm
  14     mobile bwd-filt b=64 k=3  p=1 s=2 in=3    out=32   g=1   D=224     172.9 GFlops ( 2.33%)     17.6 GB/s ( 6.09%) limited by memory  6.09% algo=gemm
  15     mobile  forward b=64 k=3  p=1 s=1 in=144  out=144  g=144 D=56      281.3 GFlops ( 3.79%)    125.0 GB/s (43.22%) limited by memory 43.22% algo=depthwise_separable
  15     mobile bwd-data b=64 k=3  p=1 s=1 in=144  out=144  g=144 D=56       50.3 GFlops ( 0.68%)     22.4 GB/s ( 7.73%) limited by memory  7.73% algo=depthwise_separable
  15     mobile bwd-filt b=64 k=3  p=1 s=1 in=144  out=144  g=144 D=56       77.0 GFlops ( 1.04%)     34.2 GB/s (11.83%) limited by memory 11.83% algo=depthwise_separable
  16     mobile  forward b=64 k=3  p=1 s=2 in=144  out=144  g=144 D=56       13.9 GFlops ( 0.19%)     15.4 GB/s ( 5.33%) limited by memory  5.33% algo=gemm
  16     mobile bwd-data b=64 k=3  p=1 s=2 in=144  out=144  g=144 D=56       12.5 GFlops ( 0.17%)     13.9 GB/s ( 4.82%) limited by memory  4.82% algo=gemm
  16     mobile bwd-filt b=64 k=3  p=1 s=2 in=144  out=144  g=144 D=56       24.9 GFlops ( 0.34%)     27.6 GB/s ( 9.55%) limited by memory  9.55% algo=gemm
  17     mobile  forward b=64 k=1  p=0 s=1 in=144  out=24   g=1   D=56      884.8 GFlops (11.93%)     86.0 GB/s (29.74%) limited by memory 29.74% algo=gemm
  17     mobile bwd-data b=64 k=1  p=0 s=1 in=144  out=24   g=1   D=56      594.4 GFlops ( 8.01%)     57.8 GB/s (19.98%) limited by memory 19.98% algo=gemm
  17     mobile bwd-filt b=64 k=1  p=0 s=1 in=144  out=24   g=1   D=56      861.2 GFlops (11.61%)     83.7 GB/s (28.95%) limited by memory 28.95% algo=gemm
  18     mobile  forward b=64 k=1  p=0 s=1 in=24   out=144  g=1   D=56      969.1 GFlops (13.07%)     94.2 GB/s (32.57%) limited by memory 32.57% algo=gemm
  18     mobile bwd-data b=64 k=1  p=0 s=1 in=24   out=144  g=1   D=56      833.7 GFlops (11.24%)     81.1 GB/s (28.02%) limited by memory 28.02% algo=gemm
  18     mobile bwd-filt b=64 k=1  p=0 s=1 in=24   out=144  g=1   D=56      867.8 GFlops (11.70%)     84.4 GB/s (29.17%) limited by memory 29.17% algo=gemm
  19     mobile  forward b=64 k=1  p=0 s=1 in=960  out=160  g=1   D=7      2012.2 GFlops (27.13%)     30.6 GB/s (10.59%) limited by gflops 27.13% algo=gemm
  19     mobile bwd-data b=64 k=1  p=0 s=1 in=960  out=160  g=1   D=7       933.9 GFlops (12.59%)     14.2 GB/s ( 4.91%) limited by gflops 12.59% algo=gemm
  19     mobile bwd-filt b=64 k=1  p=0 s=1 in=960  out=160  g=1   D=7      1677.5 GFlops (22.62%)     26.6 GB/s ( 9.20%) limited by gflops 22.62% algo=gemm
  20     mobile  forward b=64 k=1  p=0 s=1 in=960  out=320  g=1   D=7       747.6 GFlops (10.08%)      6.7 GB/s ( 2.32%) limited by gflops 10.08% algo=gemm
  20     mobile bwd-data b=64 k=1  p=0 s=1 in=960  out=320  g=1   D=7       952.2 GFlops (12.84%)      8.5 GB/s ( 2.95%) limited by gflops 12.84% algo=gemm
  20     mobile bwd-filt b=64 k=1  p=0 s=1 in=960  out=320  g=1   D=7       601.9 GFlops ( 8.11%)      5.8 GB/s ( 2.00%) limited by gflops  8.11% algo=gemm
  21     mobile  forward b=64 k=3  p=1 s=1 in=960  out=960  g=960 D=7       258.7 GFlops ( 3.49%)    115.1 GB/s (39.81%) limited by memory 39.81% algo=depthwise_separable
  21     mobile bwd-data b=64 k=3  p=1 s=1 in=960  out=960  g=960 D=7        46.2 GFlops ( 0.62%)     20.5 GB/s ( 7.10%) limited by memory  7.10% algo=depthwise_separable
  21     mobile bwd-filt b=64 k=3  p=1 s=1 in=960  out=960  g=960 D=7        43.5 GFlops ( 0.59%)     19.4 GB/s ( 6.70%) limited by memory  6.70% algo=depthwise_separable
  22      scale  forward b=64 k=1  p=0 s=1 in=256  out=256  g=256 D=56       47.5 GFlops ( 0.64%)    190.0 GB/s (65.70%) limited by memory 65.70% algo=depthwise_separable
  22      scale bwd-data b=64 k=1  p=0 s=1 in=256  out=256  g=256 D=56       25.1 GFlops ( 0.34%)    100.5 GB/s (34.75%) limited by memory 34.75% algo=depthwise_separable
  22      scale bwd-filt b=64 k=1  p=0 s=1 in=256  out=256  g=256 D=56       46.9 GFlops ( 0.63%)    187.6 GB/s (64.86%) limited by memory 64.86% algo=depthwise_separable
  23      scale  forward b=64 k=1  p=0 s=1 in=1024 out=1024 g=1024 D=7        42.0 GFlops ( 0.57%)    168.2 GB/s (58.14%) limited by memory 58.14% algo=depthwise_separable
  23      scale bwd-data b=64 k=1  p=0 s=1 in=1024 out=1024 g=1024 D=7        24.3 GFlops ( 0.33%)     97.2 GB/s (33.60%) limited by memory 33.60% algo=depthwise_separable
  23      scale bwd-filt b=64 k=1  p=0 s=1 in=1024 out=1024 g=1024 D=7         8.8 GFlops ( 0.12%)     35.3 GB/s (12.19%) limited by memory 12.19% algo=depthwise_separable
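
Each row above reports a compute percentage, a memory percentage, and a "limited by" label. Judging only from the printed numbers (not from the dlprim_flops source), the label follows whichever percentage is larger. The GFlops percentages match division by the float peak of 7417.01, while the memory percentages match division by 289.269 GB/s (the vector-size-16 result) rather than the 295.808 GB/s peak. A minimal Python sketch under those assumptions, with hypothetical names:

```python
# Reference values read from the output above. Using 289.269 GB/s as the
# memory reference is an inference from the printed percentages.
PEAK_GFLOPS = 7417.01
MEMORY_GBS = 289.269


def classify(gflops, gbs, peak_gflops=PEAK_GFLOPS, memory_gbs=MEMORY_GBS):
    """Return (limiting resource, utilisation in percent) for one row."""
    flops_pct = 100.0 * gflops / peak_gflops
    mem_pct = 100.0 * gbs / memory_gbs
    if flops_pct >= mem_pct:
        return "gflops", flops_pct
    return "memory", mem_pct


# GEMM "NN 0" and "NN 5" rows from the table above.
for name, gflops, gbs in [("NN 0", 1974.2, 23.2), ("NN 5", 896.4, 57.4)]:
    label, pct = classify(gflops, gbs)
    print(f"{name}: limited by {label} {pct:.2f}%")
```

Because the table rounds GFlops and GB/s to one decimal, recomputed percentages can differ from the printed ones in the last digit.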
masc-it commented 3 years ago

To make it recognize boost python and numpy, I had to slightly edit CMakeLists.txt (lines 64-71):

find_package(PythonLibs 3)
find_package(Boost COMPONENTS python numpy)

if(PYTHONLIBS_FOUND AND Boost_NUMPY_FOUND AND Boost_PYTHON_FOUND)
    set(BUILD_PYDLPRIM TRUE)
else()
    set(BUILD_PYDLPRIM FALSE)
endif() 

Build output seems just fine now:

cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo

-- HDF5: Using hdf5 compiler wrapper to determine C configuration
=== Status ===
  OpenCL: include /usr/include
          lib     /usr/lib/x86_64-linux-gnu/libOpenCL.so
  Python: /usr/bin/python3
  BLAS: include /usr/include/x86_64-linux-gnu
        lib /usr/lib/x86_64-linux-gnu/libopenblas.so
  HDF5: include /usr/include/hdf5/serial
        lib  /usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5.so;/usr/lib/x86_64-linux-gnu/libpthread.so;/usr/lib/x86_64-linux-gnu/libsz.so;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so hdf5_cpp
  Python dlprim: enabled
  Python: lib /usr/lib/x86_64-linux-gnu/libpython3.8.so
          include /usr/include/python3.8
  Boost: include /usr/include
     boost_numpy3 
     boost_python3 
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/88D86BFED86BE940/Projects/opensource/dlprimitives/build
artyom-beilis commented 3 years ago

find_package(Boost COMPONENTS python numpy)

The problem with that is that it finds the boost python and numpy libraries for Python 2 instead of the Python 3 that I expect at runtime.

See:

$ dpkg -L libboost-numpy1.65-dev  | grep .so
/usr/lib/x86_64-linux-gnu/libboost_numpy-py27.so
/usr/lib/x86_64-linux-gnu/libboost_numpy.so
/usr/lib/x86_64-linux-gnu/libboost_numpy3-py36.so
/usr/lib/x86_64-linux-gnu/libboost_numpy3.so

Please check your installation of boost python/numpy, and of numpy in general, under python3/pip3.
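
The library names in the dpkg listing encode the Python version they were built for, which is what makes the plain "python"/"numpy" components ambiguous. A minimal Python sketch that extracts that version tag from a file name follows. The regular expression only covers the naming patterns that appear in this thread, and the function name is hypothetical:

```python
import os
import re

# Matches names such as libboost_numpy-py27.so, libboost_numpy3-py36.so,
# libboost_numpy38.so and libboost_python38.so.1.71.0 (patterns seen in this
# thread; other distributions may name the files differently).
_BOOST_PY_LIB = re.compile(r"libboost_(python|numpy)(\d*)(?:-py(\d+))?\.so")


def python_tag(path):
    """Return the Python version tag in a Boost library name, or None."""
    match = _BOOST_PY_LIB.match(os.path.basename(path))
    if match is None:
        return None
    # Prefer the explicit "-pyXY" suffix, then the digits after the component.
    return match.group(3) or match.group(2) or None


for lib in [
    "/usr/lib/x86_64-linux-gnu/libboost_numpy-py27.so",
    "/usr/lib/x86_64-linux-gnu/libboost_numpy.so",
    "/usr/lib/x86_64-linux-gnu/libboost_numpy3-py36.so",
    "/usr/lib/x86_64-linux-gnu/libboost_numpy38.so",
]:
    print(lib, "->", python_tag(lib))
```

A name without any tag, such as libboost_numpy.so, says nothing about the Python version by itself, so on a system like the one in the listing above it has to be checked against what it links to.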

masc-it commented 3 years ago

My output is: /usr/lib/x86_64-linux-gnu/libboost_numpy38.so

I don't think it is loading python2 deps, at least from the build output:

Boost: include /usr/include
     boost_numpy3 
     boost_python3

Furthermore, I do not have any boost python2 related libs installed.

dpkg -L libboost-python1.71-dev  | grep .so
/usr/lib/x86_64-linux-gnu/libboost_python38.so
artyom-beilis commented 3 years ago

And does it find boost_numpy3? What do the relevant parts of the cache file contain? You can check with: grep Boost CMakeCache.txt

artyom-beilis commented 3 years ago

Can you please check the latest changeset, 3544774f06cde2c?

I added several search strategies for finding the appropriate boost python/numpy.

artyom-beilis commented 3 years ago

Have you checked whether these changes solved the problem?

masc-it commented 3 years ago

grep Boost CMakeCache.txt
//The directory containing a CMake configuration file for Boost.
Boost_DIR:PATH=/usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0
Boost_INCLUDE_DIR:PATH=/usr/include
Boost_NUMPY_LIBRARY_RELEASE:STRING=/usr/lib/x86_64-linux-gnu/libboost_numpy38.so.1.71.0
Boost_PYTHON_LIBRARY_RELEASE:STRING=/usr/lib/x86_64-linux-gnu/libboost_python38.so.1.71.0
//ADVANCED property for variable: Boost_DIR
Boost_DIR-ADVANCED:INTERNAL=1
//Details about finding Boost
FIND_PACKAGE_MESSAGE_DETAILS_Boost:INTERNAL=[/usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake][cfound components: python numpy ][v1.71.0()]
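
The grep output above follows the usual CMakeCache.txt layout of NAME:TYPE=VALUE entries, with lines starting with // or # acting as comments. A minimal Python sketch that pulls the Boost-related entries out of such text follows. The function name is hypothetical, and quoted entry names are not handled:

```python
def parse_cmake_cache(text):
    """Parse NAME:TYPE=VALUE lines into {name: (type, value)}."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the // and # comment lines CMake writes.
        if not line or line.startswith(("//", "#")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        name, _, var_type = key.partition(":")
        entries[name] = (var_type, value)
    return entries


# Sample lines copied from the grep output above.
sample = """\
//The directory containing a CMake configuration file for Boost.
Boost_DIR:PATH=/usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0
Boost_INCLUDE_DIR:PATH=/usr/include
Boost_NUMPY_LIBRARY_RELEASE:STRING=/usr/lib/x86_64-linux-gnu/libboost_numpy38.so.1.71.0
Boost_PYTHON_LIBRARY_RELEASE:STRING=/usr/lib/x86_64-linux-gnu/libboost_python38.so.1.71.0
"""

for name, (var_type, value) in parse_cmake_cache(sample).items():
    if name.startswith("Boost_"):
        print(f"{name} [{var_type}] = {value}")
```

Reading the two *_LIBRARY_RELEASE values this way shows which concrete shared libraries the configure step selected.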

Regarding changeset 3544774: I'm trying it right now and will let you know soon.

masc-it commented 3 years ago

Output with changeset 3544774:

cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo

-- HDF5: Using hdf5 compiler wrapper to determine C configuration
-- Could NOT find Boost: missing: python3 numpy3 (found /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found version "1.71.0"))
=== Status ===
  OpenCL: include /usr/include
          lib     /usr/lib/x86_64-linux-gnu/libOpenCL.so
  Python: /usr/bin/python3
  BLAS: include /usr/include/x86_64-linux-gnu
        lib /usr/lib/x86_64-linux-gnu/libopenblas.so
  HDF5: include /usr/include/hdf5/serial
        lib  /usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5.so;/usr/lib/x86_64-linux-gnu/libpthread.so;/usr/lib/x86_64-linux-gnu/libsz.so;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so hdf5_cpp
  Python dlprim: enabled
  Python version: 38
  Python: lib /usr/lib/x86_64-linux-gnu/libpython3.8.so
          include /usr/include/python3.8
  Boost: include /usr/include
     boost_numpy Boost::numpy
     boost_python Boost::python
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/88D86BFED86BE940/Projects/opensource/dlprimitives/build

Although it says "Could NOT find Boost: missing: python3 numpy3", it successfully builds the Python interface when I run make:

[ 79%] Building CXX object CMakeFiles/pydlprim.dir/python/python_interface.cpp.o
[ 81%] Building CXX object CMakeFiles/test_net.dir/tests/test_net.cpp.o
[ 83%] Linking CXX executable test_random
[ 83%] Built target test_random
[ 84%] Linking CXX executable mnist
[ 86%] Linking CXX executable dlprim_flops
[ 86%] Built target dlprim_flops
[ 86%] Built target mnist
[ 88%] Linking CXX executable image_predict
[ 90%] Linking CXX executable train_mnist
[ 90%] Built target train_mnist
[ 90%] Built target image_predict
[ 92%] Linking CXX executable test_net
[ 94%] Linking CXX executable dlprim_benchmark
[ 94%] Built target test_net
[ 94%] Built target dlprim_benchmark
[ 96%] Linking CXX executable test_from_template
[ 98%] Linking CXX executable test_json
[ 98%] Built target test_from_template
[ 98%] Built target test_json
[100%] Linking CXX shared library python/dlprim/_pydlprim.so
[100%] Built target pydlprim
artyom-beilis commented 3 years ago

That is because I search for both boost_python3 and boost_python3x.
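
The fallback search described here can be pictured as trying a list of candidate component names until one is found. A minimal Python sketch of such a candidate list follows. The function name, the exact set of names, and their order are assumptions for illustration, not the logic in the project's CMakeLists.txt:

```python
import sys


def boost_component_candidates(version_info=sys.version_info):
    """Return (python, numpy) Boost component names to try, in order.

    Hypothetical ordering: major-only names first, then major+minor names,
    then the unversioned names.
    """
    major, minor = version_info[0], version_info[1]
    return [
        (f"python{major}", f"numpy{major}"),
        (f"python{major}{minor}", f"numpy{major}{minor}"),
        ("python", "numpy"),
    ]


# With Python 3.8, as in the build output above.
for python_name, numpy_name in boost_component_candidates((3, 8)):
    print(python_name, numpy_name)
```

With a scheme like this, a "Could NOT find" message for the first candidate is harmless as long as a later candidate succeeds, which is consistent with the build output above.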

artyom-beilis commented 3 years ago

Closing, as it looks like the issue is resolved.