[Bug]: rocBLAS gemm in python-interface produces wrong results when called more than once

paolodalberto commented 1 year ago

Describe the bug

I created a simple python interface. It works using small example but not in an application. I will provide example and code to reproduce the problem and I think it is a sync problem and I need your help.

To Reproduce

I used a docker tensorflow:latest using rocm 5.6
I installed the clients using install.sh to get the clients code and learn how to write a simple interface.

https://github.com/paolodalberto/MatrixFlow/tree/main/GpuInterface

in particular a simple gemm wrapper https://github.com/paolodalberto/MatrixFlow/blob/main/GpuInterface/gpuinterface.cpp#L425

where I sample the inputs and the outputs to check consistency and I create a single handle, device and prop.

I created a simple case (temp.py) where I call csr_mv, coo_mv, and dgemm. This standalone test works just fine. But then I play with an application

python Examples/play.py where I compare the execution of gemm every time. The first call is correct and the second (no matter how I call it is off)

Expected behavior

Then I use the interface in an application and I compare the result to the usual CPU gemm https://github.com/paolodalberto/MatrixFlow/blob/main/Matrices/matrices.py#L63

root@fastmmw:/matrixflow# python Examples/play.py

    rocblas_status rocblas_dgemm(
        rocblas_handle handle, 
        rocblas_operation transA, 
        rocblas_operation transB, 
        rocblas_int m, 
        rocblas_int n, 
        rocblas_int k, 
        const double *alpha, 
        const double *A, 
        rocblas_int lda, 
        const double *B, 
        rocblas_int ldb, 
        const double *beta, 
        double *C, 
        rocblas_int ldc) 

    rocblas_status rocblas_dgema(
        rocblas_handle handle, 
        rocblas_operation transA, 
        rocblas_operation transB, 
        rocblas_int m, 
        rocblas_int n, 
        rocblas_int k, 
        const double *alpha, 
        const double *A, 
        rocblas_int lda, 
        const double *B, 
        rocblas_int ldb, 
        const double *beta, 
        double *C, 
        rocblas_int ldc) 
rocblas_dgemm( %s, 'n', 'n', %d, %d, %d,  %s, %s, %d, %s, %d,  %s, %s, %d) 
rocblas_dgema( %s, 'n', 'n', %d, %d, %d  %s, %s, %d,  %s, %s, %s,  %s, %d) 

 Fast Matrix Multiplication MxK * KxN -> MxN 
 Introduce --M --K --N
 We show case the rest
compute
     BLAS  0x3c260b0
0 Device: AMD Radeon VII
     A 1 2
     B 1 2
     C 0 0
     Data and Initialization Kernel 10.1151
     Time Kernel 0.39634
     C <- 173880 347760
     Read data from  Kernel 0.000689
m, n, k, lda, ldb, ldc = 80, 80, 80, 80, 80, 80, alpha=1, beta=0
(80, 80) 0.0
time 13.049914598464966
(4, 7)
     BLAS  0x3c260b0
0 Device: AMD Radeon VII
     A -1640 -1680
     B 41 82
     C 0 0
     Data and Initialization Kernel 0.000169
     Time Kernel 0.000137
     C <- -9.01016e+07 -9.22992e+07
     Read data from  Kernel 6.5e-05
m, n, k, lda, ldb, ldc = 40, 40, 40, 40, 40, 40, alpha=1, beta=0
(40, 40) -20637760000.0

If you execute the last operation using GPU1

 C = Matrix(B)
(Pdb) B = numpy.matrix(rocmgpu.gemm(1,L.A1, L.shape[1], R.A1, R.shape[1]))
     BLAS  0x1f0ecf0
1 Device: AMD Radeon VII
     A -1640 -1680
     B 41 82
     C 0 0
     Data and Initialization Kernel 0.223054
     Time Kernel 4.1e-05
     C <- 0 0
     Read data from  Kernel 0.000549
m, n, k, lda, ldb, ldc = 40, 40, 40, 40, 40, 40, alpha=1, beta=0

As you can see the return matrix from the device is off

Log-files

Add full logfiles to help explain your problem.

Environment

Hardware	description
CPU	device name
GPU	device name

The above hardware Table information can be generated by command:

rocminfo | grep Marketing

root@fastmmw:/matrixflow# /opt/rocm/bin/rocminfo 
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen Threadripper 1950X 16-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen Threadripper 1950X 16-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3400                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    65759804(0x3eb6a3c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65759804(0x3eb6a3c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    65759804(0x3eb6a3c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx906                             
  Uuid:                    GPU-25c430a172e17d3e               
  Marketing Name:          AMD Radeon VII                     
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26287(0x66af)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1801                               
  BDFID:                   2816                               
  Internal Node ID:        1                                  
  Compute Unit:            60                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906:sramecc-:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx906                             
  Uuid:                    GPU-23ac796172fd5d6b               
  Marketing Name:          AMD Radeon VII                     
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26287(0x66af)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1801                               
  BDFID:                   17408                              
  Internal Node ID:        2                                  
  Compute Unit:            60                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906:sramecc-:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32

Software	version
rocm-core	v0.0
rocblas	v0.0

The above software Table information can be queried with:

Ubuntu/Debian:
dpkg -s rocm-core | grep Version
dpkg -s rocblas | grep Version
Centos/RHEL:
rpm -qa | grep rocm-core
rpm -qa | grep rocblas
SLES:
zypper se -s | grep rocm-core
zypper se -s | grep rocblas

root@fastmmw:/matrixflow# dpkg -s rocm-core | grep Version
Version: 5.6.0.50600-67~20.04
root@fastmmw:/matrixflow# dpkg -s rocblas | grep Version
Version: 3.0.0.50600-67~20.04

Make sure that ROCm is correctly installed and to capture detailed environment information run the following command:


[environment.txt](https://github.com/ROCmSoftwarePlatform/rocBLAS/files/12242823/environment.txt)```

Attach `environment.txt`

### Additional context
Add any other context about the problem here.

paolodalberto commented 1 year ago

At this point I do not really fancy performance but to have a clear understanding how a simple interface could be made. Eventually I am planning to create kernels codes .. but this is a necessary first step. We can discuss internally if you like.

paolodalberto commented 1 year ago

Thank you for your time

paolodalberto commented 1 year ago

I have found another corner case for gemv

(Pdb) n
     BLAS GEMV 0x239dcf0
0 Device: AMD Radeon VII
     A 0 0
     B -52.9725 -34.4124
     C 0.5 23.375
     Data and Initialization Kernel 0.223526
m, n  = 72, 64, 64, alpha=1, beta=1
4 Error: rocblas error in line  61
invalid size parameter.
     Time Kernel 1e-05
     C <- 0.5 23.375
     Read data from  Kernel 0.000573
0.5

This time I do some parameters wrong ... which one ?

paolodalberto commented 1 year ago

The matrices Layout for rocblas are Fortran like I fixed the problem for above for GEMV ... checking what I am doing wrong for GEMM

paolodalberto commented 1 year ago

V = rocmgpu.gemm(0,LL.A.flatten('F'), LL.shape[0],RR.A.flatten(), RR.shape[1],Result.A.flatten(),Result.shape[1]) This will work :)

paolodalberto commented 1 year ago

nice @rkamd I think I have found out what I am doing wrong :)

paolodalberto commented 1 year ago

I will share the code and consideration tomorrow and I will close the issue. Paolo

rkamd commented 1 year ago

@paolodalberto, sounds good.

TorreZuk commented 1 year ago

@paolodalberto just to clarify in rocBLAS all matrices are column major memory ordered (unfortunately this is python value 'F' and 'C' was used for row major as they are language based abbreviations). See https://rocmdocs.amd.com/projects/rocBLAS/en/develop/API_Reference_Guide.html in the Note section. You can create your numpy array with this layout https://numpy.org/doc/stable/reference/generated/numpy.array.html as the constructor and other methods like reshape may take the argument order='F' to use this layout. It will be good to document this in your python interface as well.

paolodalberto commented 1 year ago

@TorreZuk yep it took a little to pivot but I think I managed :) @rkamd I will close this as reference my basic interface looks like this https://github.com/paolodalberto/MatrixFlow/blob/main/GpuInterface/gpuinterface.cpp https://github.com/paolodalberto/MatrixFlow/blob/main/GpuInterface/procm.py

I have now a simple toy for sparse and dense, GPU and not GPU for GEMV and GPU and not GPU for GEMM

cheers

paolodalberto commented 1 year ago

please contact me directly if you have any further questions

ROCm / rocBLAS