artyom-beilis / pytorch_dlprim

DLPrimitives/OpenCL out of tree backend for pytorch
http://blog.dlprimitives.org/
MIT License

OpenCL is 3x slower #10

Open danielzgtg opened 1 year ago

danielzgtg commented 1 year ago

So I tried running this on the provided mnist.py. My main use case is that I'm on a laptop and I want it to train faster and use less battery. What I did not expect is that it was 3x slower using the GPU than using the CPU. I am on Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz with intel-opencl-icd. Furthermore, the GPU version caused the laptop fans to start spinning, but the CPU version didn't.
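For reference, a minimal self-contained timing sketch that isolates the per-step cost on each device (this assumes the dlprim backend is already loaded the same way mnist.py loads it, so that "opencl:0" is a valid device string; the small conv net is just a stand-in, not the mnist.py model):

```python
# Times one training step of a small conv net on a chosen device.
import time
import torch
import torch.nn as nn

def time_step(device, iters=20):
    model = nn.Sequential(
        nn.Conv2d(1, 32, 3), nn.ReLU(),
        nn.Conv2d(32, 64, 3), nn.ReLU(),
        nn.Flatten(), nn.Linear(64 * 24 * 24, 10),
    ).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    x = torch.randn(64, 1, 28, 28, device=device)
    y = torch.randint(0, 10, (64,), device=device)
    # warm-up so kernel compilation is not included in the measurement
    for _ in range(3):
        opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()
    start = time.time()
    for _ in range(iters):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    loss.item()  # crude synchronization: wait for queued work to finish
    return (time.time() - start) / iters

print("cpu    per step:", time_step("cpu"))
print("opencl per step:", time_step("opencl:0"))
```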

Expected behaviour

(venv) home@daniel-tablet1:~/PycharmProjects/pytorch_dlprim$ python3 mnist.py --device cpu
/home/home/PycharmProjects/whisper/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libc10_cuda.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.337603
Train Epoch: 1 [640/60000 (1%)] Loss: 1.000659
Train Epoch: 1 [1280/60000 (2%)]        Loss: 0.534252
Train Epoch: 1 [1920/60000 (3%)]        Loss: 0.306895
Train Epoch: 1 [2560/60000 (4%)]        Loss: 0.328317
Train Epoch: 1 [3200/60000 (5%)]        Loss: 0.169952
Train Epoch: 1 [3840/60000 (6%)]        Loss: 0.152536
Train Epoch: 1 [4480/60000 (7%)]        Loss: 0.228139
Train Epoch: 1 [5120/60000 (9%)]        Loss: 0.211874
Train Epoch: 1 [5760/60000 (10%)]       Loss: 0.113282
Train Epoch: 1 [6400/60000 (11%)]       Loss: 0.173121
Train Epoch: 1 [7040/60000 (12%)]       Loss: 0.139788
Train Epoch: 1 [7680/60000 (13%)]       Loss: 0.186882
Train Epoch: 1 [8320/60000 (14%)]       Loss: 0.099785
Train Epoch: 1 [8960/60000 (15%)]       Loss: 0.147153
Train Epoch: 1 [9600/60000 (16%)]       Loss: 0.190826
Train Epoch: 1 [10240/60000 (17%)]      Loss: 0.385145
Train Epoch: 1 [10880/60000 (18%)]      Loss: 0.154965
Train Epoch: 1 [11520/60000 (19%)]      Loss: 0.258187
Train Epoch: 1 [12160/60000 (20%)]      Loss: 0.147772
Train Epoch: 1 [12800/60000 (21%)]      Loss: 0.122823
Train Epoch: 1 [13440/60000 (22%)]      Loss: 0.150513
Train Epoch: 1 [14080/60000 (23%)]      Loss: 0.090943
Train Epoch: 1 [14720/60000 (25%)]      Loss: 0.208224
Train Epoch: 1 [15360/60000 (26%)]      Loss: 0.074682
Train Epoch: 1 [16000/60000 (27%)]      Loss: 0.091023
Train Epoch: 1 [16640/60000 (28%)]      Loss: 0.193498
Train Epoch: 1 [17280/60000 (29%)]      Loss: 0.048429
Train Epoch: 1 [17920/60000 (30%)]      Loss: 0.114691
Train Epoch: 1 [18560/60000 (31%)]      Loss: 0.103097
Train Epoch: 1 [19200/60000 (32%)]      Loss: 0.111526
Train Epoch: 1 [19840/60000 (33%)]      Loss: 0.026821
Train Epoch: 1 [20480/60000 (34%)]      Loss: 0.018942
Train Epoch: 1 [21120/60000 (35%)]      Loss: 0.079938
Train Epoch: 1 [21760/60000 (36%)]      Loss: 0.014885
Train Epoch: 1 [22400/60000 (37%)]      Loss: 0.042647
Train Epoch: 1 [23040/60000 (38%)]      Loss: 0.215288
Train Epoch: 1 [23680/60000 (39%)]      Loss: 0.138436
Train Epoch: 1 [24320/60000 (41%)]      Loss: 0.011650
Train Epoch: 1 [24960/60000 (42%)]      Loss: 0.028758
Train Epoch: 1 [25600/60000 (43%)]      Loss: 0.033963
Train Epoch: 1 [26240/60000 (44%)]      Loss: 0.026172
Train Epoch: 1 [26880/60000 (45%)]      Loss: 0.119763
Train Epoch: 1 [27520/60000 (46%)]      Loss: 0.122162
Train Epoch: 1 [28160/60000 (47%)]      Loss: 0.073711
Train Epoch: 1 [28800/60000 (48%)]      Loss: 0.031891
Train Epoch: 1 [29440/60000 (49%)]      Loss: 0.032309
Train Epoch: 1 [30080/60000 (50%)]      Loss: 0.058146
Train Epoch: 1 [30720/60000 (51%)]      Loss: 0.044536
Train Epoch: 1 [31360/60000 (52%)]      Loss: 0.023220
Train Epoch: 1 [32000/60000 (53%)]      Loss: 0.093438
Train Epoch: 1 [32640/60000 (54%)]      Loss: 0.022575
Train Epoch: 1 [33280/60000 (55%)]      Loss: 0.056749
Train Epoch: 1 [33920/60000 (57%)]      Loss: 0.030043
Train Epoch: 1 [34560/60000 (58%)]      Loss: 0.022908
Train Epoch: 1 [35200/60000 (59%)]      Loss: 0.084108
Train Epoch: 1 [35840/60000 (60%)]      Loss: 0.185571
Train Epoch: 1 [36480/60000 (61%)]      Loss: 0.017673
Train Epoch: 1 [37120/60000 (62%)]      Loss: 0.084662
Train Epoch: 1 [37760/60000 (63%)]      Loss: 0.080484
Train Epoch: 1 [38400/60000 (64%)]      Loss: 0.117529
Train Epoch: 1 [39040/60000 (65%)]      Loss: 0.003176
Train Epoch: 1 [39680/60000 (66%)]      Loss: 0.071565
Train Epoch: 1 [40320/60000 (67%)]      Loss: 0.108479
Train Epoch: 1 [40960/60000 (68%)]      Loss: 0.092688
Train Epoch: 1 [41600/60000 (69%)]      Loss: 0.048416
Train Epoch: 1 [42240/60000 (70%)]      Loss: 0.009381
Train Epoch: 1 [42880/60000 (71%)]      Loss: 0.038555
Train Epoch: 1 [43520/60000 (72%)]      Loss: 0.089673
Train Epoch: 1 [44160/60000 (74%)]      Loss: 0.020524
Train Epoch: 1 [44800/60000 (75%)]      Loss: 0.092968
Train Epoch: 1 [45440/60000 (76%)]      Loss: 0.068793
Train Epoch: 1 [46080/60000 (77%)]      Loss: 0.094527
Train Epoch: 1 [46720/60000 (78%)]      Loss: 0.154815
Train Epoch: 1 [47360/60000 (79%)]      Loss: 0.066463
Train Epoch: 1 [48000/60000 (80%)]      Loss: 0.037426
Train Epoch: 1 [48640/60000 (81%)]      Loss: 0.030952
Train Epoch: 1 [49280/60000 (82%)]      Loss: 0.013815
Train Epoch: 1 [49920/60000 (83%)]      Loss: 0.043523
Train Epoch: 1 [50560/60000 (84%)]      Loss: 0.044266
Train Epoch: 1 [51200/60000 (85%)]      Loss: 0.176199
Train Epoch: 1 [51840/60000 (86%)]      Loss: 0.024092
Train Epoch: 1 [52480/60000 (87%)]      Loss: 0.014346
Train Epoch: 1 [53120/60000 (88%)]      Loss: 0.038723
Train Epoch: 1 [53760/60000 (90%)]      Loss: 0.073435
Train Epoch: 1 [54400/60000 (91%)]      Loss: 0.017709
Train Epoch: 1 [55040/60000 (92%)]      Loss: 0.019962
Train Epoch: 1 [55680/60000 (93%)]      Loss: 0.106418
Train Epoch: 1 [56320/60000 (94%)]      Loss: 0.010950
Train Epoch: 1 [56960/60000 (95%)]      Loss: 0.023096
Train Epoch: 1 [57600/60000 (96%)]      Loss: 0.033030
Train Epoch: 1 [58240/60000 (97%)]      Loss: 0.007997
Train Epoch: 1 [58880/60000 (98%)]      Loss: 0.000659
Train Epoch: 1 [59520/60000 (99%)]      Loss: 0.001612
Epoch in  34.0s

Actual behaviour

(venv) home@daniel-tablet1:~/PycharmProjects/pytorch_dlprim$ python3 mnist.py --device opencl:0
/home/home/PycharmProjects/whisper/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libc10_cuda.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Accessing device #0:Intel(R) Iris(R) Plus Graphics [0x8a52] on Intel(R) OpenCL HD Graphics
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.326378
Train Epoch: 1 [640/60000 (1%)] Loss: 1.373419
Train Epoch: 1 [1280/60000 (2%)]        Loss: 0.674224
Train Epoch: 1 [1920/60000 (3%)]        Loss: 0.342615
Train Epoch: 1 [2560/60000 (4%)]        Loss: 0.282575
Train Epoch: 1 [3200/60000 (5%)]        Loss: 0.321835
Train Epoch: 1 [3840/60000 (6%)]        Loss: 0.117600
Train Epoch: 1 [4480/60000 (7%)]        Loss: 0.174937
Train Epoch: 1 [5120/60000 (9%)]        Loss: 0.295922
Train Epoch: 1 [5760/60000 (10%)]       Loss: 0.179234
Train Epoch: 1 [6400/60000 (11%)]       Loss: 0.148632
Train Epoch: 1 [7040/60000 (12%)]       Loss: 0.247433
Train Epoch: 1 [7680/60000 (13%)]       Loss: 0.097251
Train Epoch: 1 [8320/60000 (14%)]       Loss: 0.170669
Train Epoch: 1 [8960/60000 (15%)]       Loss: 0.099438
Train Epoch: 1 [9600/60000 (16%)]       Loss: 0.183732
Train Epoch: 1 [10240/60000 (17%)]      Loss: 0.096929
Train Epoch: 1 [10880/60000 (18%)]      Loss: 0.091889
Train Epoch: 1 [11520/60000 (19%)]      Loss: 0.056076
Train Epoch: 1 [12160/60000 (20%)]      Loss: 0.081981
Train Epoch: 1 [12800/60000 (21%)]      Loss: 0.137648
Train Epoch: 1 [13440/60000 (22%)]      Loss: 0.124434
Train Epoch: 1 [14080/60000 (23%)]      Loss: 0.038791
Train Epoch: 1 [14720/60000 (25%)]      Loss: 0.150997
Train Epoch: 1 [15360/60000 (26%)]      Loss: 0.082680
Train Epoch: 1 [16000/60000 (27%)]      Loss: 0.044054
Train Epoch: 1 [16640/60000 (28%)]      Loss: 0.147787
Train Epoch: 1 [17280/60000 (29%)]      Loss: 0.047737
Train Epoch: 1 [17920/60000 (30%)]      Loss: 0.056453
Train Epoch: 1 [18560/60000 (31%)]      Loss: 0.023077
Train Epoch: 1 [19200/60000 (32%)]      Loss: 0.036574
Train Epoch: 1 [19840/60000 (33%)]      Loss: 0.011139
Train Epoch: 1 [20480/60000 (34%)]      Loss: 0.027549
Train Epoch: 1 [21120/60000 (35%)]      Loss: 0.028380
Train Epoch: 1 [21760/60000 (36%)]      Loss: 0.131590
Train Epoch: 1 [22400/60000 (37%)]      Loss: 0.192181
Train Epoch: 1 [23040/60000 (38%)]      Loss: 0.070133
Train Epoch: 1 [23680/60000 (39%)]      Loss: 0.124290
Train Epoch: 1 [24320/60000 (41%)]      Loss: 0.114533
Train Epoch: 1 [24960/60000 (42%)]      Loss: 0.011495
Train Epoch: 1 [25600/60000 (43%)]      Loss: 0.031055
Train Epoch: 1 [26240/60000 (44%)]      Loss: 0.058615
Train Epoch: 1 [26880/60000 (45%)]      Loss: 0.112524
Train Epoch: 1 [27520/60000 (46%)]      Loss: 0.029194
Train Epoch: 1 [28160/60000 (47%)]      Loss: 0.047580
Train Epoch: 1 [28800/60000 (48%)]      Loss: 0.022058
Train Epoch: 1 [29440/60000 (49%)]      Loss: 0.064951
Train Epoch: 1 [30080/60000 (50%)]      Loss: 0.081404
Train Epoch: 1 [30720/60000 (51%)]      Loss: 0.072505
Train Epoch: 1 [31360/60000 (52%)]      Loss: 0.096956
Train Epoch: 1 [32000/60000 (53%)]      Loss: 0.106381
Train Epoch: 1 [32640/60000 (54%)]      Loss: 0.018265
Train Epoch: 1 [33280/60000 (55%)]      Loss: 0.061221
Train Epoch: 1 [33920/60000 (57%)]      Loss: 0.070425
Train Epoch: 1 [34560/60000 (58%)]      Loss: 0.089722
Train Epoch: 1 [35200/60000 (59%)]      Loss: 0.151525
Train Epoch: 1 [35840/60000 (60%)]      Loss: 0.068132
Train Epoch: 1 [36480/60000 (61%)]      Loss: 0.011085
Train Epoch: 1 [37120/60000 (62%)]      Loss: 0.111000
Train Epoch: 1 [37760/60000 (63%)]      Loss: 0.040008
Train Epoch: 1 [38400/60000 (64%)]      Loss: 0.012150
Train Epoch: 1 [39040/60000 (65%)]      Loss: 0.059965
Train Epoch: 1 [39680/60000 (66%)]      Loss: 0.042966
Train Epoch: 1 [40320/60000 (67%)]      Loss: 0.109453
Train Epoch: 1 [40960/60000 (68%)]      Loss: 0.099907
Train Epoch: 1 [41600/60000 (69%)]      Loss: 0.073859
Train Epoch: 1 [42240/60000 (70%)]      Loss: 0.049867
Train Epoch: 1 [42880/60000 (71%)]      Loss: 0.033700
Train Epoch: 1 [43520/60000 (72%)]      Loss: 0.006360
Train Epoch: 1 [44160/60000 (74%)]      Loss: 0.051153
Train Epoch: 1 [44800/60000 (75%)]      Loss: 0.113450
Train Epoch: 1 [45440/60000 (76%)]      Loss: 0.008563
Train Epoch: 1 [46080/60000 (77%)]      Loss: 0.046368
Train Epoch: 1 [46720/60000 (78%)]      Loss: 0.089523
Train Epoch: 1 [47360/60000 (79%)]      Loss: 0.008030
Train Epoch: 1 [48000/60000 (80%)]      Loss: 0.237780
Train Epoch: 1 [48640/60000 (81%)]      Loss: 0.091529
Train Epoch: 1 [49280/60000 (82%)]      Loss: 0.022425
Train Epoch: 1 [49920/60000 (83%)]      Loss: 0.017645
Train Epoch: 1 [50560/60000 (84%)]      Loss: 0.022220
Train Epoch: 1 [51200/60000 (85%)]      Loss: 0.057755
Train Epoch: 1 [51840/60000 (86%)]      Loss: 0.016291
Train Epoch: 1 [52480/60000 (87%)]      Loss: 0.061722
Train Epoch: 1 [53120/60000 (88%)]      Loss: 0.046042
Train Epoch: 1 [53760/60000 (90%)]      Loss: 0.089375
Train Epoch: 1 [54400/60000 (91%)]      Loss: 0.017928
Train Epoch: 1 [55040/60000 (92%)]      Loss: 0.006611
Train Epoch: 1 [55680/60000 (93%)]      Loss: 0.012605
Train Epoch: 1 [56320/60000 (94%)]      Loss: 0.153086
Train Epoch: 1 [56960/60000 (95%)]      Loss: 0.037731
Train Epoch: 1 [57600/60000 (96%)]      Loss: 0.119136
Train Epoch: 1 [58240/60000 (97%)]      Loss: 0.029190
Train Epoch: 1 [58880/60000 (98%)]      Loss: 0.007807
Train Epoch: 1 [59520/60000 (99%)]      Loss: 0.051748
Epoch in  95.5s
artyom-beilis commented 1 year ago

First of all, the Intel GPU itself has quite poor performance. How many compute units does the GPU have? Can you show clinfo output? It isn't even clear that the GPU has more GFLOPS than the CPU.

Run dlprim_flops on this device to see its general performance.

Also, MNIST is a really tiny network that doesn't benefit much from even a high-performance GPU. On the other hand, pytorch comes with a highly optimized CPU implementation from Intel, so it is hard to beat the CPU in this case.

In any case, for the channels-first memory order that pytorch uses, the performance of my implementation is actually good and sometimes even better than the GPU routines implemented by Intel.
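For reference, stock PyTorch keeps 4D tensors in channels-first (NCHW) order by default; a quick check:

```python
# 4D tensors are NCHW ("channels first") unless explicitly converted.
import torch

x = torch.randn(8, 3, 32, 32)                               # N, C, H, W
print(x.is_contiguous())                                    # True  -> channels-first layout
print(x.is_contiguous(memory_format=torch.channels_last))   # False

x_cl = x.to(memory_format=torch.channels_last)              # NHWC layout, same logical shape
print(x_cl.shape, x_cl.stride())
```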

artyom-beilis commented 1 year ago

See this issue regarding the performance of Intel's own GPU implementation: https://github.com/oneapi-src/oneDNN/issues/1194

kurnevsky commented 1 year ago

First of all, thanks for the project! It's really cool that pytorch can utilize OpenCL now (I'm not a fan of nvidia and am very unlikely to ever own one of their video cards).

I have a very similar setup and results: OpenCL on the Intel video card is more than 3 times slower than executing on 4 CPU cores. But I also have an older laptop with a discrete AMD card using Mesa Clover OpenCL, and there I get a ~1.5x performance boost compared to the CPU (which is still slower than the CPU of the newer laptop). It probably all depends on the task being executed, though.

How many compute units does the GPU have? Can you show clinfo output? It isn't even clear that the GPU has more GFLOPS than the CPU.

Here is what I have for the Intel card:

  Max compute units                               24
  Max clock frequency                             1150MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple (device)     32
  Preferred work group size multiple (kernel)     32
  Max sub-groups per work group                   32
  Sub-group sizes (Intel)                         8, 16, 32
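For a rough idea of what those numbers mean in FLOPS, here is a small pyopencl sketch (pyopencl assumed installed; the 16 FLOP/cycle/EU factor is a rule of thumb for Intel Gen9/Gen11 iGPUs with two SIMD-4 FMA pipes per EU, so the result is only a ballpark and other vendors need a different factor):

```python
# Rough peak-FP32 estimate from clinfo-style device info.
import pyopencl as cl

for platform in cl.get_platforms():
    for dev in platform.get_devices(device_type=cl.device_type.GPU):
        cus = dev.max_compute_units
        mhz = dev.max_clock_frequency
        est_gflops = cus * 16 * mhz / 1000.0   # 16 FLOP/cycle/EU assumption
        print(f"{dev.name}: {cus} CUs @ {mhz} MHz ~ {est_gflops:.0f} GFLOPS FP32 (estimate)")
```

For the 24-CU, 1150 MHz device above this gives roughly 440 GFLOPS FP32.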
danielzgtg commented 1 year ago

I also have an old desktop with an AMD card. It's a shame that it's stuck on OpenCL 1 with Clover, but Rusticl has OpenCL 3. Mesa 22.3 should be out by December.

artyom-beilis commented 1 year ago

OpenCL 1 with Clover,

AFAIR it is OpenCL 1.1. I tested an RX 560 with the Clover driver and it worked fine; it was somewhat slower than ROCm, but it worked.

tangjinchuan commented 8 months ago

Dear Artyom-beilis,

I have been a huge fan of your fantastic work from the beginning. I volunteer to help tune CLBlast for many devices (https://github.com/CNugteren/CLBlast/issues/1).

I know you have also mentioned CLBlast before, in connection with Winograd or something related to it. If I remember your DLPrimitives kernels correctly, you are using your own GEMM, and I am just wondering whether it is properly tuned for Intel GPUs. I know Intel is good at tuning CPU GEMM (MKL); however, their previous open-source OpenCL implementation in OpenCV was only textbook quality. So to get speed out of their GPUs, I think CLBlast is a good choice. After tuning CLBlast, I got a 3x performance boost on my iGPU (before tuning, the GPU was worse than the CPU). My testing is here: https://www.researchgate.net/figure/i7-1165g7-tuned-clblast_fig9_366394124. So may I suggest adapting CLBlast to DLPrimitives to see if it gives more speedup on fully connected layers / im2col-based convolution? In addition, it has been used in llama.cpp to achieve performance comparable to cuBLAS. Please forgive me if I am wrong; I am just very interested in your fantastic work.

Best wishes, Jinchuan

artyom-beilis commented 8 months ago

First of all, I'm not sure that CLBlast has much of an advantage over dlprimitives. For example, for a GEMM with m=n=k=512 both give ~130-140 GFLOPS.

For example, my GEMM on Intel(R) UHD Graphics 630 [0x3e9b] on Intel(R) OpenCL HD Graphics, where this GPU's peak GFLOPS is around 401.413:

dlprimitives sgemm

GEMM
  NN  0:  512,  512,  512      136.9 GFlops (34.09%)      1.6 GB/s (11.07%) limited by gflops 34.09%
  NT  0:  512,  512,  512      104.4 GFlops (26.01%)      1.2 GB/s ( 8.44%) limited by gflops 26.01%
  TN  0:  512,  512,  512      190.9 GFlops (47.56%)      2.2 GB/s (15.44%) limited by gflops 47.56%
  TT  0:  512,  512,  512      162.0 GFlops (40.35%)      1.9 GB/s (13.10%) limited by gflops 40.35%

clblast sgemm

NN: 143.772 GFLOPS
TN: 154.547
NT: 129.747
TT: 143.407

So it is not necessarily clear-cut that CLBlast has better performance. Also note that when running convolution, the im2col is actually integrated into the GEMM itself, so switching to CLBlast isn't trivial at all.

On the same note, I should mention that Intel's own deep learning library does not run well for channels-first convolution, since they simply don't optimise for it: https://github.com/oneapi-src/oneDNN/issues/1194. dlprimitives actually performs better for channels-first order.

tangjinchuan commented 8 months ago

Hi, the problem is the size. For example, MNIST has 60000 training samples, each 28x28 pixels. Hence, I am very interested in how a 60000x784 by 784x784 GEMM (imitating the input and the weights of the hidden layer) performs compared to 512x512. In my experience, GPU GEMM normally performs only about as well as the CPU when the size is small.

Best wishes
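A minimal sketch for measuring that GEMM size in GFLOPS with torch.matmul on both devices (assuming the dlprim backend is loaded as in mnist.py so that "opencl:0" resolves; the 2*M*N*K FLOP count is the standard GEMM figure, and 784 = 28*28):

```python
# Measure the MNIST-sized GEMM (60000x784 times 784x784) in GFLOPS.
import time
import torch

def gemm_gflops(device, m=60000, n=784, k=784, iters=5):
    a = torch.randn(m, k, device=device)
    b = torch.randn(k, n, device=device)
    c = a @ b                      # warm-up (kernel build / first-run cost)
    c.cpu()                        # crude sync
    start = time.time()
    for _ in range(iters):
        c = a @ b
    c.cpu()                        # force completion before stopping the clock
    secs = (time.time() - start) / iters
    return 2.0 * m * n * k / secs / 1e9

print("cpu   :", gemm_gflops("cpu"))
print("opencl:", gemm_gflops("opencl:0"))
```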

artyom-beilis commented 8 months ago

GPU GEMM normally performs only about as well as the CPU when the size is small.

Which GPU do you have?

A typical top-end Intel Skylake GPU has ~400 GFLOPS, which is quite a lot, but a typical multi-core CPU has even more FLOPS.

If I run torch.matmul on the CPU of the same computer that has the ~400 GFLOPS GPU, I get ~500 GFLOPS (for matmul), i.e. way more than I can get from the GPU.

Intel GPUs unfortunately aren't as powerful as modern-day CPUs: they are good for graphics, less so for computation.

So I'm not surprised that you get better performance on the CPU than on the GPU.

tangjinchuan commented 7 months ago

Dear Artyom, I have several Intel iGPUs and Arc GPUs in my lab. Taking the i7-1165G7 as an example, what I got is here: i7-1165g7.zip. I admit that the code I implemented did not consider zero-copy for the iGPU. All blue dots on both graphs are for the iGPU. This is a case where Intel makes a better GPU than CPU cores (the CPU used OpenBLAS).

Best wishes, Jinchuan

artyom-beilis commented 7 months ago

I have several Intel iGPUs and Arc GPUs in my lab.

Nice. Can you post benchmark numbers for the Arc GPUs, i.e. run ./dlprim_flops? I have not managed to test it or tune for it.

CPU cores (CPU used Openblas).

OpenBLAS is somewhat slower. I tested with torch.matmul and it is damn fast; the 500 GFLOPS figure was with torch.matmul. OpenBLAS's sgemm gives only 250 GFLOPS.

tangjinchuan commented 7 months ago

Dear Artyom, let me interest you with something more interesting: an AMD 7900 XTX. I freshly built dlprim_flops on Windows and got the following result: 7900XTX.txt

I will try to run a test on an Arc A770 16G when I find a free machine where I can remove the CUDA/AMD card and plug in the Arc card.

For torch, I believe it uses MKL for BLAS if installed via Conda; otherwise it should be OpenBLAS.

Best wishes, Jinchuan

tangjinchuan commented 7 months ago

Dear Artyom, I am also supervising an undergraduate student to do a comparative study on the performance of different GPU random number generators. May I ask if there is any paper/material that is related to the random kernel? (https://github.com/artyom-beilis/dlprimitives/blob/master/src/kernels/random.cl) Thank you very much! Best wishes, Jinchuan

artyom-beilis commented 7 months ago

I am also supervising an undergraduate student to do a comparative study on the performance of different GPU random number generators. May I ask if there is any paper/material that is related to the random kernel?

It is fairly standard Philox-4x32; see the paper here: https://www.thesalmons.org/john/random123/papers/random123sc11.pdf

I mentioned it in API: https://github.com/artyom-beilis/dlprimitives/blob/master/include/dlprim/core/common.hpp#L47
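For the comparative study, a plain-Python Philox-4x32-10 reference can serve as a CPU oracle. The sketch below follows the Random123 paper linked above with its published constants; whether dlprimitives' random.cl uses exactly the same counter/seed conventions would need to be checked against the kernel itself.

```python
# Plain-Python Philox-4x32-10 reference (Random123 constants).
MASK32 = 0xFFFFFFFF
M0, M1 = 0xD2511F53, 0xCD9E8D57          # round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85          # key increments ("Weyl" constants)

def philox4x32(counter, key, rounds=10):
    c0, c1, c2, c3 = counter              # four 32-bit counter words
    k0, k1 = key                          # two 32-bit key words
    for _ in range(rounds):
        p0 = M0 * c0                      # 64-bit products
        p1 = M1 * c2
        hi0, lo0 = (p0 >> 32) & MASK32, p0 & MASK32
        hi1, lo1 = (p1 >> 32) & MASK32, p1 & MASK32
        c0, c1, c2, c3 = (hi1 ^ c1 ^ k0, lo1, hi0 ^ c3 ^ k1, lo0)
        k0 = (k0 + W0) & MASK32           # bump the key between rounds
        k1 = (k1 + W1) & MASK32
    return c0, c1, c2, c3

# Each distinct counter value yields an independent block of four 32-bit words.
print([hex(w) for w in philox4x32((0, 0, 0, 0), (0, 0))])
```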

artyom-beilis commented 7 months ago

Dear Artyom, let me interest you with something more interesting: an AMD 7900 XTX. I freshly built dlprim_flops on Windows and got the following result: 7900XTX.txt

Best wishes, Jinchuan

Cool. It looks like my peak-FLOPS estimate for this GPU is way off: according to Wikipedia it should be a much more capable GPU. That is why some GEMM ops show more than 100% of peak FLOPS.

tangjinchuan commented 7 months ago

Apple M2 MAX.txt. HDF5_cpp is not built by default according to its website, hence there is no such lib from Homebrew; I disabled the HDF5 lib in the makefile to get this result.

tangjinchuan commented 7 months ago

I am also supervising an undergraduate student to do a comparative study on the performance of different GPU random number generators. May I ask if there is any paper/material that is related to the random kernel?

It is fairly standard Philox-4x32; see the paper here: https://www.thesalmons.org/john/random123/papers/random123sc11.pdf

I mentioned it in API: https://github.com/artyom-beilis/dlprimitives/blob/master/include/dlprim/core/common.hpp#L47

Thank you very much!

tangjinchuan commented 7 months ago

Dear Artyom, regarding https://github.com/artyom-beilis/dlprimitives/blob/46b9d17b76c40d05b323a1a0ea484d61ac5f17b2/src/core/pointwise.cpp#L77C24-L77C25 : wouldn't it be quite expensive to convert \n to \\\n with a for loop, and is it necessary? I did not find similar lines specifically handling this in CLBlast, which also uses R" strings for kernels: https://github.com/CNugteren/CLBlast/blob/master/src/utilities/compile.cpp#L24

Best wishes, Jinchuan

artyom-beilis commented 7 months ago

The pointwise/broadcast kernels are a special kind of kernel that serve as shortcuts for common simple operations: activations, the various reductions needed for normalization, loss functions, etc.

There you take the actual line of code and embed it into a kernel, as here: https://github.com/artyom-beilis/pytorch_dlprim/blob/master/src/pointwise_ops.cpp#L155

Now, I agree that this code can be optimised, but it is also important to remember that since GPU and CPU execution are asynchronous, as long as you can push kernels into the execution queue faster than the GPU can run them, you don't bottleneck your system.
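To make the discussion concrete, here is a toy illustration (not dlprimitives' actual code; the template and names are made up) of the idea: a caller-supplied expression string is spliced into a kernel template at runtime, and newline escaping matters because the expression may end up inside a #define, where every line has to end with a backslash.

```python
# Toy pointwise-kernel source generator.
KERNEL_TEMPLATE = r"""
#define CALC(y0, x0, w0) do {{ {expr} }} while(0)
__kernel void pointwise(__global float *y, __global const float *x,
                        float w, int n)
{{
    int i = get_global_id(0);
    if (i < n)
        CALC(y[i], x[i], w);
}}
"""

def make_pointwise_source(expr: str) -> str:
    # keep the macro legal even if the caller's expression spans several lines
    expr = expr.replace("\n", "\\\n")
    return KERNEL_TEMPLATE.format(expr=expr)

# e.g. a leaky-ReLU-like expression supplied as a plain string:
print(make_pointwise_source("y0 = x0 > 0 ? x0 : w0 * x0;"))
```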

tangjinchuan commented 7 months ago

I appreciate the smart way you organised the global kernel parameters with different lengths. Personally, I would write independent kernels for the above operations, as well as for prescan and other things. May I infer that all the kernels in dlprimitives are fully asynchronous? Is this the reason the dimensionality is restricted to 8D in https://github.com/artyom-beilis/dlprimitives/blob/master/src/kernels/broadcast_dims.h#L52C13-L52C15 ?

Best wishes,

artyom-beilis commented 7 months ago

I would write independent kernels for the above operations

That is a problem if you need to support different kinds of broadcasts/reductions, e.g. A.shape = [1,3,1,10], B.shape = [3,2,1] and you compute A+B -> shape = [1,3,2,10] (see the quick check below). This kind of thing requires dynamic code generation; add reductions over different dimensions like mean(A, dim=(...)) and you can't write a single kernel for them all.
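The broadcast rule in that example can be checked with stock PyTorch:

```python
# [1,3,1,10] combined with [3,2,1] broadcasts to [1,3,2,10].
import torch

A = torch.zeros(1, 3, 1, 10)
B = torch.zeros(3, 2, 1)
print((A + B).shape)   # torch.Size([1, 3, 2, 10])
```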

The independent kernels are mostly for computationally intensive / non-standard tasks like pooling, convolutions, etc. So if there is something that simplifies the coding :-) I'll do it.

Currently the major bottleneck is actually implementing as many operators as needed, and being able to do that simply is a major help.

Is this the reason the dimensionality is restricted to 8D

Currently tensors are limited to 8 dimensions as a constant, and so far that has been enough.

artyom-beilis commented 7 months ago

And, more importantly, it allows people who aren't familiar with GPU coding to write some operations.

tangjinchuan commented 7 months ago

The independent kernels are mostly for computationally intensive / non-standard tasks like pooling, convolutions, etc.

There is another way, which uses a fixed-length kernel. All the broadcasting size info for the dimensions is stored in a host buffer and then uploaded to the device as global memory for the kernel, and the kernel is in charge of using that info to index the correct locations of arrays A and B. As a result, the kernel becomes synchronous, because a blocking memory write has to be issued to the command queue. This is fine for scientific computing tasks, but for AI, especially with multiple GPUs on one platform, I guess your asynchronous approach is more welcome.
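A tiny sketch of that descriptor idea (hypothetical layout, not dlprimitives code): pad every operand's shape and strides to a fixed MAX_DIMS and pack them into one int32 array that a single generic kernel would read from global memory.

```python
# Pack shape/strides into a fixed-length descriptor for a generic kernel.
import numpy as np

MAX_DIMS = 8

def pack_descriptor(shape, strides):
    shape, strides = list(shape), list(strides)
    pad = MAX_DIMS - len(shape)
    # prepend size-1 dims / zero strides so the indexing math stays uniform
    return np.array([pad * [1] + shape, pad * [0] + strides],
                    dtype=np.int32).ravel()

a = np.zeros((1, 3, 1, 10), dtype=np.float32)
print(pack_descriptor(a.shape, [s // a.itemsize for s in a.strides]))
```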

tangjinchuan commented 7 months ago

Intel Arc 770.txt as promised. This one really took time, since it can freeze while executing some of the cases.

artyom-beilis commented 7 months ago

Thanks. Impressive GPU.

Can you give clinfo for it?

I wonder if it is possible to enable Winograd on it; that depends on the shared memory configuration.

tangjinchuan commented 6 months ago

clinfo770.txt You are welcome.
Platform Name: Intel(R) OpenCL Graphics

tangjinchuan commented 6 months ago

In the meantime, there is a website https://compubench.com/device.jsp?benchmark=compu20d&os=Windows&api=cl&D=Intel%28R%29+Arc%28TM%29+A770+Graphics&testgroup=info which could provide more results for new devices.