artyom-beilis / pytorch_dlprim

DLPrimitives/OpenCL out of tree backend for pytorch
http://blog.dlprimitives.org/
MIT License

Support for MacOS Ventura #25

Open dbl001 opened 1 year ago

dbl001 commented 1 year ago

Will pytorch_dlprim run on macOS 13.2 Ventura? I have an AMD Radeon Pro 5700 XT GPU. Does the OpenCL backend support the torch.float64 and torch.cfloat data types?

% python collect_env.py 
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: macOS 13.2 (x86_64)
GCC version: Could not collect
Clang version: 14.0.0 (clang-1400.0.29.202)
CMake version: version 3.24.3
Libc version: N/A

Python version: 3.8.15 (default, Nov 10 2022, 13:17:42)  [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] numpydoc==1.5.0
[pip3] pytorch-lightning==1.6.3
[pip3] pytorch-transformers==1.1.0
[pip3] tntorch==1.0.1
[pip3] torch==1.13.0
[pip3] torchmetrics==0.8.2
[pip3] torchtext==0.14.0
[pip3] torchvision==0.14.0
[conda] mkl                       2022.2.1                 pypi_0    pypi
[conda] mkl-devel                 2022.2.1                 pypi_0    pypi
[conda] mkl-include               2022.2.1                 pypi_0    pypi
[conda] mkl-service               2.4.0            py38h4765b79_0    conda-forge
[conda] numpy                     1.22.4                   pypi_0    pypi
[conda] numpy-base                1.23.5           py38hc93c6d9_0  
[conda] numpydoc                  1.5.0            py38hecd8cb5_0  
[conda] pytorch                   1.4.0                   py3.8_0    pytorch
[conda] pytorch-lightning         1.6.3                    pypi_0    pypi
[conda] pytorch-transformers      1.1.0                    pypi_0    pypi
[conda] tntorch                   1.0.1                    pypi_0    pypi
[conda] torch                     1.13.0                   pypi_0    pypi
[conda] torchmetrics              0.8.2                    pypi_0    pypi
[conda] torchtext                 0.14.0                   pypi_0    pypi
[conda] torchvision               0.14.0                   pypi_0    pypi
(base) davidlaxer@x86_64-apple-darwin13 pytorch % 
artyom-beilis commented 1 year ago

I know somebody who ran dlprimitives on an M1; unless there are some bugs, it should run.

I suggest trying to build it. Start from dlprimitives to check that the backend works, and then build the pytorch backend. Or you can start directly from pytorch - it shouldn't be a big problem.

torch.float64, torch.cfloat data types?

No, only 32-bit float is supported at this point.
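
For illustration, a minimal sketch of working within the float32-only limitation (the library path and the privateuseone:1 device index are assumptions borrowed from later in this thread):

import torch

# Assumption: the backend has already been built; on macOS the library is a .dylib (see below).
torch.ops.load_library("build/libpt_ocl.dylib")

x64 = torch.randn(8, 8, dtype=torch.float64)      # float64 stays on the CPU
x = x64.to(torch.float32).to("privateuseone:1")   # cast to float32 before moving to the OpenCL device
y = (x @ x).cpu().double()                        # bring the result back and upcast if needed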

dbl001 commented 1 year ago

I installed OpenCL-CLHPP (https://github.com/KhronosGroup/OpenCL-CLHPP).

% make test
Running tests...
Test project /Users/davidlaxer/OpenCL-CLHPP/build
      Start  1: test_openclhpp_120
 1/45 Test  #1: test_openclhpp_120 ...............................................................   Passed    0.11 sec
      Start  2: test_openclhpp_120_CL_HPP_ENABLE_EXCEPTIONS
 2/45 Test  #2: test_openclhpp_120_CL_HPP_ENABLE_EXCEPTIONS ......................................   Passed    0.11 sec
      Start  3: test_openclhpp_120_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY
 3/45 Test  #3: test_openclhpp_120_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY ............................   Passed    0.07 sec
      Start  4: test_openclhpp_120_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY
 4/45 Test  #4: test_openclhpp_120_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY ...   Passed    0.07 sec
      Start  5: test_openclhpp_120_CL_HPP_CL_1_2_DEFAULT_BUILD
 5/45 Test  #5: test_openclhpp_120_CL_HPP_CL_1_2_DEFAULT_BUILD ...................................   Passed    0.07 sec
      Start  6: test_openclhpp_120_CL_HPP_USE_CL_DEVICE_FISSION
 6/45 Test  #6: test_openclhpp_120_CL_HPP_USE_CL_DEVICE_FISSION ..................................   Passed    0.07 sec
      Start  7: test_openclhpp_120_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR
 7/45 Test  #7: test_openclhpp_120_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR .........................   Passed    0.07 sec
      Start  8: test_openclhpp_120_CL_HPP_USE_CL_SUB_GROUPS_KHR
 8/45 Test  #8: test_openclhpp_120_CL_HPP_USE_CL_SUB_GROUPS_KHR ..................................   Passed    0.07 sec
      Start  9: test_openclhpp_120_CL_HPP_USE_IL_KHR
 9/45 Test  #9: test_openclhpp_120_CL_HPP_USE_IL_KHR .............................................   Passed    0.07 sec
      Start 10: test_openclhpp_200
10/45 Test #10: test_openclhpp_200 ...............................................................   Passed    0.11 sec
      Start 11: test_openclhpp_200_CL_HPP_ENABLE_EXCEPTIONS
11/45 Test #11: test_openclhpp_200_CL_HPP_ENABLE_EXCEPTIONS ......................................   Passed    0.11 sec
      Start 12: test_openclhpp_200_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY
12/45 Test #12: test_openclhpp_200_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY ............................   Passed    0.07 sec
      Start 13: test_openclhpp_200_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY
13/45 Test #13: test_openclhpp_200_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY ...   Passed    0.07 sec
      Start 14: test_openclhpp_200_CL_HPP_CL_1_2_DEFAULT_BUILD
14/45 Test #14: test_openclhpp_200_CL_HPP_CL_1_2_DEFAULT_BUILD ...................................   Passed    0.07 sec
      Start 15: test_openclhpp_200_CL_HPP_USE_CL_DEVICE_FISSION
15/45 Test #15: test_openclhpp_200_CL_HPP_USE_CL_DEVICE_FISSION ..................................   Passed    0.07 sec
      Start 16: test_openclhpp_200_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR
16/45 Test #16: test_openclhpp_200_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR .........................   Passed    0.07 sec
      Start 17: test_openclhpp_200_CL_HPP_USE_CL_SUB_GROUPS_KHR
17/45 Test #17: test_openclhpp_200_CL_HPP_USE_CL_SUB_GROUPS_KHR ..................................   Passed    0.07 sec
      Start 18: test_openclhpp_200_CL_HPP_USE_IL_KHR
18/45 Test #18: test_openclhpp_200_CL_HPP_USE_IL_KHR .............................................   Passed    0.07 sec
      Start 19: test_openclhpp_210
19/45 Test #19: test_openclhpp_210 ...............................................................   Passed    0.11 sec
      Start 20: test_openclhpp_210_CL_HPP_ENABLE_EXCEPTIONS
20/45 Test #20: test_openclhpp_210_CL_HPP_ENABLE_EXCEPTIONS ......................................   Passed    0.10 sec
      Start 21: test_openclhpp_210_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY
21/45 Test #21: test_openclhpp_210_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY ............................   Passed    0.07 sec
      Start 22: test_openclhpp_210_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY
22/45 Test #22: test_openclhpp_210_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY ...   Passed    0.07 sec
      Start 23: test_openclhpp_210_CL_HPP_CL_1_2_DEFAULT_BUILD
23/45 Test #23: test_openclhpp_210_CL_HPP_CL_1_2_DEFAULT_BUILD ...................................   Passed    0.07 sec
      Start 24: test_openclhpp_210_CL_HPP_USE_CL_DEVICE_FISSION
24/45 Test #24: test_openclhpp_210_CL_HPP_USE_CL_DEVICE_FISSION ..................................   Passed    0.07 sec
      Start 25: test_openclhpp_210_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR
25/45 Test #25: test_openclhpp_210_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR .........................   Passed    0.07 sec
      Start 26: test_openclhpp_210_CL_HPP_USE_CL_SUB_GROUPS_KHR
26/45 Test #26: test_openclhpp_210_CL_HPP_USE_CL_SUB_GROUPS_KHR ..................................   Passed    0.07 sec
      Start 27: test_openclhpp_210_CL_HPP_USE_IL_KHR
27/45 Test #27: test_openclhpp_210_CL_HPP_USE_IL_KHR .............................................   Passed    0.07 sec
      Start 28: test_openclhpp_220
28/45 Test #28: test_openclhpp_220 ...............................................................   Passed    0.12 sec
      Start 29: test_openclhpp_220_CL_HPP_ENABLE_EXCEPTIONS
29/45 Test #29: test_openclhpp_220_CL_HPP_ENABLE_EXCEPTIONS ......................................   Passed    0.10 sec
      Start 30: test_openclhpp_220_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY
30/45 Test #30: test_openclhpp_220_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY ............................   Passed    0.07 sec
      Start 31: test_openclhpp_220_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY
31/45 Test #31: test_openclhpp_220_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY ...   Passed    0.07 sec
      Start 32: test_openclhpp_220_CL_HPP_CL_1_2_DEFAULT_BUILD
32/45 Test #32: test_openclhpp_220_CL_HPP_CL_1_2_DEFAULT_BUILD ...................................   Passed    0.07 sec
      Start 33: test_openclhpp_220_CL_HPP_USE_CL_DEVICE_FISSION
33/45 Test #33: test_openclhpp_220_CL_HPP_USE_CL_DEVICE_FISSION ..................................   Passed    0.07 sec
      Start 34: test_openclhpp_220_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR
34/45 Test #34: test_openclhpp_220_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR .........................   Passed    0.07 sec
      Start 35: test_openclhpp_220_CL_HPP_USE_CL_SUB_GROUPS_KHR
35/45 Test #35: test_openclhpp_220_CL_HPP_USE_CL_SUB_GROUPS_KHR ..................................   Passed    0.07 sec
      Start 36: test_openclhpp_220_CL_HPP_USE_IL_KHR
36/45 Test #36: test_openclhpp_220_CL_HPP_USE_IL_KHR .............................................   Passed    0.07 sec
      Start 37: test_openclhpp_300
37/45 Test #37: test_openclhpp_300 ...............................................................   Passed    0.11 sec
      Start 38: test_openclhpp_300_CL_HPP_ENABLE_EXCEPTIONS
38/45 Test #38: test_openclhpp_300_CL_HPP_ENABLE_EXCEPTIONS ......................................   Passed    0.10 sec
      Start 39: test_openclhpp_300_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY
39/45 Test #39: test_openclhpp_300_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY ............................   Passed    0.07 sec
      Start 40: test_openclhpp_300_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY
40/45 Test #40: test_openclhpp_300_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY ...   Passed    0.07 sec
      Start 41: test_openclhpp_300_CL_HPP_CL_1_2_DEFAULT_BUILD
41/45 Test #41: test_openclhpp_300_CL_HPP_CL_1_2_DEFAULT_BUILD ...................................   Passed    0.07 sec
      Start 42: test_openclhpp_300_CL_HPP_USE_CL_DEVICE_FISSION
42/45 Test #42: test_openclhpp_300_CL_HPP_USE_CL_DEVICE_FISSION ..................................   Passed    0.07 sec
      Start 43: test_openclhpp_300_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR
43/45 Test #43: test_openclhpp_300_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR .........................   Passed    0.07 sec
      Start 44: test_openclhpp_300_CL_HPP_USE_CL_SUB_GROUPS_KHR
44/45 Test #44: test_openclhpp_300_CL_HPP_USE_CL_SUB_GROUPS_KHR ..................................   Passed    0.07 sec
      Start 45: test_openclhpp_300_CL_HPP_USE_IL_KHR
45/45 Test #45: test_openclhpp_300_CL_HPP_USE_IL_KHR .............................................   Passed    0.07 sec

100% tests passed, 0 tests failed out of 45

Total Test time (real) =   3.47 sec

Set:

$ export OCL_PATH=/Users/davidlaxer/OpenCL-CLHPP/include/CL
$  ls -l $OCL_PATH     
total 664
-rw-r--r--  1 davidlaxer  staff     786 Jan 24 12:17 cl2.hpp
-rw-r--r--  1 davidlaxer  staff  334369 Jan 24 12:17 opencl.hpp

 % env | grep OCL
OCL_PATH=/Users/davidlaxer/OpenCL-CLHPP/include/CL

But I'm getting this build error:

(AI-Feynman) davidlaxer@x86_64-apple-darwin13 build % cmake -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DINCLUDE_DIRS="/Users/davidlaxer/OpenCL-CLHPP/include/CL" -DCMAKE_PREFIX_PATH=/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/share/cmake/Torch ..

-- Caffe2: Found protobuf with new-style protobuf targets.
-- Caffe2: Protobuf version 3.20.1
-- MKL_ARCH: intel64
-- MKL_ROOT /opt/intel/oneapi/mkl/2021.3.0
-- MKL_LINK: dynamic
-- MKL_INTERFACE_FULL: intel_ilp64
-- MKL_THREADING: intel_thread
-- MKL_MPI: mpich
CMake Warning at /Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
  static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
  /Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
  CMakeLists.txt:4 (find_package)

=== Status ===
  OpenCL: include OCL_PATH-NOTFOUND
          lib     /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.0.sdk/System/Library/Frameworks/OpenCL.framework
  Python: /Users/davidlaxer/anaconda3/envs/AI-Feynman/bin/python3
  BLAS: None
  HDF5: None
  Sqlite3: include /Users/davidlaxer/anaconda3/envs/AI-Feynman/include
           lib /Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/libsqlite3.dylib
  Protobuf (onnx): disabled
  Python dlprim: disabled
-- Configuring done
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
OCL_PATH
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives
   used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives

CMake Error in CMakeLists.txt:
  Found relative path while evaluating include directories of "pt_ocl":

    "OCL_PATH-NOTFOUND"

CMake Error in CMakeLists.txt:
  Found relative path while evaluating include directories of "pt_ocl":

    "OCL_PATH-NOTFOUND"

CMake Error in dlprimitives/CMakeLists.txt:
  Found relative path while evaluating include directories of "dlprim_core":

    "OCL_PATH-NOTFOUND"

CMake Error in dlprimitives/CMakeLists.txt:
  Found relative path while evaluating include directories of "dlprim_core":

    "OCL_PATH-NOTFOUND"

-- Generating done
CMake Generate step failed.  Build files cannot be regenerated correctly.

What step am I missing?

dbl001 commented 1 year ago

I 'hacked' /Users/davidlaxer/pytorch_dlprim/dlprimitives/include/dlprim/opencl_include.hpp:

#  ifdef __APPLE__
//#    include <OpenCL/cl2.hpp>
#    include <CL/cl2.hpp>

Ran:

% cmake -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DOCL_PATH="/Users/davidlaxer/OpenCL-CLHPP/include/" -DCMAKE_PREFIX_PATH=/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/share/cmake/Torch ..
$ make
$ make install

It built and installed without errors. When I tried to test:

 % python mnist.py --device ocl:0
Traceback (most recent call last):
  File "/Users/davidlaxer/pytorch_dlprim/mnist.py", line 162, in <module>
    main()
  File "/Users/davidlaxer/pytorch_dlprim/mnist.py", line 121, in main
    torch.ops.load_library("build/libpt_ocl.so")
  File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/_ops.py", line 640, in load_library
    ctypes.CDLL(path)
  File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/davidlaxer/pytorch_dlprim/build/libpt_ocl.so, 0x0006): tried: '/Users/davidlaxer/pytorch_dlprim/build/libpt_ocl.so' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/davidlaxer/pytorch_dlprim/build/libpt_ocl.so' (no such file), '/Users/davidlaxer/pytorch_dlprim/build/libpt_ocl.so' (no such file)
(AI-Feynman) davidlaxer@x86_64-apple-darwin13 pytorch_dlprim % find . -name libpt_ocl.so  -ls
artyom-beilis commented 1 year ago
  torch.ops.load_library("build/libpt_ocl.so")

Change it to build/libpt_ocl.dylib - on macOS, shared objects are called .dylib.
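
For instance, a small hedged sketch of picking the suffix per platform (the relative build/ path follows mnist.py and is an assumption about the working directory):

import sys
import torch

# macOS builds shared libraries as .dylib, Linux as .so
suffix = ".dylib" if sys.platform == "darwin" else ".so"
torch.ops.load_library(f"build/libpt_ocl{suffix}")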

dbl001 commented 1 year ago
% python mnist.py --device ocl:0
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|████████████████████████████| 9912422/9912422 [00:02<00:00, 3750281.89it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
100%|█████████████████████████████████| 28881/28881 [00:00<00:00, 268680.53it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|████████████████████████████| 1648877/1648877 [00:00<00:00, 2330785.83it/s]
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|█████████████████████████████████| 4542/4542 [00:00<00:00, 12141828.41it/s]
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw

Using device: ocl:0
Accessing device #0:Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz on Apple
Traceback (most recent call last):
  File "/Users/davidlaxer/pytorch_dlprim/mnist.py", line 162, in <module>
    main()
  File "/Users/davidlaxer/pytorch_dlprim/mnist.py", line 153, in main
    train(args, model, device, train_loader, optimizer, epoch)
  File "/Users/davidlaxer/pytorch_dlprim/mnist.py", line 53, in train
    output = model(data)
  File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1488, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/davidlaxer/pytorch_dlprim/mnist.py", line 29, in forward
    x = self.conv1(x)
  File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1488, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: clEnqueueNDRangeKernel

The 'mnist.py' test program completes successfully with --device 'mps' and with --device 'cpu'.

% python mnist.py --device mps
Using device: mps
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.326377

...
Train Epoch: 5 [59520/60000 (99%)]  Loss: 0.000508

Epoch in  25.6s

Test set: Average loss: 0.0289, Accuracy: 9911/10000 (99%)

Done
artyom-beilis commented 1 year ago

Looks like the first device is actually the CPU. You should have another platform/device for the GPU, assuming that OpenCL drivers are installed for the GPU.

What is the output of clinfo -l or clinfo?

dbl001 commented 1 year ago

Correct!

% python mnist.py --device ocl:1
Using device: ocl:1
Accessing device #1:AMD Radeon Pro 5700 XT Compute Engine on Apple
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.326377
Train Epoch: 1 [640/60000 (1%)] Loss: 1.373414
Train Epoch: 1 [1280/60000 (2%)]    Loss: 0.674242
Train Epoch: 1 [1920/60000 (3%)]    Loss: 0.342660
...
Train Epoch: 5 [58240/60000 (97%)]  Loss: 0.005476
Train Epoch: 5 [58880/60000 (98%)]  Loss: 0.002447
Train Epoch: 5 [59520/60000 (99%)]  Loss: 0.000584
Epoch in   9.4s

Test set: Average loss: 0.0287, Accuracy: 9900/10000 (99%)

Done

It's faster than 'mps'.

artyom-beilis commented 1 year ago

Great!

What is the 'mps' device?

I 'hacked' /Users/davidlaxer/pytorch_dlprim/dlprimitives/include/dlprim/opencl_include.hpp:

#  ifdef __APPLE__
//#    include <OpenCL/cl2.hpp>
#    include <CL/cl2.hpp>

Ohhh... that is interesting. Probably I'll need to add a special case for header detection to handle both OpenCL/cl2.hpp and CL/cl2.hpp.

Thanks.

Can you build dlprimitives on its own (outside of pytorch) and run some tests to see that it works properly?

(Note: you'll probably need to set a cmake parameter, something like TEST_DEV=1:0 for platform 1, device 0, according to clinfo -l.)

dbl001 commented 1 year ago

'MPS' is Apple's Metal Performance Shaders framework. PyTorch now supports 'mps' as a backend device, alongside CUDA, CPU, etc.

https://pytorch.org/docs/stable/notes/mps.html
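
As a quick illustration (a sketch using the standard PyTorch API, not anything specific to this repository):

import torch

# torch.backends.mps.is_available() reports whether the Metal backend can be used
if torch.backends.mps.is_available():
    x = torch.randn(4, 4, device="mps")
    print((x @ x).sum().item())
else:
    print("MPS backend not available on this machine")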

How do I run test.py? In test.py, when I set device='opencl:1', I get this exception:

% python test.py
Traceback (most recent call last):
  File "/Users/davidlaxer/pytorch_dlprim/test.py", line 32, in <module>
    grid_dev = grid_src.detach().clone().to(dev)
RuntimeError: 0 INTERNAL ASSERT FAILED at "/Users/davidlaxer/pytorch/c10/core/TensorOptions.h":659, please report a bug to PyTorch. This is a grandfathered Caffe2 device type opencl, it shouldn't ever convert to a DispatchKey.  File a bug describing what you were doing if you think this is in error.

From IPython:

% ipython
Python 3.10.9 (main, Jan 11 2023, 09:18:20) [Clang 14.0.6 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.7.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: import numpy as np

In [3]: probs = torch.tensor(np.loadtxt("/Users/davidlaxer/minGPT/probs0.txt"),
   ...: dtype=torch.float32, device='opencl:1')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 1
----> 1 probs = torch.tensor(np.loadtxt("/Users/davidlaxer/minGPT/probs0.txt"), dtype=torch.float32, device='opencl:1')

RuntimeError: 0 INTERNAL ASSERT FAILED at "/Users/davidlaxer/pytorch/c10/core/TensorOptions.h":659, please report a bug to PyTorch. This is a grandfathered Caffe2 device type opencl, it shouldn't ever convert to a DispatchKey.  File a bug describing what you were doing if you think this is in error.
dbl001 commented 1 year ago
% ./clinfo
Number of platforms                               1
  Platform Name                                   Apple
  Platform Vendor                                 Apple
  Platform Version                                OpenCL 1.2 (Dec 16 2022 20:35:20)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event

  Platform Name                                   Apple
Number of devices                                 2
  Device Name                                     Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
  Device Vendor                                   Intel
  Device Vendor ID                                0xffffffff
  Device Version                                  OpenCL 1.2 
  Driver Version                                  1.1
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     CPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               16
  Max clock frequency                             3800MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1x1
  Max work group size                             1024
  Preferred work group size multiple (kernel)     1
  Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 2 / 2       
    half                                                 0 / 0        (n/a)
    float                                                4 / 4       
    double                                               2 / 2        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              137438953472 (128GiB)
  Error Correction support                        No
  Max memory allocation                           34359738368 (32GiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        64
  Global Memory cache line size                   16777216 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   1 bytes
    Pitch alignment for 2D image buffers          1 pixels
    Max 2D image size                             8192x8192 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max number of constant args                     8
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4096 (4KiB)
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_APPLE_fp64_basic_ops cl_APPLE_fixed_alpha_channel_orders cl_APPLE_biased_fixed_point_image_formats cl_APPLE_command_queue_priority

  Device Name                                     AMD Radeon Pro 5700 XT Compute Engine
  Device Vendor                                   AMD
  Device Vendor ID                                0x1021e00
  Device Version                                  OpenCL 1.2 
  Driver Version                                  1.2 (Jan  6 2023 19:45:55)
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               40
  Max clock frequency                             1499MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple (kernel)     32
  Preferred / native vector sizes                 
    char                                                 4 / 4       
    short                                                2 / 2       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    32, Little-Endian
  Global memory size                              17163091968 (15.98GiB)
  Error Correction support                        No
  Max memory allocation                           4290772992 (3.996GiB)
  Unified memory for Host and Device              No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       32768 bits (4096 bytes)
  Global Memory cache type                        None
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Max number of constant args                     8
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      10ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  printf() buffer size                            134217728 (128MiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_APPLE_command_queue_priority cl_APPLE_command_queue_select_compute_units cl_khr_fp64

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Apple
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [P0]
  clCreateContext(NULL, ...) [default]            Success [P0]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Apple
    Device Name                                   AMD Radeon Pro 5700 XT Compute Engine
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  Success (1)
    Platform Name                                 Apple
    Device Name                                   Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Apple
    Device Name                                   AMD Radeon Pro 5700 XT Compute Engine
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (2)
    Platform Name                                 Apple
    Device Name                                   AMD Radeon Pro 5700 XT Compute Engine
    Device Name                                   Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
(base) davidlaxer@x86_64-apple-darwin13 clinfo % 
artyom-beilis commented 1 year ago

'MPS' is Apple's Metal Performance Shaders framework. PyTorch now supports 'mps' as a backend device, alongside CUDA, CPU, etc.

Interesting. Can you run some benchmarks of the opencl vs. metal backends? Here are some examples:

python dlprimitives/tools/validate_network.py --device privateuseone:0 --benchmark --train --model resnet50 --batch 32
python dlprimitives/tools/validate_network.py --device privateuseone:0 --benchmark --model resnet50 --batch 32

Please check these variants:

--model resnet18 --batch 64
--model mobilenet_v2 --batch 64
--model alexnet --batch 64

And of course run mps for comparison - it would be highly interesting to see how my results compare to Apple's Metal results.

In test.py, when I set device='opencl:1',

Because opencl support was once planned and opencl is a reserved device type - but it was never realised. For an out-of-tree backend I can use the privateuseone device, which I can rename to 'ocl' as well. But opencl itself is reserved.
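
To make that concrete, a hedged sketch of the device strings involved (device index 1 is assumed to be the GPU, per the clinfo output above):

import torch

torch.ops.load_library("build/libpt_ocl.dylib")

# 'opencl' is a reserved legacy Caffe2 device type and raises the INTERNAL ASSERT shown above:
# t = torch.zeros(2, 2, device="opencl:1")

# the out-of-tree backend registers itself as privateuseone, so this works:
t = torch.zeros(2, 2, device="privateuseone:1")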

dbl001 commented 1 year ago
% python dlprimitives/tools/validate_network.py --device privateuseone:1 --benchmark --train --model resnet50 --batch 32

/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Accessing device #1:AMD Radeon Pro 5700 XT Compute Engine on Apple
Warming up
Step -5 17386.135ms  warming up
Step -4 969.884ms  
Step -3 980.468ms  
Step -2 982.220ms  
Step -1 975.265ms  
Step  0 975.861ms  started
Step  1 980.682ms  
Step  2 983.092ms  
Step  3 981.701ms  
Step  4 982.421ms  
Step  5 979.700ms  
Step  6 974.858ms  
Step  7 979.512ms  
Step  8 976.215ms  
Step  9 975.671ms  
Step 10 975.299ms  
Step 11 977.323ms  
Step 12 977.699ms  
Step 13 972.227ms  
Step 14 975.673ms  
Step 15 984.858ms  
Step 16 980.309ms  
Step 17 980.738ms  
Step 18 976.323ms  
Step 19 976.702ms  
Time per item  30.573 ms
Time fwd batch  213.335 ms
Time bwd batch  765.008 ms
Time io  batch  3.127 ms
Time zro batch  0.000 ms
Time opt batch  0.000 ms
Time per batch 978.343 ms

% python dlprimitives/tools/validate_network.py --device privateuseone:1 --benchmark --model resnet50 --batch 32

/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Accessing device #1:AMD Radeon Pro 5700 XT Compute Engine on Apple
Warming up
Step -5 192.147ms  warming up
Step -4 184.328ms  
Step -3 187.661ms  
Step -2 186.471ms  
Step -1 186.116ms  
Step  0 186.624ms  started
Step  1 186.961ms  
Step  2 185.654ms  
Step  3 186.042ms  
Step  4 186.223ms  
Step  5 186.441ms  
Step  6 185.965ms  
Step  7 186.380ms  
Step  8 184.538ms  
Step  9 186.350ms  
Step 10 186.702ms  
Step 11 186.063ms  
Step 12 185.345ms  
Step 13 184.857ms  
Step 14 185.585ms  
Step 15 186.010ms  
Step 16 185.542ms  
Step 17 186.767ms  
Step 18 186.188ms  
Step 19 186.086ms  
Time per item  5.813 ms
Time per batch 186.016 ms

% python dlprimitives/tools/validate_network.py --device mps  --benchmark --train --model resnet50 --batch 32 

/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Warming up
Step -5 10277.323ms  warming up
Step -4 401.347ms  
Step -3 522.126ms  
Step -2 661.028ms  
Step -1 659.682ms  
Step  0 656.417ms  started
Step  1 656.925ms  
Step  2 656.745ms  
Step  3 654.513ms  
Step  4 657.099ms  
Step  5 657.073ms  
Step  6 655.817ms  
Step  7 657.810ms  
Step  8 653.340ms  
Step  9 654.563ms  
Step 10 660.629ms  
Step 11 655.051ms  
Step 12 660.242ms  
Step 13 654.396ms  
Step 14 658.860ms  
Step 15 656.098ms  
Step 16 654.666ms  
Step 17 653.854ms  
Step 18 657.085ms  
Step 19 655.118ms  
Time per item  20.510 ms
Time fwd batch  570.076 ms
Time bwd batch  86.239 ms
Time io  batch  515.912 ms
Time zro batch  0.000 ms
Time opt batch  0.000 ms
Time per batch 656.315 ms

 % python dlprimitives/tools/validate_network.py --device mps --benchmark --model resnet50 --batch 32

/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Warming up
Step -5 882.587ms  warming up
Step -4 86.313ms  
Step -3 83.462ms  
Step -2 83.408ms  
Step -1 82.453ms  
Step  0 82.748ms  started
Step  1 82.843ms  
Step  2 82.410ms  
Step  3 82.929ms  
Step  4 82.597ms  
Step  5 82.722ms  
Step  6 82.410ms  
Step  7 82.069ms  
Step  8 83.158ms  
Step  9 81.898ms  
Step 10 83.134ms  
Step 11 83.021ms  
Step 12 83.030ms  
Step 13 82.746ms  
Step 14 82.470ms  
Step 15 82.466ms  
Step 16 82.949ms  
Step 17 83.567ms  
Step 18 83.144ms  
Step 19 83.164ms  
Time per item  2.587 ms
Time per batch 82.774 ms

% python dlprimitives/tools/validate_network.py --device privateuseone:1 --benchmark --train --model alexnet --batch 64
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /Users/davidlaxer/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
100%|████████████████████████████████████████| 233M/233M [01:01<00:00, 4.00MB/s]
Accessing device #1:AMD Radeon Pro 5700 XT Compute Engine on Apple
Warming up
Step -5 5249.255ms  warming up
Step -4 214.265ms  
Step -3 212.626ms  
Step -2 212.994ms  
Step -1 212.822ms  
Step  0 213.492ms  started
Step  1 214.159ms  
Step  2 213.421ms  
Step  3 212.886ms  
Step  4 212.985ms  
Step  5 213.019ms  
Step  6 214.607ms  
Step  7 213.117ms  
Step  8 212.957ms  
Step  9 213.335ms  
Step 10 213.071ms  
Step 11 213.128ms  
Step 12 212.807ms  
Step 13 212.353ms  
Step 14 212.722ms  
Step 15 213.727ms  
Step 16 213.121ms  
Step 17 212.917ms  
Step 18 212.744ms  
Step 19 213.406ms  
Time per item  3.331 ms
Time fwd batch  53.570 ms
Time bwd batch  159.629 ms
Time io  batch  4.734 ms
Time zro batch  0.000 ms
Time opt batch  0.000 ms
Time per batch 213.199 ms
(AI-Feynman) davidlaxer@x86_64-apple-darwin13 pytorch_dlprim % python dlprimitives/tools/validate_network.py --device mps --benchmark --train --model alexnet --batch 64   
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Warming up
Step -5 1817.901ms  warming up
Step -4 159.780ms  
Step -3 130.040ms  
Step -2 170.132ms  
Step -1 172.625ms  
Step  0 173.081ms  started
Step  1 169.357ms  
Step  2 172.887ms  
Step  3 173.597ms  
Step  4 172.193ms  
Step  5 172.748ms  
Step  6 172.532ms  
Step  7 171.034ms  
Step  8 173.197ms  
Step  9 174.854ms  
Step 10 169.927ms  
Step 11 172.000ms  
Step 12 171.020ms  
Step 13 173.021ms  
Step 14 173.242ms  
Step 15 175.806ms  
Step 16 172.851ms  
Step 17 170.872ms  
Step 18 172.756ms  
Step 19 170.169ms  
Time per item  2.693 ms
Time fwd batch  161.963 ms
Time bwd batch  10.394 ms
Time io  batch  155.241 ms
Time zro batch  0.000 ms
Time opt batch  0.000 ms
Time per batch 172.357 ms

% python dlprimitives/tools/validate_network.py --device privateuseone:1 --benchmark --train --model mobilenet_v2 --batch 64
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MobileNet_V2_Weights.IMAGENET1K_V1`. You can also use `weights=MobileNet_V2_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /Users/davidlaxer/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
100%|██████████████████████████████████████| 13.6M/13.6M [00:03<00:00, 4.05MB/s]
Accessing device #1:AMD Radeon Pro 5700 XT Compute Engine on Apple
Warming up
Step -5 13280.064ms  warming up
Step -4 752.361ms  
Step -3 739.524ms  
Step -2 745.049ms  
Step -1 745.150ms  
Step  0 744.964ms  started
Step  1 744.489ms  
Step  2 742.961ms  
Step  3 746.302ms  
Step  4 743.279ms  
Step  5 737.969ms  
Step  6 744.280ms  
Step  7 742.577ms  
Step  8 741.941ms  
Step  9 746.158ms  
Step 10 744.912ms  
Step 11 740.043ms  
Step 12 737.633ms  
Step 13 741.226ms  
Step 14 741.452ms  
Step 15 741.393ms  
Step 16 739.347ms  
Step 17 740.438ms  
Step 18 741.509ms  
Step 19 740.733ms  
Time per item  11.597 ms
Time fwd batch  161.055 ms
Time bwd batch  581.125 ms
Time io  batch  5.534 ms
Time zro batch  0.000 ms
Time opt batch  0.000 ms
Time per batch 742.180 ms
(AI-Feynman) davidlaxer@x86_64-apple-darwin13 pytorch_dlprim % python dlprimitives/tools/validate_network.py --device mps --benchmark --train --model mobilenet_v2 --batch 64
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MobileNet_V2_Weights.IMAGENET1K_V1`. You can also use `weights=MobileNet_V2_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Warming up
Step -5 9177.353ms  warming up
Step -4 911.196ms  
Step -3 2079.731ms  
Step -2 2267.814ms  
Step -1 2244.676ms  
Step  0 2275.510ms  started
Step  1 2261.282ms  
Step  2 2257.297ms  
Step  3 2263.499ms  
Step  4 2258.673ms  
Step  5 2266.446ms  
Step  6 2280.577ms  
Step  7 2253.364ms  
Step  8 2246.077ms  
Step  9 2288.277ms  
Step 10 2287.250ms  
Step 11 2255.581ms  
Step 12 2267.239ms  
Step 13 2268.039ms  
Step 14 2283.474ms  
Step 15 2269.081ms  
Step 16 2266.289ms  
Step 17 2273.217ms  
Step 18 2283.813ms  
Step 19 2276.918ms  
Time per item  35.455 ms
Time fwd batch  2192.135 ms
Time bwd batch  76.960 ms
Time io  batch  2137.610 ms
Time zro batch  0.000 ms
Time opt batch  0.000 ms
Time per batch 2269.095 ms

% python dlprimitives/tools/validate_network.py --device privateuseone:1 --benchmark --train --model resnet18 --batch 64           

/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /Users/davidlaxer/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████████████████████████████████| 44.7M/44.7M [00:11<00:00, 4.00MB/s]
Accessing device #1:AMD Radeon Pro 5700 XT Compute Engine on Apple
Warming up
Step -5 5812.478ms  warming up
Step -4 862.062ms  
Step -3 863.053ms  
Step -2 862.100ms  
Step -1 861.751ms  
Step  0 859.801ms  started
Step  1 861.398ms  
Step  2 861.833ms  
Step  3 858.887ms  
Step  4 861.570ms  
Step  5 860.864ms  
Step  6 861.934ms  
Step  7 862.494ms  
Step  8 864.035ms  
Step  9 858.804ms  
Step 10 856.521ms  
Step 11 856.596ms  
Step 12 863.216ms  
Step 13 861.385ms  
Step 14 861.439ms  
Step 15 859.234ms  
Step 16 860.556ms  
Step 17 861.595ms  
Step 18 863.927ms  
Step 19 860.451ms  
Time per item  13.450 ms
Time fwd batch  166.798 ms
Time bwd batch  694.029 ms
Time io  batch  5.395 ms
Time zro batch  0.000 ms
Time opt batch  0.000 ms
Time per batch 860.827 ms
(AI-Feynman) davidlaxer@x86_64-apple-darwin13 pytorch_dlprim % python dlprimitives/tools/validate_network.py --device mps --benchmark --train --model resnet18 --batch 64           

/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Warming up
Step -5 2292.475ms  warming up
Step -4 407.617ms  
Step -3 539.161ms  
Step -2 589.164ms  
Step -1 591.197ms  
Step  0 588.727ms  started
Step  1 590.627ms  
Step  2 598.048ms  
Step  3 590.203ms  
Step  4 587.857ms  
Step  5 590.700ms  
Step  6 589.184ms  
Step  7 589.716ms  
Step  8 597.186ms  
Step  9 590.756ms  
Step 10 595.828ms  
Step 11 598.534ms  
Step 12 593.455ms  
Step 13 596.133ms  
Step 14 588.126ms  
Step 15 595.137ms  
Step 16 591.288ms  
Step 17 590.055ms  
Step 18 592.019ms  
Step 19 582.737ms  
Time per item  9.247 ms
Time fwd batch  560.283 ms
Time bwd batch  31.532 ms
Time io  batch  537.801 ms
Time zro batch  0.000 ms
Time opt batch  0.000 ms
Time per batch 591.816 ms
dbl001 commented 1 year ago

There were two errors building dlprimitives, in:

vi +112 /Users/davidlaxer/pytorch_dlprim/dlprimitives/src/importers/onnx.cpp

 vi +50 /Users/davidlaxer/anaconda3/envs/AI-Feynman/include/boost/python/object/make_instance.hpp

Here are the results of running make test:

Running tests...
/opt/local/bin/ctest --force-new-ctest-process 
Test project /Users/davidlaxer/pytorch_dlprim/dlprimitives/build
      Start  1: test_test_case_abs
 1/33 Test  #1: test_test_case_abs ...............   Passed    1.12 sec
      Start  2: test_test_case_activation
 2/33 Test  #2: test_test_case_activation ........***Failed    0.09 sec
      Start  3: test_test_case_batchnorm
 3/33 Test  #3: test_test_case_batchnorm .........***Failed    0.42 sec
      Start  4: test_test_case_concat
 4/33 Test  #4: test_test_case_concat ............   Passed    0.09 sec
      Start  5: test_test_case_conv2d
 5/33 Test  #5: test_test_case_conv2d ............***Failed    0.77 sec
      Start  6: test_test_case_conv2d_dsc
 6/33 Test  #6: test_test_case_conv2d_dsc ........***Failed    0.41 sec
      Start  7: test_test_case_conv2d_gemm
 7/33 Test  #7: test_test_case_conv2d_gemm .......***Failed    0.42 sec
      Start  8: test_test_case_conv2d_win
 8/33 Test  #8: test_test_case_conv2d_win ........***Failed    0.42 sec
      Start  9: test_test_case_elementwise
 9/33 Test  #9: test_test_case_elementwise .......***Failed    0.53 sec
      Start 10: test_test_case_global_pooling
10/33 Test #10: test_test_case_global_pooling ....***Failed    0.12 sec
      Start 11: test_test_case_hardtanh
11/33 Test #11: test_test_case_hardtanh ..........   Passed    0.63 sec
      Start 12: test_test_case_inner_product
12/33 Test #12: test_test_case_inner_product .....***Failed    0.20 sec
      Start 13: test_test_case_log_softmax
13/33 Test #13: test_test_case_log_softmax .......***Failed    0.09 sec
      Start 14: test_test_case_mse_loss
14/33 Test #14: test_test_case_mse_loss ..........   Passed    0.27 sec
      Start 15: test_test_case_nll_loss
15/33 Test #15: test_test_case_nll_loss ..........***Failed    0.24 sec
      Start 16: test_test_case_param
16/33 Test #16: test_test_case_param .............   Passed    0.11 sec
      Start 17: test_test_case_pooling2d
17/33 Test #17: test_test_case_pooling2d .........***Failed    0.08 sec
      Start 18: test_test_case_reduction
18/33 Test #18: test_test_case_reduction .........***Failed    0.18 sec
      Start 19: test_test_case_slice
19/33 Test #19: test_test_case_slice .............***Failed    0.05 sec
      Start 20: test_test_case_softmax
20/33 Test #20: test_test_case_softmax ...........***Failed    0.09 sec
      Start 21: test_test_case_softmax_loss
21/33 Test #21: test_test_case_softmax_loss ......***Failed    0.15 sec
      Start 22: test_test_case_threshold
22/33 Test #22: test_test_case_threshold .........   Passed    0.56 sec
      Start 23: test_test_case_tr_conv2d
23/33 Test #23: test_test_case_tr_conv2d .........***Failed    0.70 sec
      Start 24: test_test_case_tr_conv2d_dsc
24/33 Test #24: test_test_case_tr_conv2d_dsc .....***Failed    0.66 sec
      Start 25: test_test_case_tr_conv2d_gemm
25/33 Test #25: test_test_case_tr_conv2d_gemm ....***Failed    0.65 sec
      Start 26: test_test_case_tr_conv2d_win
26/33 Test #26: test_test_case_tr_conv2d_win .....***Failed    0.66 sec
      Start 27: test_net
27/33 Test #27: test_net .........................Subprocess aborted***Exception:   0.28 sec
      Start 28: test_net_nonopt
28/33 Test #28: test_net_nonopt ..................Subprocess aborted***Exception:   0.05 sec
      Start 29: test_json
29/33 Test #29: test_json ........................   Passed    0.13 sec
      Start 30: test_random
30/33 Test #30: test_random ......................   Passed    0.30 sec
      Start 31: test_context
31/33 Test #31: test_context .....................   Passed    0.18 sec
      Start 32: test_util
32/33 Test #32: test_util ........................***Failed    0.30 sec
      Start 33: test_broadcast_reduce
33/33 Test #33: test_broadcast_reduce ............***Failed    0.17 sec

27% tests passed, 24 tests failed out of 33

Total Test time (real) =  11.08 sec

The following tests FAILED:
      2 - test_test_case_activation (Failed)
      3 - test_test_case_batchnorm (Failed)
      5 - test_test_case_conv2d (Failed)
      6 - test_test_case_conv2d_dsc (Failed)
      7 - test_test_case_conv2d_gemm (Failed)
      8 - test_test_case_conv2d_win (Failed)
      9 - test_test_case_elementwise (Failed)
     10 - test_test_case_global_pooling (Failed)
     12 - test_test_case_inner_product (Failed)
     13 - test_test_case_log_softmax (Failed)
     15 - test_test_case_nll_loss (Failed)
     17 - test_test_case_pooling2d (Failed)
     18 - test_test_case_reduction (Failed)
     19 - test_test_case_slice (Failed)
     20 - test_test_case_softmax (Failed)
     21 - test_test_case_softmax_loss (Failed)
     23 - test_test_case_tr_conv2d (Failed)
     24 - test_test_case_tr_conv2d_dsc (Failed)
     25 - test_test_case_tr_conv2d_gemm (Failed)
     26 - test_test_case_tr_conv2d_win (Failed)
     27 - test_net (Subprocess aborted)
     28 - test_net_nonopt (Subprocess aborted)
     32 - test_util (Failed)
     33 - test_broadcast_reduce (Failed)
Errors while running CTest
Output from these tests are in: /Users/davidlaxer/pytorch_dlprim/dlprimitives/build/Testing/Temporary/LastTest.log
[LastTest.log](https://github.com/artyom-beilis/pytorch_dlprim/files/10512218/LastTest.log)
dbl001 commented 1 year ago

LastTest.log.gz

dbl001 commented 1 year ago
% dlprim_flops 0:1 0.5
Testing on AMD Radeon Pro 5700 XT Compute Engine on Apple
Testing memory speed
- Vector size 1
-- Warming 
-- Running   28.4175 GB/s
- Vector size 2
-- Warming 
-- Running   46.5139 GB/s
- Vector size 4
-- Warming 
-- Running   202.604 GB/s
- Vector size 8
-- Warming 
-- Running   207.637 GB/s
- Vector size 16
-- Warming 
-- Running   192.703 GB/s
Testing flops float
- Vector size 1
-- Warming 
-- Running   3318.61 GFlops
- Vector size 2
-- Warming 
-- Running   3315.52 GFlops
- Vector size 4
-- Warming 
-- Running   3244.98 GFlops
- Vector size 8
-- Warming 
-- Running   2923.09 GFlops
- Vector size 16
-- Warming 
-- Running   2579.09 GFlops
Summray for AMD Radeon Pro 5700 XT Compute Engine on Apple
Peak GFlops for float 3318.61
Peak memory 207.637 GB/s
GEMM
  NN  0:  512,  512,  512      762.1 GFlops (22.97%)      8.9 GB/s ( 4.64%) limited by gflops 22.97%
  NN  1: 1024, 1024, 1024     2067.6 GFlops (62.30%)     12.1 GB/s ( 6.29%) limited by gflops 62.30%
  NN  2: 1025, 1025, 1025     1588.5 GFlops (47.87%)      9.3 GB/s ( 4.83%) limited by gflops 47.87%
  NN  3: 2048, 2048, 2048     2503.3 GFlops (75.43%)      7.3 GB/s ( 3.81%) limited by gflops 75.43%
  NN  4: 2049, 2049, 2049     2505.1 GFlops (75.49%)      7.3 GB/s ( 3.81%) limited by gflops 75.49%
  NN  5:   64, 2048,   64      422.1 GFlops (12.72%)     27.0 GB/s (14.01%) limited by memory 14.01%
  NN  6: 2048,   64, 2048     1406.6 GFlops (42.39%)     46.7 GB/s (24.24%) limited by gflops 42.39%
  NN  7: 2048, 2048,   64     1159.8 GFlops (34.95%)     38.8 GB/s (20.14%) limited by gflops 34.95%
  NN  8: 2048,   64,   64      404.9 GFlops (12.20%)     25.9 GB/s (13.44%) limited by memory 13.44%
  NN  9:   64, 2048, 2048     1786.9 GFlops (53.84%)     59.3 GB/s (30.80%) limited by gflops 53.84%
  NN 10:   64,   64, 2048       91.1 GFlops ( 2.75%)      5.8 GB/s ( 3.00%) limited by memory  3.00%
  NT  0:  512,  512,  512      712.3 GFlops (21.46%)      8.4 GB/s ( 4.34%) limited by gflops 21.46%
  NT  1: 1024, 1024, 1024     1767.3 GFlops (53.25%)     10.4 GB/s ( 5.38%) limited by gflops 53.25%
  NT  2: 1025, 1025, 1025     1589.2 GFlops (47.89%)      9.3 GB/s ( 4.83%) limited by gflops 47.89%
  NT  3: 2048, 2048, 2048     2214.6 GFlops (66.73%)      6.5 GB/s ( 3.37%) limited by gflops 66.73%
  NT  4: 2049, 2049, 2049     2524.4 GFlops (76.07%)      7.4 GB/s ( 3.84%) limited by gflops 76.07%
  NT  5:   64, 2048,   64      452.6 GFlops (13.64%)     29.0 GB/s (15.03%) limited by memory 15.03%
  NT  6: 2048,   64, 2048     1200.0 GFlops (36.16%)     39.9 GB/s (20.68%) limited by gflops 36.16%
  NT  7: 2048, 2048,   64     1136.3 GFlops (34.24%)     38.0 GB/s (19.73%) limited by gflops 34.24%
  NT  8: 2048,   64,   64      439.5 GFlops (13.24%)     28.1 GB/s (14.59%) limited by memory 14.59%
  NT  9:   64, 2048, 2048     1463.2 GFlops (44.09%)     48.6 GB/s (25.22%) limited by gflops 44.09%
  NT 10:   64,   64, 2048       80.0 GFlops ( 2.41%)      5.1 GB/s ( 2.64%) limited by memory  2.64%
  TN  0:  512,  512,  512      877.7 GFlops (26.45%)     10.3 GB/s ( 5.34%) limited by gflops 26.45%
  TN  1: 1024, 1024, 1024     2222.2 GFlops (66.96%)     13.0 GB/s ( 6.76%) limited by gflops 66.96%
  TN  2: 1025, 1025, 1025     1559.0 GFlops (46.98%)      9.1 GB/s ( 4.74%) limited by gflops 46.98%
  TN  3: 2048, 2048, 2048     2737.5 GFlops (82.49%)      8.0 GB/s ( 4.16%) limited by gflops 82.49%
  TN  4: 2049, 2049, 2049     2476.7 GFlops (74.63%)      7.3 GB/s ( 3.76%) limited by gflops 74.63%
  TN  5:   64, 2048,   64      414.2 GFlops (12.48%)     26.5 GB/s (13.75%) limited by memory 13.75%
  TN  6: 2048,   64, 2048     1805.5 GFlops (54.41%)     60.0 GB/s (31.12%) limited by gflops 54.41%
  TN  7: 2048, 2048,   64     1160.7 GFlops (34.98%)     38.8 GB/s (20.16%) limited by gflops 34.98%
  TN  8: 2048,   64,   64      385.2 GFlops (11.61%)     24.6 GB/s (12.79%) limited by memory 12.79%
  TN  9:   64, 2048, 2048     1840.2 GFlops (55.45%)     61.1 GB/s (31.71%) limited by gflops 55.45%
  TN 10:   64,   64, 2048       97.0 GFlops ( 2.92%)      6.2 GB/s ( 3.20%) limited by memory  3.20%
  TT  0:  512,  512,  512      797.1 GFlops (24.02%)      9.4 GB/s ( 4.85%) limited by gflops 24.02%
  TT  1: 1024, 1024, 1024     2115.2 GFlops (63.74%)     12.4 GB/s ( 6.43%) limited by gflops 63.74%
  TT  2: 1025, 1025, 1025     1583.3 GFlops (47.71%)      9.3 GB/s ( 4.81%) limited by gflops 47.71%
  TT  3: 2048, 2048, 2048     2633.0 GFlops (79.34%)      7.7 GB/s ( 4.00%) limited by gflops 79.34%
  TT  4: 2049, 2049, 2049     2514.8 GFlops (75.78%)      7.4 GB/s ( 3.82%) limited by gflops 75.78%
  TT  5:   64, 2048,   64      432.7 GFlops (13.04%)     27.7 GB/s (14.37%) limited by memory 14.37%
  TT  6: 2048,   64, 2048     1728.6 GFlops (52.09%)     57.4 GB/s (29.79%) limited by gflops 52.09%
  TT  7: 2048, 2048,   64     1154.7 GFlops (34.79%)     38.6 GB/s (20.05%) limited by gflops 34.79%
  TT  8: 2048,   64,   64      425.8 GFlops (12.83%)     27.2 GB/s (14.14%) limited by memory 14.14%
  TT  9:   64, 2048, 2048     1492.1 GFlops (44.96%)     49.6 GB/s (25.71%) limited by gflops 44.96%
  TT 10:   64,   64, 2048       84.6 GFlops ( 2.55%)      5.4 GB/s ( 2.79%) limited by memory  2.79%
Convolution
   0     effnet  forward b=64 k=3  p=1 s=1 in=480  out=480  g=480 D=14      247.8 GFlops ( 7.47%)    110.2 GB/s (57.18%) limited by memory 57.18% algo=depthwise_separable
   0     effnet bwd-data b=64 k=3  p=1 s=1 in=480  out=480  g=480 D=14       98.2 GFlops ( 2.96%)     43.7 GB/s (22.65%) limited by memory 22.65% algo=depthwise_separable
   0     effnet bwd-filt b=64 k=3  p=1 s=1 in=480  out=480  g=480 D=14       12.8 GFlops ( 0.38%)      5.7 GB/s ( 2.94%) limited by memory  2.94% algo=depthwise_separable
   1    alexnet  forward b=64 k=11 p=2 s=4 in=3    out=64   g=1   D=224    1530.8 GFlops (46.13%)     15.0 GB/s ( 7.79%) limited by gflops 46.13% algo=gemm
   1    alexnet bwd-data b=64 k=11 p=2 s=4 in=3    out=64   g=1   D=224    1071.8 GFlops (32.30%)     10.5 GB/s ( 5.45%) limited by gflops 32.30% algo=gemm
   1    alexnet bwd-filt b=64 k=11 p=2 s=4 in=3    out=64   g=1   D=224     204.1 GFlops ( 6.15%)      2.0 GB/s ( 1.04%) limited by gflops  6.15% algo=gemm
   2    alexnet  forward b=64 k=5  p=2 s=1 in=96   out=192  g=2   D=27     1615.8 GFlops (48.69%)      4.1 GB/s ( 2.13%) limited by gflops 48.69% algo=gemm
   2    alexnet bwd-data b=64 k=5  p=2 s=1 in=96   out=192  g=2   D=27     1788.1 GFlops (53.88%)      4.5 GB/s ( 2.36%) limited by gflops 53.88% algo=gemm
   2    alexnet bwd-filt b=64 k=5  p=2 s=1 in=96   out=192  g=2   D=27     1006.9 GFlops (30.34%)      2.6 GB/s ( 1.35%) limited by gflops 30.34% algo=gemm
   3    alexnet  forward b=64 k=5  p=2 s=1 in=64   out=192  g=1   D=27     2052.4 GFlops (61.85%)      3.5 GB/s ( 1.82%) limited by gflops 61.85% algo=gemm
   3    alexnet bwd-data b=64 k=5  p=2 s=1 in=64   out=192  g=1   D=27     2349.6 GFlops (70.80%)      4.0 GB/s ( 2.08%) limited by gflops 70.80% algo=gemm
   3    alexnet bwd-filt b=64 k=5  p=2 s=1 in=64   out=192  g=1   D=27      917.5 GFlops (27.65%)      1.6 GB/s ( 0.83%) limited by gflops 27.65% algo=gemm
   4    alexnet  forward b=64 k=3  p=1 s=1 in=384  out=256  g=1   D=13     2039.5 GFlops (61.46%)      3.3 GB/s ( 1.73%) limited by gflops 61.46% algo=gemm
   4    alexnet bwd-data b=64 k=3  p=1 s=1 in=384  out=256  g=1   D=13     2539.4 GFlops (76.52%)      4.1 GB/s ( 2.15%) limited by gflops 76.52% algo=gemm
   4    alexnet bwd-filt b=64 k=3  p=1 s=1 in=384  out=256  g=1   D=13     1459.0 GFlops (43.96%)      2.7 GB/s ( 1.38%) limited by gflops 43.96% algo=gemm
   5     resnet  forward b=64 k=7  p=3 s=2 in=3    out=64   g=1   D=224    1500.8 GFlops (45.22%)     24.3 GB/s (12.59%) limited by gflops 45.22% algo=gemm
   5     resnet bwd-data b=64 k=7  p=3 s=2 in=3    out=64   g=1   D=224    1095.0 GFlops (33.00%)     17.7 GB/s ( 9.18%) limited by gflops 33.00% algo=gemm
   5     resnet bwd-filt b=64 k=7  p=3 s=2 in=3    out=64   g=1   D=224      89.0 GFlops ( 2.68%)      1.4 GB/s ( 0.75%) limited by gflops  2.68% algo=gemm
   6     resnet  forward b=64 k=1  p=0 s=1 in=64   out=256  g=1   D=56     1753.6 GFlops (52.84%)     68.5 GB/s (35.56%) limited by gflops 52.84% algo=gemm
   6     resnet bwd-data b=64 k=1  p=0 s=1 in=64   out=256  g=1   D=56     2429.6 GFlops (73.21%)     94.9 GB/s (49.26%) limited by gflops 73.21% algo=gemm
   6     resnet bwd-filt b=64 k=1  p=0 s=1 in=64   out=256  g=1   D=56      320.3 GFlops ( 9.65%)     12.5 GB/s ( 6.50%) limited by gflops  9.65% algo=gemm
   7     resnet  forward b=64 k=1  p=0 s=1 in=64   out=64   g=1   D=56     1768.9 GFlops (53.30%)    110.6 GB/s (57.38%) limited by memory 57.38% algo=gemm
   7     resnet bwd-data b=64 k=1  p=0 s=1 in=64   out=64   g=1   D=56     1308.3 GFlops (39.42%)     81.8 GB/s (42.44%) limited by memory 42.44% algo=gemm
   7     resnet bwd-filt b=64 k=1  p=0 s=1 in=64   out=64   g=1   D=56       79.5 GFlops ( 2.39%)      5.0 GB/s ( 2.58%) limited by memory  2.58% algo=gemm
   8     resnet  forward b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=56     1593.1 GFlops (48.01%)     11.1 GB/s ( 5.75%) limited by gflops 48.01% algo=gemm
   8     resnet bwd-data b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=56     1387.4 GFlops (41.81%)      9.6 GB/s ( 5.01%) limited by gflops 41.81% algo=gemm
   8     resnet bwd-filt b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=56      357.9 GFlops (10.78%)      2.5 GB/s ( 1.29%) limited by gflops 10.78% algo=gemm
   9     resnet  forward b=64 k=1  p=0 s=2 in=1024 out=2048 g=1   D=14     1816.9 GFlops (54.75%)     11.8 GB/s ( 6.13%) limited by gflops 54.75% algo=gemm
   9     resnet bwd-data b=64 k=1  p=0 s=2 in=1024 out=2048 g=1   D=14     2231.4 GFlops (67.24%)     14.5 GB/s ( 7.52%) limited by gflops 67.24% algo=gemm
   9     resnet bwd-filt b=64 k=1  p=0 s=2 in=1024 out=2048 g=1   D=14     1888.2 GFlops (56.90%)     13.5 GB/s ( 6.99%) limited by gflops 56.90% algo=gemm
  10     resnet  forward b=64 k=1  p=0 s=1 in=1024 out=256  g=1   D=14     2527.9 GFlops (76.17%)     25.1 GB/s (13.02%) limited by gflops 76.17% algo=gemm
  10     resnet bwd-data b=64 k=1  p=0 s=1 in=1024 out=256  g=1   D=14     2365.8 GFlops (71.29%)     23.5 GB/s (12.18%) limited by gflops 71.29% algo=gemm
  10     resnet bwd-filt b=64 k=1  p=0 s=1 in=1024 out=256  g=1   D=14      878.5 GFlops (26.47%)      8.9 GB/s ( 4.60%) limited by gflops 26.47% algo=gemm
  11     resnet  forward b=64 k=3  p=1 s=1 in=256  out=256  g=1   D=14     2264.9 GFlops (68.25%)      4.3 GB/s ( 2.23%) limited by gflops 68.25% algo=gemm
  11     resnet bwd-data b=64 k=3  p=1 s=1 in=256  out=256  g=1   D=14     2451.8 GFlops (73.88%)      4.6 GB/s ( 2.41%) limited by gflops 73.88% algo=gemm
  11     resnet bwd-filt b=64 k=3  p=1 s=1 in=256  out=256  g=1   D=14     1604.9 GFlops (48.36%)      3.3 GB/s ( 1.71%) limited by gflops 48.36% algo=gemm
  12        vgg  forward b=64 k=3  p=1 s=1 in=3    out=64   g=1   D=224    1078.7 GFlops (32.51%)     83.7 GB/s (43.41%) limited by memory 43.41% algo=gemm
  12        vgg bwd-data b=64 k=3  p=1 s=1 in=3    out=64   g=1   D=224     463.6 GFlops (13.97%)     36.0 GB/s (18.66%) limited by memory 18.66% algo=gemm
  12        vgg bwd-filt b=64 k=3  p=1 s=1 in=3    out=64   g=1   D=224      33.4 GFlops ( 1.01%)      2.6 GB/s ( 1.34%) limited by memory  1.34% algo=gemm
  13        vgg  forward b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=224    1574.0 GFlops (47.43%)     10.9 GB/s ( 5.67%) limited by gflops 47.43% algo=gemm
  13        vgg bwd-data b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=224    1069.4 GFlops (32.22%)      7.4 GB/s ( 3.85%) limited by gflops 32.22% algo=gemm
  13        vgg bwd-filt b=64 k=3  p=1 s=1 in=64   out=64   g=1   D=224     342.8 GFlops (10.33%)      2.4 GB/s ( 1.24%) limited by gflops 10.33% algo=gemm
  14        vgg  forward b=64 k=3  p=1 s=1 in=512  out=512  g=1   D=28     2474.8 GFlops (74.57%)      2.2 GB/s ( 1.17%) limited by gflops 74.57% algo=gemm
  14        vgg bwd-data b=64 k=3  p=1 s=1 in=512  out=512  g=1   D=28     2835.9 GFlops (85.46%)      2.6 GB/s ( 1.34%) limited by gflops 85.46% algo=gemm
  14        vgg bwd-filt b=64 k=3  p=1 s=1 in=512  out=512  g=1   D=28     1990.9 GFlops (59.99%)      1.9 GB/s ( 0.98%) limited by gflops 59.99% algo=gemm
  15     mobile  forward b=64 k=3  p=1 s=2 in=3    out=32   g=1   D=224     985.9 GFlops (29.71%)    100.4 GB/s (52.11%) limited by memory 52.11% algo=gemm
  15     mobile bwd-data b=64 k=3  p=1 s=2 in=3    out=32   g=1   D=224     343.8 GFlops (10.36%)     35.0 GB/s (18.17%) limited by memory 18.17% algo=gemm
  15     mobile bwd-filt b=64 k=3  p=1 s=2 in=3    out=32   g=1   D=224      15.9 GFlops ( 0.48%)      1.6 GB/s ( 0.84%) limited by memory  0.84% algo=gemm
  16     mobile  forward b=64 k=3  p=1 s=1 in=144  out=144  g=144 D=56      291.5 GFlops ( 8.78%)    129.6 GB/s (67.23%) limited by memory 67.23% algo=depthwise_separable
  16     mobile bwd-data b=64 k=3  p=1 s=1 in=144  out=144  g=144 D=56       72.0 GFlops ( 2.17%)     32.0 GB/s (16.60%) limited by memory 16.60% algo=depthwise_separable
  16     mobile bwd-filt b=64 k=3  p=1 s=1 in=144  out=144  g=144 D=56       69.2 GFlops ( 2.08%)     30.8 GB/s (15.96%) limited by memory 15.96% algo=depthwise_separable
  17     mobile  forward b=64 k=3  p=1 s=2 in=144  out=144  g=144 D=56       13.5 GFlops ( 0.41%)     15.0 GB/s ( 7.81%) limited by memory  7.81% algo=gemm
  17     mobile bwd-data b=64 k=3  p=1 s=2 in=144  out=144  g=144 D=56       10.6 GFlops ( 0.32%)     11.8 GB/s ( 6.13%) limited by memory  6.13% algo=gemm
  17     mobile bwd-filt b=64 k=3  p=1 s=2 in=144  out=144  g=144 D=56       33.1 GFlops ( 1.00%)     36.8 GB/s (19.11%) limited by memory 19.11% algo=gemm
  18     mobile  forward b=64 k=1  p=0 s=1 in=144  out=24   g=1   D=56     1121.4 GFlops (33.79%)    109.0 GB/s (56.58%) limited by memory 56.58% algo=gemm
  18     mobile bwd-data b=64 k=1  p=0 s=1 in=144  out=24   g=1   D=56      270.5 GFlops ( 8.15%)     26.3 GB/s (13.65%) limited by memory 13.65% algo=gemm
  18     mobile bwd-filt b=64 k=1  p=0 s=1 in=144  out=24   g=1   D=56      169.3 GFlops ( 5.10%)     16.5 GB/s ( 8.54%) limited by memory  8.54% algo=gemm
  19     mobile  forward b=64 k=1  p=0 s=1 in=24   out=144  g=1   D=56     1101.7 GFlops (33.20%)    107.1 GB/s (55.59%) limited by memory 55.59% algo=gemm
  19     mobile bwd-data b=64 k=1  p=0 s=1 in=24   out=144  g=1   D=56      944.2 GFlops (28.45%)     91.8 GB/s (47.64%) limited by memory 47.64% algo=gemm
  19     mobile bwd-filt b=64 k=1  p=0 s=1 in=24   out=144  g=1   D=56      172.4 GFlops ( 5.19%)     16.8 GB/s ( 8.70%) limited by memory  8.70% algo=gemm
  20     mobile  forward b=64 k=1  p=0 s=1 in=960  out=160  g=1   D=7      1764.3 GFlops (53.16%)     26.9 GB/s (13.94%) limited by gflops 53.16% algo=gemm
  20     mobile bwd-data b=64 k=1  p=0 s=1 in=960  out=160  g=1   D=7      1771.5 GFlops (53.38%)     27.0 GB/s (13.99%) limited by gflops 53.38% algo=gemm
  20     mobile bwd-filt b=64 k=1  p=0 s=1 in=960  out=160  g=1   D=7       577.7 GFlops (17.41%)      9.2 GB/s ( 4.75%) limited by gflops 17.41% algo=gemm
  21     mobile  forward b=64 k=1  p=0 s=1 in=960  out=320  g=1   D=7      2157.6 GFlops (65.02%)     19.4 GB/s (10.04%) limited by gflops 65.02% algo=gemm
  21     mobile bwd-data b=64 k=1  p=0 s=1 in=960  out=320  g=1   D=7      2135.8 GFlops (64.36%)     19.2 GB/s ( 9.94%) limited by gflops 64.36% algo=gemm
  21     mobile bwd-filt b=64 k=1  p=0 s=1 in=960  out=320  g=1   D=7       988.3 GFlops (29.78%)      9.5 GB/s ( 4.93%) limited by gflops 29.78% algo=gemm
  22     mobile  forward b=64 k=3  p=1 s=1 in=960  out=960  g=960 D=7       156.9 GFlops ( 4.73%)     69.8 GB/s (36.24%) limited by memory 36.24% algo=depthwise_separable
  22     mobile bwd-data b=64 k=3  p=1 s=1 in=960  out=960  g=960 D=7        68.4 GFlops ( 2.06%)     30.4 GB/s (15.80%) limited by memory 15.80% algo=depthwise_separable
  22     mobile bwd-filt b=64 k=3  p=1 s=1 in=960  out=960  g=960 D=7        31.2 GFlops ( 0.94%)     13.9 GB/s ( 7.22%) limited by memory  7.22% algo=depthwise_separable
  23      scale  forward b=64 k=1  p=0 s=1 in=256  out=256  g=256 D=56       48.1 GFlops ( 1.45%)    192.4 GB/s (99.86%) limited by memory 99.86% algo=depthwise_separable
  23      scale bwd-data b=64 k=1  p=0 s=1 in=256  out=256  g=256 D=56       28.6 GFlops ( 0.86%)    114.2 GB/s (59.28%) limited by memory 59.28% algo=depthwise_separable
  23      scale bwd-filt b=64 k=1  p=0 s=1 in=256  out=256  g=256 D=56       49.7 GFlops ( 1.50%)    198.9 GB/s (103.23%) limited by memory 103.23% algo=depthwise_separable
  24      scale  forward b=64 k=1  p=0 s=1 in=1024 out=1024 g=1024 D=7        46.5 GFlops ( 1.40%)    186.2 GB/s (96.62%) limited by memory 96.62% algo=depthwise_separable
  24      scale bwd-data b=64 k=1  p=0 s=1 in=1024 out=1024 g=1024 D=7        21.1 GFlops ( 0.64%)     84.5 GB/s (43.85%) limited by memory 43.85% algo=depthwise_separable
  24      scale bwd-filt b=64 k=1  p=0 s=1 in=1024 out=1024 g=1024 D=7        10.2 GFlops ( 0.31%)     40.8 GB/s (21.18%) limited by memory 21.18% algo=depthwise_separable
Broadcast/Reduce
     float (64,512,24,24)  (64,512,24,24)  (64,512,24,24)        50.0 GFlops ( 1.51%)    300.1 GB/s (155.73%) limited by memory 155.73%
     float (64,512,24,24)  (512,1,1)       (64,512,24,24)        62.3 GFlops ( 1.88%)    249.3 GB/s (129.36%) limited by memory 129.36%
     float (64,512,24,24)  (1,512,1,1)     (1,512,1,1)           43.4 GFlops ( 1.31%)     57.9 GB/s (30.03%) limited by memory 30.03%
     float (64,512,24,24)  (64,512,24,24)  (1,512,1,1)           39.2 GFlops ( 1.18%)    104.6 GB/s (54.30%) limited by memory 54.30%
     float (64,512,24,24)  (64,512,24,24)  (64,1,1,1)            93.0 GFlops ( 2.80%)    247.9 GB/s (128.66%) limited by memory 128.66%
     float (256,1000)      (256,1)         (1)                   19.7 GFlops ( 0.59%)     26.3 GB/s (13.65%) limited by memory 13.65%
      long (64,512,24,24)  (64,512,24,24)  (64,512,24,24)        26.2 GFlops ( 0.79%)    314.9 GB/s (163.40%) limited by memory 163.40%
      long (64,512,24,24)  (512,1,1)       (64,512,24,24)        37.8 GFlops ( 1.14%)    302.8 GB/s (157.11%) limited by memory 157.11%
      long (64,512,24,24)  (1,512,1,1)     (1,512,1,1)           31.4 GFlops ( 0.94%)     83.6 GB/s (43.39%) limited by memory 43.39%
      long (64,512,24,24)  (64,512,24,24)  (1,512,1,1)           15.7 GFlops ( 0.47%)     83.7 GB/s (43.44%) limited by memory 43.44%
      long (64,512,24,24)  (64,512,24,24)  (64,1,1,1)            39.7 GFlops ( 1.20%)    212.0 GB/s (109.99%) limited by memory 109.99%
      long (256,1000)      (256,1)         (1)                   18.1 GFlops ( 0.54%)     48.3 GB/s (25.05%) limited by memory 25.05%
     short (64,512,24,24)  (64,512,24,24)  (64,512,24,24)        76.9 GFlops ( 2.32%)    230.7 GB/s (119.73%) limited by memory 119.73%
     short (64,512,24,24)  (512,1,1)       (64,512,24,24)        66.8 GFlops ( 2.01%)    133.6 GB/s (69.30%) limited by memory 69.30%
     short (64,512,24,24)  (1,512,1,1)     (1,512,1,1)           43.7 GFlops ( 1.32%)     29.1 GB/s (15.12%) limited by memory 15.12%
     short (64,512,24,24)  (64,512,24,24)  (1,512,1,1)           37.7 GFlops ( 1.14%)     50.3 GB/s (26.10%) limited by memory 26.10%
     short (64,512,24,24)  (64,512,24,24)  (64,1,1,1)           114.7 GFlops ( 3.46%)    153.0 GB/s (79.38%) limited by memory 79.38%
     short (256,1000)      (256,1)         (1)                   20.5 GFlops ( 0.62%)     13.7 GB/s ( 7.09%) limited by memory  7.09%
davidlaxer@x86_64-apple-darwin13 build % 
dbl001 commented 1 year ago

Can you build the dlprimitives (outside of pytorch) and run some tests to see that it works properly?

What other tests would you like me to run?

artyom-beilis commented 1 year ago

LastTest.log.gz

You are running on device 0:0 instead of 0:1

See:

5/33 Testing: test_test_case_conv2d
5/33 Test: test_test_case_conv2d
Command: "/Users/davidlaxer/pytorch_dlprim/dlprimitives/build/test_from_template" "0:0" "/Users/davidlaxer/pytorch_dlprim/dlprimitives/tests/test_case_conv2d.json"
Directory: /Users/davidlaxer/pytorch_dlprim/dlprimitives/build
"test_test_case_conv2d" start time: Jan 26 10:10 PST
Output:
----------------------------------------------------------
Running tests for operator Convolution2D on Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz on Apple

See, as I mentioned before: the tests are running on the CPU device.

(Note: you'll probably need to set the cmake parameter to something like TEST_DEV=1:0 for platform 1, device 0, according to clinfo -l.)

I think you need to rerun cmake .. -DTEST_DEV=0:1

And then run make test.
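
If clinfo isn't handy, a minimal sketch along these lines can print every platform:device pair in the format that TEST_DEV (and dlprim_flops) expect. This assumes the OpenCL-CLHPP header built earlier in this thread; the header path, file name, and build command are illustrative, not part of the project.

#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/opencl.hpp>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    // Enumerate all OpenCL platforms, then all devices on each platform.
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    for (std::size_t p = 0; p < platforms.size(); ++p) {
        std::vector<cl::Device> devices;
        platforms[p].getDevices(CL_DEVICE_TYPE_ALL, &devices);
        for (std::size_t d = 0; d < devices.size(); ++d) {
            // The printed "p:d" pair is the index to pass as -DTEST_DEV=p:d.
            std::cout << p << ":" << d << "  "
                      << devices[d].getInfo<CL_DEVICE_NAME>() << std::endl;
        }
    }
    return 0;
}

On macOS this would typically be compiled with something like clang++ list_devices.cpp -framework OpenCL; pick the pair that names the Radeon rather than the Intel CPU.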

artyom-beilis commented 1 year ago

% dlprim_flops 0:1 0.5
Testing on AMD Radeon Pro 5700 XT Compute Engine on Apple

What I see in the benchmark is that it does not use the Winograd convolution kernel, which is very important for resnet performance.

What I see in the clinfo log:

Device Name         AMD Radeon Pro 5700 XT Compute Engine
Device Vendor       AMD
Device Vendor ID    0x1021e00

When I check for Winograd compatibility I use:

static bool is_winograd_compatible(Context &ctx,Conv2DSettings const &config)
{
    if(!ctx.is_amd() && !ctx.is_nvidia())
        return false;

And to check AMD I do:

bool Context::is_amd()
{
    if(is_cpu_context())
        return false;
    return device().getInfo<CL_DEVICE_VENDOR_ID>() == 0x1002;
    //return device_extensions().find("cl_amd_") != std::string::npos;
}

While the vendor ID is clearly not the same...

I need to think about how to fix it.

artyom-beilis commented 1 year ago

Can you try to change the line:

return device().getInfo<CL_DEVICE_VENDOR_ID>() == 0x1002;

To something like:

auto vendor_id = device().getInfo<CL_DEVICE_VENDOR_ID>() ;
return vendor_id == 0x1002 || vendor_id == 0x1021e00;

And then rerun flops to see whether some of the kernels now use Winograd convolution and not only gemm.
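
If hard-coding Apple's 0x1021e00 turns out to be fragile, another possible direction (only a sketch here, not the project's actual fix, echoing the commented-out extension check in Context::is_amd; the helper name is hypothetical) is to fall back to the vendor string or the cl_amd_* extensions:

#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/opencl.hpp>
#include <string>

// Hypothetical helper: treat a device as AMD if the canonical PCI vendor ID
// matches, the vendor string mentions AMD (as Apple's driver reports), or the
// device exposes any cl_amd_* extension.
static bool looks_like_amd(cl::Device const &dev)
{
    if (dev.getInfo<CL_DEVICE_VENDOR_ID>() == 0x1002)
        return true;
    std::string vendor = dev.getInfo<CL_DEVICE_VENDOR>();
    if (vendor.find("AMD") != std::string::npos
        || vendor.find("Advanced Micro Devices") != std::string::npos)
        return true;
    std::string ext = dev.getInfo<CL_DEVICE_EXTENSIONS>();
    return ext.find("cl_amd_") != std::string::npos;
}

Either way, the goal is the same: have is_winograd_compatible recognize the Radeon so the Winograd kernels are selected.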

dbl001 commented 1 year ago
% make test
Running tests...
/opt/local/bin/ctest --force-new-ctest-process 
Test project /Users/davidlaxer/pytorch_dlprim/dlprimitives/build
      Start  1: test_test_case_abs
 1/33 Test  #1: test_test_case_abs ...............   Passed    1.03 sec
      Start  2: test_test_case_activation
 2/33 Test  #2: test_test_case_activation ........   Passed    1.36 sec
      Start  3: test_test_case_batchnorm
 3/33 Test  #3: test_test_case_batchnorm .........   Passed    5.06 sec
      Start  4: test_test_case_concat
 4/33 Test  #4: test_test_case_concat ............   Passed    0.11 sec
      Start  5: test_test_case_conv2d
 5/33 Test  #5: test_test_case_conv2d ............   Passed  124.81 sec
      Start  6: test_test_case_conv2d_dsc
 6/33 Test  #6: test_test_case_conv2d_dsc ........   Passed   35.64 sec
      Start  7: test_test_case_conv2d_gemm
 7/33 Test  #7: test_test_case_conv2d_gemm .......   Passed   40.40 sec
      Start  8: test_test_case_conv2d_win
 8/33 Test  #8: test_test_case_conv2d_win ........   Passed   35.59 sec
      Start  9: test_test_case_elementwise
 9/33 Test  #9: test_test_case_elementwise .......   Passed    6.52 sec
      Start 10: test_test_case_global_pooling
10/33 Test #10: test_test_case_global_pooling ....   Passed    5.56 sec
      Start 11: test_test_case_hardtanh
11/33 Test #11: test_test_case_hardtanh ..........   Passed    0.58 sec
      Start 12: test_test_case_inner_product
12/33 Test #12: test_test_case_inner_product .....   Passed   14.59 sec
      Start 13: test_test_case_log_softmax
13/33 Test #13: test_test_case_log_softmax .......   Passed    0.31 sec
      Start 14: test_test_case_mse_loss
14/33 Test #14: test_test_case_mse_loss ..........   Passed    0.32 sec
      Start 15: test_test_case_nll_loss
15/33 Test #15: test_test_case_nll_loss ..........   Passed    0.38 sec
      Start 16: test_test_case_param
16/33 Test #16: test_test_case_param .............   Passed    0.10 sec
      Start 17: test_test_case_pooling2d
17/33 Test #17: test_test_case_pooling2d .........   Passed   69.22 sec
      Start 18: test_test_case_reduction
18/33 Test #18: test_test_case_reduction .........   Passed   19.95 sec
      Start 19: test_test_case_slice
19/33 Test #19: test_test_case_slice .............   Passed    0.07 sec
      Start 20: test_test_case_softmax
20/33 Test #20: test_test_case_softmax ...........   Passed    0.44 sec
      Start 21: test_test_case_softmax_loss
21/33 Test #21: test_test_case_softmax_loss ......   Passed    0.24 sec
      Start 22: test_test_case_threshold
22/33 Test #22: test_test_case_threshold .........   Passed    0.55 sec
      Start 23: test_test_case_tr_conv2d
23/33 Test #23: test_test_case_tr_conv2d .........   Passed   40.02 sec
      Start 24: test_test_case_tr_conv2d_dsc
24/33 Test #24: test_test_case_tr_conv2d_dsc .....   Passed    2.00 sec
      Start 25: test_test_case_tr_conv2d_gemm
25/33 Test #25: test_test_case_tr_conv2d_gemm ....   Passed    2.99 sec
      Start 26: test_test_case_tr_conv2d_win
26/33 Test #26: test_test_case_tr_conv2d_win .....   Passed    2.00 sec
      Start 27: test_net
27/33 Test #27: test_net .........................   Passed    1.69 sec
      Start 28: test_net_nonopt
28/33 Test #28: test_net_nonopt ..................   Passed    0.09 sec
      Start 29: test_json
29/33 Test #29: test_json ........................   Passed    0.13 sec
      Start 30: test_random
30/33 Test #30: test_random ......................   Passed    0.16 sec
      Start 31: test_context
31/33 Test #31: test_context .....................   Passed    0.20 sec
      Start 32: test_util
32/33 Test #32: test_util ........................   Passed   10.22 sec
      Start 33: test_broadcast_reduce
33/33 Test #33: test_broadcast_reduce ............   Passed    7.90 sec

100% tests passed, 0 tests failed out of 33

Total Test time (real) = 430.26 sec
davidlaxer@x86_64-apple-darwin13 build % 
dbl001 commented 1 year ago

... and after editing context.cpp:

 bool Context::is_amd()
    {
        if(is_cpu_context())
            return false;
        auto vendor_id = device().getInfo<CL_DEVICE_VENDOR_ID>() ;
        return vendor_id == 0x1002 || vendor_id == 0x1021e00;
        //return device_extensions().find("cl_amd_") != std::string::npos;
    }

Test

% make test
Running tests...
/opt/local/bin/ctest --force-new-ctest-process 
Test project /Users/davidlaxer/pytorch_dlprim/dlprimitives/build
      Start  1: test_test_case_abs
 1/33 Test  #1: test_test_case_abs ...............   Passed    0.76 sec
      Start  2: test_test_case_activation
 2/33 Test  #2: test_test_case_activation ........   Passed    1.17 sec
      Start  3: test_test_case_batchnorm
 3/33 Test  #3: test_test_case_batchnorm .........   Passed    4.82 sec
      Start  4: test_test_case_concat
 4/33 Test  #4: test_test_case_concat ............   Passed    0.08 sec
      Start  5: test_test_case_conv2d
 5/33 Test  #5: test_test_case_conv2d ............***Failed    2.18 sec
      Start  6: test_test_case_conv2d_dsc
 6/33 Test  #6: test_test_case_conv2d_dsc ........   Passed   58.13 sec
      Start  7: test_test_case_conv2d_gemm
 7/33 Test  #7: test_test_case_conv2d_gemm .......   Passed   34.97 sec
      Start  8: test_test_case_conv2d_win
 8/33 Test  #8: test_test_case_conv2d_win ........***Failed    0.54 sec
      Start  9: test_test_case_elementwise
 9/33 Test  #9: test_test_case_elementwise .......   Passed    0.63 sec
      Start 10: test_test_case_global_pooling
10/33 Test #10: test_test_case_global_pooling ....   Passed    4.06 sec
      Start 11: test_test_case_hardtanh
11/33 Test #11: test_test_case_hardtanh ..........   Passed    0.53 sec
      Start 12: test_test_case_inner_product
12/33 Test #12: test_test_case_inner_product .....   Passed   17.31 sec
      Start 13: test_test_case_log_softmax
13/33 Test #13: test_test_case_log_softmax .......   Passed    0.16 sec
      Start 14: test_test_case_mse_loss
14/33 Test #14: test_test_case_mse_loss ..........   Passed    0.08 sec
      Start 15: test_test_case_nll_loss
15/33 Test #15: test_test_case_nll_loss ..........   Passed    0.24 sec
      Start 16: test_test_case_param
16/33 Test #16: test_test_case_param .............   Passed    0.07 sec
      Start 17: test_test_case_pooling2d
17/33 Test #17: test_test_case_pooling2d .........   Passed   68.12 sec
      Start 18: test_test_case_reduction
18/33 Test #18: test_test_case_reduction .........   Passed   16.88 sec
      Start 19: test_test_case_slice
19/33 Test #19: test_test_case_slice .............   Passed    0.10 sec
      Start 20: test_test_case_softmax
20/33 Test #20: test_test_case_softmax ...........   Passed    0.18 sec
      Start 21: test_test_case_softmax_loss
21/33 Test #21: test_test_case_softmax_loss ......   Passed    0.14 sec
      Start 22: test_test_case_threshold
22/33 Test #22: test_test_case_threshold .........   Passed    0.49 sec
      Start 23: test_test_case_tr_conv2d
23/33 Test #23: test_test_case_tr_conv2d .........***Failed    0.75 sec
      Start 24: test_test_case_tr_conv2d_dsc
24/33 Test #24: test_test_case_tr_conv2d_dsc .....   Passed    7.31 sec
      Start 25: test_test_case_tr_conv2d_gemm
25/33 Test #25: test_test_case_tr_conv2d_gemm ....   Passed    2.36 sec
      Start 26: test_test_case_tr_conv2d_win
26/33 Test #26: test_test_case_tr_conv2d_win .....***Failed    0.79 sec
      Start 27: test_net
27/33 Test #27: test_net .........................Subprocess aborted***Exception:   0.30 sec
      Start 28: test_net_nonopt
28/33 Test #28: test_net_nonopt ..................Subprocess aborted***Exception:   0.06 sec
      Start 29: test_json
29/33 Test #29: test_json ........................   Passed    0.13 sec
      Start 30: test_random
30/33 Test #30: test_random ......................   Passed    0.17 sec
      Start 31: test_context
31/33 Test #31: test_context .....................   Passed    0.16 sec
      Start 32: test_util
32/33 Test #32: test_util ........................   Passed    9.08 sec
      Start 33: test_broadcast_reduce
33/33 Test #33: test_broadcast_reduce ............   Passed    1.35 sec

82% tests passed, 6 tests failed out of 33

Total Test time (real) = 234.11 sec

The following tests FAILED:
      5 - test_test_case_conv2d (Failed)
      8 - test_test_case_conv2d_win (Failed)
     23 - test_test_case_tr_conv2d (Failed)
     26 - test_test_case_tr_conv2d_win (Failed)
     27 - test_net (Subprocess aborted)
     28 - test_net_nonopt (Subprocess aborted)
Errors while running CTest
Output from these tests are in: /Users/davidlaxer/pytorch_dlprim/dlprimitives/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
make: *** [test] Error 8
davidlaxer@x86_64-apple-darwin13 build % 

LastTest.log.gz