dbl001 opened 1 year ago
I know somebody ran dlprimitives on M1; unless there are some bugs, it should run.
I suggest trying to build it. Start from dlprimitives to check that the backend works, and then build the pytorch backend. Or you can start directly from pytorch - it shouldn't be a big problem.
Are the torch.float64 and torch.cfloat data types supported?
No, only 32-bit float is supported at this point.
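Since only float32 is supported, a caller-side guard can catch unsupported dtypes before tensors reach the backend. A minimal sketch, assuming a simple string-based dtype check (the helper itself is hypothetical, not part of pytorch_dlprim):

```python
# Hypothetical guard: the backend only handles 32-bit floats,
# so reject (or optionally downcast) anything else before dispatch.
SUPPORTED_DTYPES = {"float32"}
DOWNCASTABLE = {"float64", "float16"}

def check_dtype(dtype_name, allow_downcast=False):
    """Return the dtype name to use, or raise if unsupported."""
    if dtype_name in SUPPORTED_DTYPES:
        return dtype_name
    if allow_downcast and dtype_name in DOWNCASTABLE:
        return "float32"  # silently narrow to the supported type
    raise TypeError(f"dtype {dtype_name!r} is not supported by this backend")
```

With `allow_downcast=True`, double-precision inputs would be narrowed rather than rejected; complex types like cfloat have no float32 equivalent and always raise.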
I installed OpenCL-CLHPP (e.g. https://github.com/KhronosGroup/OpenCL-CLHPP):
% make test
Running tests...
Test project /Users/davidlaxer/OpenCL-CLHPP/build
Start 1: test_openclhpp_120
1/45 Test #1: test_openclhpp_120 ............................................................... Passed 0.11 sec
Start 2: test_openclhpp_120_CL_HPP_ENABLE_EXCEPTIONS
2/45 Test #2: test_openclhpp_120_CL_HPP_ENABLE_EXCEPTIONS ...................................... Passed 0.11 sec
Start 3: test_openclhpp_120_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY
3/45 Test #3: test_openclhpp_120_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY ............................ Passed 0.07 sec
Start 4: test_openclhpp_120_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY
4/45 Test #4: test_openclhpp_120_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY ... Passed 0.07 sec
Start 5: test_openclhpp_120_CL_HPP_CL_1_2_DEFAULT_BUILD
5/45 Test #5: test_openclhpp_120_CL_HPP_CL_1_2_DEFAULT_BUILD ................................... Passed 0.07 sec
Start 6: test_openclhpp_120_CL_HPP_USE_CL_DEVICE_FISSION
6/45 Test #6: test_openclhpp_120_CL_HPP_USE_CL_DEVICE_FISSION .................................. Passed 0.07 sec
Start 7: test_openclhpp_120_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR
7/45 Test #7: test_openclhpp_120_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR ......................... Passed 0.07 sec
Start 8: test_openclhpp_120_CL_HPP_USE_CL_SUB_GROUPS_KHR
8/45 Test #8: test_openclhpp_120_CL_HPP_USE_CL_SUB_GROUPS_KHR .................................. Passed 0.07 sec
Start 9: test_openclhpp_120_CL_HPP_USE_IL_KHR
9/45 Test #9: test_openclhpp_120_CL_HPP_USE_IL_KHR ............................................. Passed 0.07 sec
Start 10: test_openclhpp_200
10/45 Test #10: test_openclhpp_200 ............................................................... Passed 0.11 sec
Start 11: test_openclhpp_200_CL_HPP_ENABLE_EXCEPTIONS
11/45 Test #11: test_openclhpp_200_CL_HPP_ENABLE_EXCEPTIONS ...................................... Passed 0.11 sec
Start 12: test_openclhpp_200_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY
12/45 Test #12: test_openclhpp_200_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY ............................ Passed 0.07 sec
Start 13: test_openclhpp_200_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY
13/45 Test #13: test_openclhpp_200_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY ... Passed 0.07 sec
Start 14: test_openclhpp_200_CL_HPP_CL_1_2_DEFAULT_BUILD
14/45 Test #14: test_openclhpp_200_CL_HPP_CL_1_2_DEFAULT_BUILD ................................... Passed 0.07 sec
Start 15: test_openclhpp_200_CL_HPP_USE_CL_DEVICE_FISSION
15/45 Test #15: test_openclhpp_200_CL_HPP_USE_CL_DEVICE_FISSION .................................. Passed 0.07 sec
Start 16: test_openclhpp_200_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR
16/45 Test #16: test_openclhpp_200_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR ......................... Passed 0.07 sec
Start 17: test_openclhpp_200_CL_HPP_USE_CL_SUB_GROUPS_KHR
17/45 Test #17: test_openclhpp_200_CL_HPP_USE_CL_SUB_GROUPS_KHR .................................. Passed 0.07 sec
Start 18: test_openclhpp_200_CL_HPP_USE_IL_KHR
18/45 Test #18: test_openclhpp_200_CL_HPP_USE_IL_KHR ............................................. Passed 0.07 sec
Start 19: test_openclhpp_210
19/45 Test #19: test_openclhpp_210 ............................................................... Passed 0.11 sec
Start 20: test_openclhpp_210_CL_HPP_ENABLE_EXCEPTIONS
20/45 Test #20: test_openclhpp_210_CL_HPP_ENABLE_EXCEPTIONS ...................................... Passed 0.10 sec
Start 21: test_openclhpp_210_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY
21/45 Test #21: test_openclhpp_210_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY ............................ Passed 0.07 sec
Start 22: test_openclhpp_210_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY
22/45 Test #22: test_openclhpp_210_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY ... Passed 0.07 sec
Start 23: test_openclhpp_210_CL_HPP_CL_1_2_DEFAULT_BUILD
23/45 Test #23: test_openclhpp_210_CL_HPP_CL_1_2_DEFAULT_BUILD ................................... Passed 0.07 sec
Start 24: test_openclhpp_210_CL_HPP_USE_CL_DEVICE_FISSION
24/45 Test #24: test_openclhpp_210_CL_HPP_USE_CL_DEVICE_FISSION .................................. Passed 0.07 sec
Start 25: test_openclhpp_210_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR
25/45 Test #25: test_openclhpp_210_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR ......................... Passed 0.07 sec
Start 26: test_openclhpp_210_CL_HPP_USE_CL_SUB_GROUPS_KHR
26/45 Test #26: test_openclhpp_210_CL_HPP_USE_CL_SUB_GROUPS_KHR .................................. Passed 0.07 sec
Start 27: test_openclhpp_210_CL_HPP_USE_IL_KHR
27/45 Test #27: test_openclhpp_210_CL_HPP_USE_IL_KHR ............................................. Passed 0.07 sec
Start 28: test_openclhpp_220
28/45 Test #28: test_openclhpp_220 ............................................................... Passed 0.12 sec
Start 29: test_openclhpp_220_CL_HPP_ENABLE_EXCEPTIONS
29/45 Test #29: test_openclhpp_220_CL_HPP_ENABLE_EXCEPTIONS ...................................... Passed 0.10 sec
Start 30: test_openclhpp_220_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY
30/45 Test #30: test_openclhpp_220_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY ............................ Passed 0.07 sec
Start 31: test_openclhpp_220_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY
31/45 Test #31: test_openclhpp_220_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY ... Passed 0.07 sec
Start 32: test_openclhpp_220_CL_HPP_CL_1_2_DEFAULT_BUILD
32/45 Test #32: test_openclhpp_220_CL_HPP_CL_1_2_DEFAULT_BUILD ................................... Passed 0.07 sec
Start 33: test_openclhpp_220_CL_HPP_USE_CL_DEVICE_FISSION
33/45 Test #33: test_openclhpp_220_CL_HPP_USE_CL_DEVICE_FISSION .................................. Passed 0.07 sec
Start 34: test_openclhpp_220_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR
34/45 Test #34: test_openclhpp_220_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR ......................... Passed 0.07 sec
Start 35: test_openclhpp_220_CL_HPP_USE_CL_SUB_GROUPS_KHR
35/45 Test #35: test_openclhpp_220_CL_HPP_USE_CL_SUB_GROUPS_KHR .................................. Passed 0.07 sec
Start 36: test_openclhpp_220_CL_HPP_USE_IL_KHR
36/45 Test #36: test_openclhpp_220_CL_HPP_USE_IL_KHR ............................................. Passed 0.07 sec
Start 37: test_openclhpp_300
37/45 Test #37: test_openclhpp_300 ............................................................... Passed 0.11 sec
Start 38: test_openclhpp_300_CL_HPP_ENABLE_EXCEPTIONS
38/45 Test #38: test_openclhpp_300_CL_HPP_ENABLE_EXCEPTIONS ...................................... Passed 0.10 sec
Start 39: test_openclhpp_300_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY
39/45 Test #39: test_openclhpp_300_CL_HPP_ENABLE_SIZE_T_COMPATIBILITY ............................ Passed 0.07 sec
Start 40: test_openclhpp_300_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY
40/45 Test #40: test_openclhpp_300_CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY ... Passed 0.07 sec
Start 41: test_openclhpp_300_CL_HPP_CL_1_2_DEFAULT_BUILD
41/45 Test #41: test_openclhpp_300_CL_HPP_CL_1_2_DEFAULT_BUILD ................................... Passed 0.07 sec
Start 42: test_openclhpp_300_CL_HPP_USE_CL_DEVICE_FISSION
42/45 Test #42: test_openclhpp_300_CL_HPP_USE_CL_DEVICE_FISSION .................................. Passed 0.07 sec
Start 43: test_openclhpp_300_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR
43/45 Test #43: test_openclhpp_300_CL_HPP_USE_CL_IMAGE2D_FROM_BUFFER_KHR ......................... Passed 0.07 sec
Start 44: test_openclhpp_300_CL_HPP_USE_CL_SUB_GROUPS_KHR
44/45 Test #44: test_openclhpp_300_CL_HPP_USE_CL_SUB_GROUPS_KHR .................................. Passed 0.07 sec
Start 45: test_openclhpp_300_CL_HPP_USE_IL_KHR
45/45 Test #45: test_openclhpp_300_CL_HPP_USE_IL_KHR ............................................. Passed 0.07 sec
100% tests passed, 0 tests failed out of 45
Total Test time (real) = 3.47 sec
I set:
$ export OCL_PATH=/Users/davidlaxer/OpenCL-CLHPP/include/CL
$ ls -l $OCL_PATH
total 664
-rw-r--r-- 1 davidlaxer staff 786 Jan 24 12:17 cl2.hpp
-rw-r--r-- 1 davidlaxer staff 334369 Jan 24 12:17 opencl.hpp
% env | grep OCL
OCL_PATH=/Users/davidlaxer/OpenCL-CLHPP/include/CL
But I'm getting this build error:
(AI-Feynman) davidlaxer@x86_64-apple-darwin13 build % cmake -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DINCLUDE_DIRS="/Users/davidlaxer/OpenCL-CLHPP/include/CL" -DCMAKE_PREFIX_PATH=/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/share/cmake/Torch ..
-- Caffe2: Found protobuf with new-style protobuf targets.
-- Caffe2: Protobuf version 3.20.1
-- MKL_ARCH: intel64
-- MKL_ROOT /opt/intel/oneapi/mkl/2021.3.0
-- MKL_LINK: dynamic
-- MKL_INTERFACE_FULL: intel_ilp64
-- MKL_THREADING: intel_thread
-- MKL_MPI: mpich
CMake Warning at /Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
CMakeLists.txt:4 (find_package)
=== Status ===
OpenCL: include OCL_PATH-NOTFOUND
lib /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.0.sdk/System/Library/Frameworks/OpenCL.framework
Python: /Users/davidlaxer/anaconda3/envs/AI-Feynman/bin/python3
BLAS: None
HDF5: None
Sqlite3: include /Users/davidlaxer/anaconda3/envs/AI-Feynman/include
lib /Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/libsqlite3.dylib
Protobuf (onnx): disabled
Python dlprim: disabled
-- Configuring done
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
OCL_PATH
used as include directory in directory /Users/davidlaxer/pytorch_dlprim
used as include directory in directory /Users/davidlaxer/pytorch_dlprim
used as include directory in directory /Users/davidlaxer/pytorch_dlprim
used as include directory in directory /Users/davidlaxer/pytorch_dlprim
used as include directory in directory /Users/davidlaxer/pytorch_dlprim
used as include directory in directory /Users/davidlaxer/pytorch_dlprim
used as include directory in directory /Users/davidlaxer/pytorch_dlprim
used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives
used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives
used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives
used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives
used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives
used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives
used as include directory in directory /Users/davidlaxer/pytorch_dlprim/dlprimitives
CMake Error in CMakeLists.txt:
Found relative path while evaluating include directories of "pt_ocl":
"OCL_PATH-NOTFOUND"
CMake Error in CMakeLists.txt:
Found relative path while evaluating include directories of "pt_ocl":
"OCL_PATH-NOTFOUND"
CMake Error in dlprimitives/CMakeLists.txt:
Found relative path while evaluating include directories of "dlprim_core":
"OCL_PATH-NOTFOUND"
CMake Error in dlprimitives/CMakeLists.txt:
Found relative path while evaluating include directories of "dlprim_core":
"OCL_PATH-NOTFOUND"
-- Generating done
CMake Generate step failed. Build files cannot be regenerated correctly.
What step am I missing?
I 'hacked' /Users/davidlaxer/pytorch_dlprim/dlprimitives/include/dlprim/opencl_include.hpp:
# ifdef __APPLE__
//# include <OpenCL/cl2.hpp>
# include <CL/cl2.hpp>
Ran:
% cmake -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DOCL_PATH="/Users/davidlaxer/OpenCL-CLHPP/include/" -DCMAKE_PREFIX_PATH=/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/share/cmake/Torch ..
$ make
$ make install
It built and installed without errors. When I tried to test:
% python mnist.py --device ocl:0
Traceback (most recent call last):
File "/Users/davidlaxer/pytorch_dlprim/mnist.py", line 162, in <module>
main()
File "/Users/davidlaxer/pytorch_dlprim/mnist.py", line 121, in main
torch.ops.load_library("build/libpt_ocl.so")
File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/_ops.py", line 640, in load_library
ctypes.CDLL(path)
File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/ctypes/__init__.py", line 374, in __init__
self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/davidlaxer/pytorch_dlprim/build/libpt_ocl.so, 0x0006): tried: '/Users/davidlaxer/pytorch_dlprim/build/libpt_ocl.so' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/davidlaxer/pytorch_dlprim/build/libpt_ocl.so' (no such file), '/Users/davidlaxer/pytorch_dlprim/build/libpt_ocl.so' (no such file)
(AI-Feynman) davidlaxer@x86_64-apple-darwin13 pytorch_dlprim % find . -name libpt_ocl.so -ls
torch.ops.load_library("build/libpt_ocl.so")
Change it to build/libpt_ocl.dylib
- on Mac, shared objects are called dylib
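The rename above can be generalized into a small portability helper (the pt_ocl library name is from this thread; the helper itself is just an illustration):

```python
import sys

def shared_lib_name(stem, platform=sys.platform):
    """Map a library stem to the platform's shared-object file name."""
    if platform == "darwin":
        return f"{stem}.dylib"   # macOS convention
    if platform.startswith("win"):
        return f"{stem}.dll"     # Windows convention
    return f"{stem}.so"          # Linux and most Unixes

# e.g. torch.ops.load_library("build/" + shared_lib_name("libpt_ocl"))
```

This way the mnist.py call site works unchanged on both Linux and macOS builds.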
% python mnist.py --device ocl:0
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|████████████████████████████| 9912422/9912422 [00:02<00:00, 3750281.89it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
100%|█████████████████████████████████| 28881/28881 [00:00<00:00, 268680.53it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|████████████████████████████| 1648877/1648877 [00:00<00:00, 2330785.83it/s]
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|█████████████████████████████████| 4542/4542 [00:00<00:00, 12141828.41it/s]
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
Using device: ocl:0
Accessing device #0:Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz on Apple
Traceback (most recent call last):
File "/Users/davidlaxer/pytorch_dlprim/mnist.py", line 162, in <module>
main()
File "/Users/davidlaxer/pytorch_dlprim/mnist.py", line 153, in main
train(args, model, device, train_loader, optimizer, epoch)
File "/Users/davidlaxer/pytorch_dlprim/mnist.py", line 53, in train
output = model(data)
File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1488, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/davidlaxer/pytorch_dlprim/mnist.py", line 29, in forward
x = self.conv1(x)
File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1488, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: clEnqueueNDRangeKernel
The 'mnist.py' test program completes successfully with --device 'mps' and with 'cpu'.
% python mnist.py --device mps
Using device: mps
Train Epoch: 1 [0/60000 (0%)] Loss: 2.326377
...
Train Epoch: 5 [59520/60000 (99%)] Loss: 0.000508
Epoch in 25.6s
Test set: Average loss: 0.0289, Accuracy: 9911/10000 (99%)
Done
Looks like the first device is actually the CPU. You should have other platforms/devices for the GPU, assuming that OpenCL drivers are installed for the GPU.
What is the output of clinfo -l or clinfo?
Correct!
% python mnist.py --device ocl:1
Using device: ocl:1
Accessing device #1:AMD Radeon Pro 5700 XT Compute Engine on Apple
Train Epoch: 1 [0/60000 (0%)] Loss: 2.326377
Train Epoch: 1 [640/60000 (1%)] Loss: 1.373414
Train Epoch: 1 [1280/60000 (2%)] Loss: 0.674242
Train Epoch: 1 [1920/60000 (3%)] Loss: 0.342660
...
Train Epoch: 5 [58240/60000 (97%)] Loss: 0.005476
Train Epoch: 5 [58880/60000 (98%)] Loss: 0.002447
Train Epoch: 5 [59520/60000 (99%)] Loss: 0.000584
Epoch in 9.4s
Test set: Average loss: 0.0287, Accuracy: 9900/10000 (99%)
Done
It's faster than 'mps'.
Great!
What is 'mps' device?
I 'hacked' /Users/davidlaxer/pytorch_dlprim/dlprimitives/include/dlprim/opencl_include.hpp:
# ifdef __APPLE__
//# include <OpenCL/cl2.hpp>
# include <CL/cl2.hpp>
Ohhh... that is interesting. Probably I'll need to add a special case for header detection of both OpenCL/cl2.hpp and CL/cl2.hpp.
Thanks.
Can you build dlprimitives (outside of pytorch) and run some tests to see that it works properly?
(Note: you'll probably need to set a cmake parameter, something like TEST_DEV=1:0
for platform 1, device 0, according to clinfo -l.)
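A standalone build along those lines might look like the following command fragment (the repository URL, OCL_PATH value, and TEST_DEV value are assumptions based on this thread; adjust TEST_DEV to your own clinfo -l output):

```shell
# Build dlprimitives on its own, outside the pytorch_dlprim tree.
git clone https://github.com/artyom-beilis/dlprimitives.git
cd dlprimitives && mkdir build && cd build

# TEST_DEV=1:0 selects platform 1, device 0 (check with `clinfo -l`).
cmake -DOCL_PATH=/Users/davidlaxer/OpenCL-CLHPP/include -DTEST_DEV=1:0 ..
make
ctest    # run the bundled tests against the selected device
```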
'MPS' is Apple's Metal Performance Shaders framework. PyTorch now supports 'MPS' as a backend (alongside CUDA, etc.).
https://pytorch.org/docs/stable/notes/mps.html
How do I run test.py? In test.py, when I set device='opencl:1', I get this exception:
% python test.py
Traceback (most recent call last):
File "/Users/davidlaxer/pytorch_dlprim/test.py", line 32, in <module>
grid_dev = grid_src.detach().clone().to(dev)
RuntimeError: 0 INTERNAL ASSERT FAILED at "/Users/davidlaxer/pytorch/c10/core/TensorOptions.h":659, please report a bug to PyTorch. This is a grandfathered Caffe2 device type opencl, it shouldn't ever convert to a DispatchKey. File a bug describing what you were doing if you think this is in error.
From iPython:
% ipython
Python 3.10.9 (main, Jan 11 2023, 09:18:20) [Clang 14.0.6 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.7.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import torch
In [2]: import numpy as np
In [3]: probs = torch.tensor(np.loadtxt("/Users/davidlaxer/minGPT/probs0.txt"),
...: dtype=torch.float32, device='opencl:1')
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 probs = torch.tensor(np.loadtxt("/Users/davidlaxer/minGPT/probs0.txt"), dtype=torch.float32, device='opencl:1')
RuntimeError: 0 INTERNAL ASSERT FAILED at "/Users/davidlaxer/pytorch/c10/core/TensorOptions.h":659, please report a bug to PyTorch. This is a grandfathered Caffe2 device type opencl, it shouldn't ever convert to a DispatchKey. File a bug describing what you were doing if you think this is in error.
% ./clinfo
Number of platforms 1
Platform Name Apple
Platform Vendor Apple
Platform Version OpenCL 1.2 (Dec 16 2022 20:35:20)
Platform Profile FULL_PROFILE
Platform Extensions cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event
Platform Name Apple
Number of devices 2
Device Name Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
Device Vendor Intel
Device Vendor ID 0xffffffff
Device Version OpenCL 1.2
Driver Version 1.1
Device OpenCL C Version OpenCL C 1.2
Device Type CPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 16
Max clock frequency 3800MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 1024x1x1
Max work group size 1024
Preferred work group size multiple (kernel) 1
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 2 / 2
half 0 / 0 (n/a)
float 4 / 4
double 2 / 2 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 137438953472 (128GiB)
Error Correction support No
Max memory allocation 34359738368 (32GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size 64
Global Memory cache line size 16777216 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 1 bytes
Pitch alignment for 2D image buffers 1 pixels
Max 2D image size 8192x8192 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Global
Local memory size 32768 (32KiB)
Max number of constant args 8
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 4096 (4KiB)
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
printf() buffer size 1048576 (1024KiB)
Built-in kernels (n/a)
Device Extensions cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_APPLE_fp64_basic_ops cl_APPLE_fixed_alpha_channel_orders cl_APPLE_biased_fixed_point_image_formats cl_APPLE_command_queue_priority
Device Name AMD Radeon Pro 5700 XT Compute Engine
Device Vendor AMD
Device Vendor ID 0x1021e00
Device Version OpenCL 1.2
Driver Version 1.2 (Jan 6 2023 19:45:55)
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 40
Max clock frequency 1499MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple (kernel) 32
Preferred / native vector sizes
char 4 / 4
short 2 / 2
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 32, Little-Endian
Global memory size 17163091968 (15.98GiB)
Error Correction support No
Max memory allocation 4290772992 (3.996GiB)
Unified memory for Host and Device No
Minimum alignment for any data type 128 bytes
Alignment of base address 32768 bits (4096 bytes)
Global Memory cache type None
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 65536 (64KiB)
Max number of constant args 8
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 10ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
printf() buffer size 134217728 (128MiB)
Built-in kernels (n/a)
Device Extensions cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_APPLE_command_queue_priority cl_APPLE_command_queue_select_compute_units cl_khr_fp64
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Apple
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [P0]
clCreateContext(NULL, ...) [default] Success [P0]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Apple
Device Name AMD Radeon Pro 5700 XT Compute Engine
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) Success (1)
Platform Name Apple
Device Name Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Apple
Device Name AMD Radeon Pro 5700 XT Compute Engine
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) Invalid device type for platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (2)
Platform Name Apple
Device Name AMD Radeon Pro 5700 XT Compute Engine
Device Name Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
(base) davidlaxer@x86_64-apple-darwin13 clinfo %
'MPS' is Apple's Metal Performance Shaders framework. PyTorch now supports 'MPS' as a backend (alongside CUDA, etc.).
Interesting. Can you run some benchmarks of the opencl vs. the metal backend? Here are some examples:
python dlprimitives/tools/validate_network.py --device privateuseone:0 --benchmark --train --model resnet50 --batch 32
python dlprimitives/tools/validate_network.py --device privateuseone:0 --benchmark --model resnet50 --batch 32
Please check these variants:
--model resnet18 --batch 64
--model mobilenet_v2 --batch 64
--model alexnet --batch 64
And of course run mps for comparison - it would be highly interesting to see how my results compare to Apple's Metal results.
In test.py, when I set device='opencl:1',
Because OpenCL support was once planned and 'opencl' is a reserved device type - but it was never realized. For an out-of-tree backend I can use the privateuseone device type, which I can rename to 'ocl' as well. But 'opencl' is reserved.
% python dlprimitives/tools/validate_network.py --device privateuseone:1 --benchmark --train --model resnet50 --batch 32
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Accessing device #1:AMD Radeon Pro 5700 XT Compute Engine on Apple
Warming up
Step -5 17386.135ms warming up
Step -4 969.884ms
Step -3 980.468ms
Step -2 982.220ms
Step -1 975.265ms
Step 0 975.861ms started
Step 1 980.682ms
Step 2 983.092ms
Step 3 981.701ms
Step 4 982.421ms
Step 5 979.700ms
Step 6 974.858ms
Step 7 979.512ms
Step 8 976.215ms
Step 9 975.671ms
Step 10 975.299ms
Step 11 977.323ms
Step 12 977.699ms
Step 13 972.227ms
Step 14 975.673ms
Step 15 984.858ms
Step 16 980.309ms
Step 17 980.738ms
Step 18 976.323ms
Step 19 976.702ms
Time per item 30.573 ms
Time fwd batch 213.335 ms
Time bwd batch 765.008 ms
Time io batch 3.127 ms
Time zro batch 0.000 ms
Time opt batch 0.000 ms
Time per batch 978.343 ms
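As a sanity check on the numbers reported by validate_network.py, "Time per item" should be "Time per batch" divided by the batch size, and the phase timings should roughly add up to the batch time (values copied from the training run above; the slack comes from rounding and separately measured sections):

```python
# Figures copied from the privateuseone:1 resnet50 --train run above.
batch_size = 32
time_per_batch = 978.343                      # ms, reported "Time per batch"

# Per-item time is just the batch time amortized over the batch.
time_per_item = time_per_batch / batch_size   # ~30.573 ms, matching "Time per item"

# fwd, bwd, io, zro, opt phase timings; their sum (~981.5 ms) is close
# to, but not exactly, the reported 978.343 ms per batch.
phases = [213.335, 765.008, 3.127, 0.0, 0.0]
phase_sum = sum(phases)
```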
% python dlprimitives/tools/validate_network.py --device privateuseone:1 --benchmark --model resnet50 --batch 32
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Accessing device #1:AMD Radeon Pro 5700 XT Compute Engine on Apple
Warming up
Step -5 192.147ms warming up
Step -4 184.328ms
Step -3 187.661ms
Step -2 186.471ms
Step -1 186.116ms
Step 0 186.624ms started
Step 1 186.961ms
Step 2 185.654ms
Step 3 186.042ms
Step 4 186.223ms
Step 5 186.441ms
Step 6 185.965ms
Step 7 186.380ms
Step 8 184.538ms
Step 9 186.350ms
Step 10 186.702ms
Step 11 186.063ms
Step 12 185.345ms
Step 13 184.857ms
Step 14 185.585ms
Step 15 186.010ms
Step 16 185.542ms
Step 17 186.767ms
Step 18 186.188ms
Step 19 186.086ms
Time per item 5.813 ms
Time per batch 186.016 ms
% python dlprimitives/tools/validate_network.py --device mps --benchmark --train --model resnet50 --batch 32
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Warming up
Step -5 10277.323ms warming up
Step -4 401.347ms
Step -3 522.126ms
Step -2 661.028ms
Step -1 659.682ms
Step 0 656.417ms started
Step 1 656.925ms
Step 2 656.745ms
Step 3 654.513ms
Step 4 657.099ms
Step 5 657.073ms
Step 6 655.817ms
Step 7 657.810ms
Step 8 653.340ms
Step 9 654.563ms
Step 10 660.629ms
Step 11 655.051ms
Step 12 660.242ms
Step 13 654.396ms
Step 14 658.860ms
Step 15 656.098ms
Step 16 654.666ms
Step 17 653.854ms
Step 18 657.085ms
Step 19 655.118ms
Time per item 20.510 ms
Time fwd batch 570.076 ms
Time bwd batch 86.239 ms
Time io batch 515.912 ms
Time zro batch 0.000 ms
Time opt batch 0.000 ms
Time per batch 656.315 ms
% python dlprimitives/tools/validate_network.py --device mps --benchmark --model resnet50 --batch 32
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Warming up
Step -5 882.587ms warming up
Step -4 86.313ms
Step -3 83.462ms
Step -2 83.408ms
Step -1 82.453ms
Step 0 82.748ms started
Step 1 82.843ms
Step 2 82.410ms
Step 3 82.929ms
Step 4 82.597ms
Step 5 82.722ms
Step 6 82.410ms
Step 7 82.069ms
Step 8 83.158ms
Step 9 81.898ms
Step 10 83.134ms
Step 11 83.021ms
Step 12 83.030ms
Step 13 82.746ms
Step 14 82.470ms
Step 15 82.466ms
Step 16 82.949ms
Step 17 83.567ms
Step 18 83.144ms
Step 19 83.164ms
Time per item 2.587 ms
Time per batch 82.774 ms
% python dlprimitives/tools/validate_network.py --device privateuseone:1 --benchmark --train --model alexnet --batch 64
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /Users/davidlaxer/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
100%|████████████████████████████████████████| 233M/233M [01:01<00:00, 4.00MB/s]
Accessing device #1:AMD Radeon Pro 5700 XT Compute Engine on Apple
Warming up
Step -5 5249.255ms warming up
Step -4 214.265ms
Step -3 212.626ms
Step -2 212.994ms
Step -1 212.822ms
Step 0 213.492ms started
Step 1 214.159ms
Step 2 213.421ms
Step 3 212.886ms
Step 4 212.985ms
Step 5 213.019ms
Step 6 214.607ms
Step 7 213.117ms
Step 8 212.957ms
Step 9 213.335ms
Step 10 213.071ms
Step 11 213.128ms
Step 12 212.807ms
Step 13 212.353ms
Step 14 212.722ms
Step 15 213.727ms
Step 16 213.121ms
Step 17 212.917ms
Step 18 212.744ms
Step 19 213.406ms
Time per item 3.331 ms
Time fwd batch 53.570 ms
Time bwd batch 159.629 ms
Time io batch 4.734 ms
Time zro batch 0.000 ms
Time opt batch 0.000 ms
Time per batch 213.199 ms
(AI-Feynman) davidlaxer@x86_64-apple-darwin13 pytorch_dlprim % python dlprimitives/tools/validate_network.py --device mps --benchmark --train --model alexnet --batch 64
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Warming up
Step -5 1817.901ms warming up
Step -4 159.780ms
Step -3 130.040ms
Step -2 170.132ms
Step -1 172.625ms
Step 0 173.081ms started
Step 1 169.357ms
Step 2 172.887ms
Step 3 173.597ms
Step 4 172.193ms
Step 5 172.748ms
Step 6 172.532ms
Step 7 171.034ms
Step 8 173.197ms
Step 9 174.854ms
Step 10 169.927ms
Step 11 172.000ms
Step 12 171.020ms
Step 13 173.021ms
Step 14 173.242ms
Step 15 175.806ms
Step 16 172.851ms
Step 17 170.872ms
Step 18 172.756ms
Step 19 170.169ms
Time per item 2.693 ms
Time fwd batch 161.963 ms
Time bwd batch 10.394 ms
Time io batch 155.241 ms
Time zro batch 0.000 ms
Time opt batch 0.000 ms
Time per batch 172.357 ms
% python dlprimitives/tools/validate_network.py --device privateuseone:1 --benchmark --train --model mobilenet_v2 --batch 64
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MobileNet_V2_Weights.IMAGENET1K_V1`. You can also use `weights=MobileNet_V2_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /Users/davidlaxer/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
100%|██████████████████████████████████████| 13.6M/13.6M [00:03<00:00, 4.05MB/s]
Accessing device #1:AMD Radeon Pro 5700 XT Compute Engine on Apple
Warming up
Step -5 13280.064ms warming up
Step -4 752.361ms
Step -3 739.524ms
Step -2 745.049ms
Step -1 745.150ms
Step 0 744.964ms started
Step 1 744.489ms
Step 2 742.961ms
Step 3 746.302ms
Step 4 743.279ms
Step 5 737.969ms
Step 6 744.280ms
Step 7 742.577ms
Step 8 741.941ms
Step 9 746.158ms
Step 10 744.912ms
Step 11 740.043ms
Step 12 737.633ms
Step 13 741.226ms
Step 14 741.452ms
Step 15 741.393ms
Step 16 739.347ms
Step 17 740.438ms
Step 18 741.509ms
Step 19 740.733ms
Time per item 11.597 ms
Time fwd batch 161.055 ms
Time bwd batch 581.125 ms
Time io batch 5.534 ms
Time zro batch 0.000 ms
Time opt batch 0.000 ms
Time per batch 742.180 ms
(AI-Feynman) davidlaxer@x86_64-apple-darwin13 pytorch_dlprim % python dlprimitives/tools/validate_network.py --device mps --benchmark --train --model mobilenet_v2 --batch 64
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MobileNet_V2_Weights.IMAGENET1K_V1`. You can also use `weights=MobileNet_V2_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Warming up
Step -5 9177.353ms warming up
Step -4 911.196ms
Step -3 2079.731ms
Step -2 2267.814ms
Step -1 2244.676ms
Step 0 2275.510ms started
Step 1 2261.282ms
Step 2 2257.297ms
Step 3 2263.499ms
Step 4 2258.673ms
Step 5 2266.446ms
Step 6 2280.577ms
Step 7 2253.364ms
Step 8 2246.077ms
Step 9 2288.277ms
Step 10 2287.250ms
Step 11 2255.581ms
Step 12 2267.239ms
Step 13 2268.039ms
Step 14 2283.474ms
Step 15 2269.081ms
Step 16 2266.289ms
Step 17 2273.217ms
Step 18 2283.813ms
Step 19 2276.918ms
Time per item 35.455 ms
Time fwd batch 2192.135 ms
Time bwd batch 76.960 ms
Time io batch 2137.610 ms
Time zro batch 0.000 ms
Time opt batch 0.000 ms
Time per batch 2269.095 ms
% python dlprimitives/tools/validate_network.py --device privateuseone:1 --benchmark --train --model resnet18 --batch 64
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /Users/davidlaxer/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████████████████████████████████| 44.7M/44.7M [00:11<00:00, 4.00MB/s]
Accessing device #1:AMD Radeon Pro 5700 XT Compute Engine on Apple
Warming up
Step -5 5812.478ms warming up
Step -4 862.062ms
Step -3 863.053ms
Step -2 862.100ms
Step -1 861.751ms
Step 0 859.801ms started
Step 1 861.398ms
Step 2 861.833ms
Step 3 858.887ms
Step 4 861.570ms
Step 5 860.864ms
Step 6 861.934ms
Step 7 862.494ms
Step 8 864.035ms
Step 9 858.804ms
Step 10 856.521ms
Step 11 856.596ms
Step 12 863.216ms
Step 13 861.385ms
Step 14 861.439ms
Step 15 859.234ms
Step 16 860.556ms
Step 17 861.595ms
Step 18 863.927ms
Step 19 860.451ms
Time per item 13.450 ms
Time fwd batch 166.798 ms
Time bwd batch 694.029 ms
Time io batch 5.395 ms
Time zro batch 0.000 ms
Time opt batch 0.000 ms
Time per batch 860.827 ms
(AI-Feynman) davidlaxer@x86_64-apple-darwin13 pytorch_dlprim % python dlprimitives/tools/validate_network.py --device mps --benchmark --train --model resnet18 --batch 64
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torchvision-0.15.0a0+8985b59-py3.10-macosx-10.9-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Warming up
Step -5 2292.475ms warming up
Step -4 407.617ms
Step -3 539.161ms
Step -2 589.164ms
Step -1 591.197ms
Step 0 588.727ms started
Step 1 590.627ms
Step 2 598.048ms
Step 3 590.203ms
Step 4 587.857ms
Step 5 590.700ms
Step 6 589.184ms
Step 7 589.716ms
Step 8 597.186ms
Step 9 590.756ms
Step 10 595.828ms
Step 11 598.534ms
Step 12 593.455ms
Step 13 596.133ms
Step 14 588.126ms
Step 15 595.137ms
Step 16 591.288ms
Step 17 590.055ms
Step 18 592.019ms
Step 19 582.737ms
Time per item 9.247 ms
Time fwd batch 560.283 ms
Time bwd batch 31.532 ms
Time io batch 537.801 ms
Time zro batch 0.000 ms
Time opt batch 0.000 ms
Time per batch 591.816 ms
There were two errors building dlprimitives, in:
vi +112 /Users/davidlaxer/pytorch_dlprim/dlprimitives/src/importers/onnx.cpp
vi +50 /Users/davidlaxer/anaconda3/envs/AI-Feynman/include/boost/python/object/make_instance.hpp
Here are the results of running `make test`:
Running tests...
/opt/local/bin/ctest --force-new-ctest-process
Test project /Users/davidlaxer/pytorch_dlprim/dlprimitives/build
Start 1: test_test_case_abs
1/33 Test #1: test_test_case_abs ............... Passed 1.12 sec
Start 2: test_test_case_activation
2/33 Test #2: test_test_case_activation ........***Failed 0.09 sec
Start 3: test_test_case_batchnorm
3/33 Test #3: test_test_case_batchnorm .........***Failed 0.42 sec
Start 4: test_test_case_concat
4/33 Test #4: test_test_case_concat ............ Passed 0.09 sec
Start 5: test_test_case_conv2d
5/33 Test #5: test_test_case_conv2d ............***Failed 0.77 sec
Start 6: test_test_case_conv2d_dsc
6/33 Test #6: test_test_case_conv2d_dsc ........***Failed 0.41 sec
Start 7: test_test_case_conv2d_gemm
7/33 Test #7: test_test_case_conv2d_gemm .......***Failed 0.42 sec
Start 8: test_test_case_conv2d_win
8/33 Test #8: test_test_case_conv2d_win ........***Failed 0.42 sec
Start 9: test_test_case_elementwise
9/33 Test #9: test_test_case_elementwise .......***Failed 0.53 sec
Start 10: test_test_case_global_pooling
10/33 Test #10: test_test_case_global_pooling ....***Failed 0.12 sec
Start 11: test_test_case_hardtanh
11/33 Test #11: test_test_case_hardtanh .......... Passed 0.63 sec
Start 12: test_test_case_inner_product
12/33 Test #12: test_test_case_inner_product .....***Failed 0.20 sec
Start 13: test_test_case_log_softmax
13/33 Test #13: test_test_case_log_softmax .......***Failed 0.09 sec
Start 14: test_test_case_mse_loss
14/33 Test #14: test_test_case_mse_loss .......... Passed 0.27 sec
Start 15: test_test_case_nll_loss
15/33 Test #15: test_test_case_nll_loss ..........***Failed 0.24 sec
Start 16: test_test_case_param
16/33 Test #16: test_test_case_param ............. Passed 0.11 sec
Start 17: test_test_case_pooling2d
17/33 Test #17: test_test_case_pooling2d .........***Failed 0.08 sec
Start 18: test_test_case_reduction
18/33 Test #18: test_test_case_reduction .........***Failed 0.18 sec
Start 19: test_test_case_slice
19/33 Test #19: test_test_case_slice .............***Failed 0.05 sec
Start 20: test_test_case_softmax
20/33 Test #20: test_test_case_softmax ...........***Failed 0.09 sec
Start 21: test_test_case_softmax_loss
21/33 Test #21: test_test_case_softmax_loss ......***Failed 0.15 sec
Start 22: test_test_case_threshold
22/33 Test #22: test_test_case_threshold ......... Passed 0.56 sec
Start 23: test_test_case_tr_conv2d
23/33 Test #23: test_test_case_tr_conv2d .........***Failed 0.70 sec
Start 24: test_test_case_tr_conv2d_dsc
24/33 Test #24: test_test_case_tr_conv2d_dsc .....***Failed 0.66 sec
Start 25: test_test_case_tr_conv2d_gemm
25/33 Test #25: test_test_case_tr_conv2d_gemm ....***Failed 0.65 sec
Start 26: test_test_case_tr_conv2d_win
26/33 Test #26: test_test_case_tr_conv2d_win .....***Failed 0.66 sec
Start 27: test_net
27/33 Test #27: test_net .........................Subprocess aborted***Exception: 0.28 sec
Start 28: test_net_nonopt
28/33 Test #28: test_net_nonopt ..................Subprocess aborted***Exception: 0.05 sec
Start 29: test_json
29/33 Test #29: test_json ........................ Passed 0.13 sec
Start 30: test_random
30/33 Test #30: test_random ...................... Passed 0.30 sec
Start 31: test_context
31/33 Test #31: test_context ..................... Passed 0.18 sec
Start 32: test_util
32/33 Test #32: test_util ........................***Failed 0.30 sec
Start 33: test_broadcast_reduce
33/33 Test #33: test_broadcast_reduce ............***Failed 0.17 sec
27% tests passed, 24 tests failed out of 33
Total Test time (real) = 11.08 sec
The following tests FAILED:
2 - test_test_case_activation (Failed)
3 - test_test_case_batchnorm (Failed)
5 - test_test_case_conv2d (Failed)
6 - test_test_case_conv2d_dsc (Failed)
7 - test_test_case_conv2d_gemm (Failed)
8 - test_test_case_conv2d_win (Failed)
9 - test_test_case_elementwise (Failed)
10 - test_test_case_global_pooling (Failed)
12 - test_test_case_inner_product (Failed)
13 - test_test_case_log_softmax (Failed)
15 - test_test_case_nll_loss (Failed)
17 - test_test_case_pooling2d (Failed)
18 - test_test_case_reduction (Failed)
19 - test_test_case_slice (Failed)
20 - test_test_case_softmax (Failed)
21 - test_test_case_softmax_loss (Failed)
23 - test_test_case_tr_conv2d (Failed)
24 - test_test_case_tr_conv2d_dsc (Failed)
25 - test_test_case_tr_conv2d_gemm (Failed)
26 - test_test_case_tr_conv2d_win (Failed)
27 - test_net (Subprocess aborted)
28 - test_net_nonopt (Subprocess aborted)
32 - test_util (Failed)
33 - test_broadcast_reduce (Failed)
Errors while running CTest
Output from these tests are in: /Users/davidlaxer/pytorch_dlprim/dlprimitives/build/Testing/Temporary/LastTest.log
[LastTest.log](https://github.com/artyom-beilis/pytorch_dlprim/files/10512218/LastTest.log)
% dlprim_flops 0:1 0.5
Testing on AMD Radeon Pro 5700 XT Compute Engine on Apple
Testing memory speed
- Vector size 1
-- Warming
-- Running 28.4175 GB/s
- Vector size 2
-- Warming
-- Running 46.5139 GB/s
- Vector size 4
-- Warming
-- Running 202.604 GB/s
- Vector size 8
-- Warming
-- Running 207.637 GB/s
- Vector size 16
-- Warming
-- Running 192.703 GB/s
Testing flops float
- Vector size 1
-- Warming
-- Running 3318.61 GFlops
- Vector size 2
-- Warming
-- Running 3315.52 GFlops
- Vector size 4
-- Warming
-- Running 3244.98 GFlops
- Vector size 8
-- Warming
-- Running 2923.09 GFlops
- Vector size 16
-- Warming
-- Running 2579.09 GFlops
Summray for AMD Radeon Pro 5700 XT Compute Engine on Apple
Peak GFlops for float 3318.61
Peak memory 207.637 GB/s
GEMM
NN 0: 512, 512, 512 762.1 GFlops (22.97%) 8.9 GB/s ( 4.64%) limited by gflops 22.97%
NN 1: 1024, 1024, 1024 2067.6 GFlops (62.30%) 12.1 GB/s ( 6.29%) limited by gflops 62.30%
NN 2: 1025, 1025, 1025 1588.5 GFlops (47.87%) 9.3 GB/s ( 4.83%) limited by gflops 47.87%
NN 3: 2048, 2048, 2048 2503.3 GFlops (75.43%) 7.3 GB/s ( 3.81%) limited by gflops 75.43%
NN 4: 2049, 2049, 2049 2505.1 GFlops (75.49%) 7.3 GB/s ( 3.81%) limited by gflops 75.49%
NN 5: 64, 2048, 64 422.1 GFlops (12.72%) 27.0 GB/s (14.01%) limited by memory 14.01%
NN 6: 2048, 64, 2048 1406.6 GFlops (42.39%) 46.7 GB/s (24.24%) limited by gflops 42.39%
NN 7: 2048, 2048, 64 1159.8 GFlops (34.95%) 38.8 GB/s (20.14%) limited by gflops 34.95%
NN 8: 2048, 64, 64 404.9 GFlops (12.20%) 25.9 GB/s (13.44%) limited by memory 13.44%
NN 9: 64, 2048, 2048 1786.9 GFlops (53.84%) 59.3 GB/s (30.80%) limited by gflops 53.84%
NN 10: 64, 64, 2048 91.1 GFlops ( 2.75%) 5.8 GB/s ( 3.00%) limited by memory 3.00%
NT 0: 512, 512, 512 712.3 GFlops (21.46%) 8.4 GB/s ( 4.34%) limited by gflops 21.46%
NT 1: 1024, 1024, 1024 1767.3 GFlops (53.25%) 10.4 GB/s ( 5.38%) limited by gflops 53.25%
NT 2: 1025, 1025, 1025 1589.2 GFlops (47.89%) 9.3 GB/s ( 4.83%) limited by gflops 47.89%
NT 3: 2048, 2048, 2048 2214.6 GFlops (66.73%) 6.5 GB/s ( 3.37%) limited by gflops 66.73%
NT 4: 2049, 2049, 2049 2524.4 GFlops (76.07%) 7.4 GB/s ( 3.84%) limited by gflops 76.07%
NT 5: 64, 2048, 64 452.6 GFlops (13.64%) 29.0 GB/s (15.03%) limited by memory 15.03%
NT 6: 2048, 64, 2048 1200.0 GFlops (36.16%) 39.9 GB/s (20.68%) limited by gflops 36.16%
NT 7: 2048, 2048, 64 1136.3 GFlops (34.24%) 38.0 GB/s (19.73%) limited by gflops 34.24%
NT 8: 2048, 64, 64 439.5 GFlops (13.24%) 28.1 GB/s (14.59%) limited by memory 14.59%
NT 9: 64, 2048, 2048 1463.2 GFlops (44.09%) 48.6 GB/s (25.22%) limited by gflops 44.09%
NT 10: 64, 64, 2048 80.0 GFlops ( 2.41%) 5.1 GB/s ( 2.64%) limited by memory 2.64%
TN 0: 512, 512, 512 877.7 GFlops (26.45%) 10.3 GB/s ( 5.34%) limited by gflops 26.45%
TN 1: 1024, 1024, 1024 2222.2 GFlops (66.96%) 13.0 GB/s ( 6.76%) limited by gflops 66.96%
TN 2: 1025, 1025, 1025 1559.0 GFlops (46.98%) 9.1 GB/s ( 4.74%) limited by gflops 46.98%
TN 3: 2048, 2048, 2048 2737.5 GFlops (82.49%) 8.0 GB/s ( 4.16%) limited by gflops 82.49%
TN 4: 2049, 2049, 2049 2476.7 GFlops (74.63%) 7.3 GB/s ( 3.76%) limited by gflops 74.63%
TN 5: 64, 2048, 64 414.2 GFlops (12.48%) 26.5 GB/s (13.75%) limited by memory 13.75%
TN 6: 2048, 64, 2048 1805.5 GFlops (54.41%) 60.0 GB/s (31.12%) limited by gflops 54.41%
TN 7: 2048, 2048, 64 1160.7 GFlops (34.98%) 38.8 GB/s (20.16%) limited by gflops 34.98%
TN 8: 2048, 64, 64 385.2 GFlops (11.61%) 24.6 GB/s (12.79%) limited by memory 12.79%
TN 9: 64, 2048, 2048 1840.2 GFlops (55.45%) 61.1 GB/s (31.71%) limited by gflops 55.45%
TN 10: 64, 64, 2048 97.0 GFlops ( 2.92%) 6.2 GB/s ( 3.20%) limited by memory 3.20%
TT 0: 512, 512, 512 797.1 GFlops (24.02%) 9.4 GB/s ( 4.85%) limited by gflops 24.02%
TT 1: 1024, 1024, 1024 2115.2 GFlops (63.74%) 12.4 GB/s ( 6.43%) limited by gflops 63.74%
TT 2: 1025, 1025, 1025 1583.3 GFlops (47.71%) 9.3 GB/s ( 4.81%) limited by gflops 47.71%
TT 3: 2048, 2048, 2048 2633.0 GFlops (79.34%) 7.7 GB/s ( 4.00%) limited by gflops 79.34%
TT 4: 2049, 2049, 2049 2514.8 GFlops (75.78%) 7.4 GB/s ( 3.82%) limited by gflops 75.78%
TT 5: 64, 2048, 64 432.7 GFlops (13.04%) 27.7 GB/s (14.37%) limited by memory 14.37%
TT 6: 2048, 64, 2048 1728.6 GFlops (52.09%) 57.4 GB/s (29.79%) limited by gflops 52.09%
TT 7: 2048, 2048, 64 1154.7 GFlops (34.79%) 38.6 GB/s (20.05%) limited by gflops 34.79%
TT 8: 2048, 64, 64 425.8 GFlops (12.83%) 27.2 GB/s (14.14%) limited by memory 14.14%
TT 9: 64, 2048, 2048 1492.1 GFlops (44.96%) 49.6 GB/s (25.71%) limited by gflops 44.96%
TT 10: 64, 64, 2048 84.6 GFlops ( 2.55%) 5.4 GB/s ( 2.79%) limited by memory 2.79%
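The utilization percentages in the GEMM table are each measured rate divided by the peaks reported above (3318.61 GFlops, 207.637 GB/s), and the "limited by" tag is whichever ratio is higher. A minimal sketch of that calculation (note the table rounds GB/s to one decimal, so recomputed memory percentages can differ slightly from the printed ones):

```python
# Reproduce the utilization / "limited by" figures from the GEMM table.
PEAK_GFLOPS = 3318.61   # peak float GFlops reported by dlprim_flops
PEAK_GBPS = 207.637     # peak memory bandwidth reported by dlprim_flops

def utilization(gflops, gbps):
    """Return (compute %, memory %, limiting factor) relative to peak."""
    flop_pct = 100.0 * gflops / PEAK_GFLOPS
    mem_pct = 100.0 * gbps / PEAK_GBPS
    limiter = "gflops" if flop_pct >= mem_pct else "memory"
    return flop_pct, mem_pct, limiter

# NN 0: 512, 512, 512 from the table above (762.1 GFlops, 8.9 GB/s)
flop_pct, mem_pct, limiter = utilization(762.1, 8.9)
print(f"{flop_pct:.2f}% flops, {mem_pct:.2f}% memory, limited by {limiter}")
```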
Convolution
0 effnet forward b=64 k=3 p=1 s=1 in=480 out=480 g=480 D=14 247.8 GFlops ( 7.47%) 110.2 GB/s (57.18%) limited by memory 57.18% algo=depthwise_separable
0 effnet bwd-data b=64 k=3 p=1 s=1 in=480 out=480 g=480 D=14 98.2 GFlops ( 2.96%) 43.7 GB/s (22.65%) limited by memory 22.65% algo=depthwise_separable
0 effnet bwd-filt b=64 k=3 p=1 s=1 in=480 out=480 g=480 D=14 12.8 GFlops ( 0.38%) 5.7 GB/s ( 2.94%) limited by memory 2.94% algo=depthwise_separable
1 alexnet forward b=64 k=11 p=2 s=4 in=3 out=64 g=1 D=224 1530.8 GFlops (46.13%) 15.0 GB/s ( 7.79%) limited by gflops 46.13% algo=gemm
1 alexnet bwd-data b=64 k=11 p=2 s=4 in=3 out=64 g=1 D=224 1071.8 GFlops (32.30%) 10.5 GB/s ( 5.45%) limited by gflops 32.30% algo=gemm
1 alexnet bwd-filt b=64 k=11 p=2 s=4 in=3 out=64 g=1 D=224 204.1 GFlops ( 6.15%) 2.0 GB/s ( 1.04%) limited by gflops 6.15% algo=gemm
2 alexnet forward b=64 k=5 p=2 s=1 in=96 out=192 g=2 D=27 1615.8 GFlops (48.69%) 4.1 GB/s ( 2.13%) limited by gflops 48.69% algo=gemm
2 alexnet bwd-data b=64 k=5 p=2 s=1 in=96 out=192 g=2 D=27 1788.1 GFlops (53.88%) 4.5 GB/s ( 2.36%) limited by gflops 53.88% algo=gemm
2 alexnet bwd-filt b=64 k=5 p=2 s=1 in=96 out=192 g=2 D=27 1006.9 GFlops (30.34%) 2.6 GB/s ( 1.35%) limited by gflops 30.34% algo=gemm
3 alexnet forward b=64 k=5 p=2 s=1 in=64 out=192 g=1 D=27 2052.4 GFlops (61.85%) 3.5 GB/s ( 1.82%) limited by gflops 61.85% algo=gemm
3 alexnet bwd-data b=64 k=5 p=2 s=1 in=64 out=192 g=1 D=27 2349.6 GFlops (70.80%) 4.0 GB/s ( 2.08%) limited by gflops 70.80% algo=gemm
3 alexnet bwd-filt b=64 k=5 p=2 s=1 in=64 out=192 g=1 D=27 917.5 GFlops (27.65%) 1.6 GB/s ( 0.83%) limited by gflops 27.65% algo=gemm
4 alexnet forward b=64 k=3 p=1 s=1 in=384 out=256 g=1 D=13 2039.5 GFlops (61.46%) 3.3 GB/s ( 1.73%) limited by gflops 61.46% algo=gemm
4 alexnet bwd-data b=64 k=3 p=1 s=1 in=384 out=256 g=1 D=13 2539.4 GFlops (76.52%) 4.1 GB/s ( 2.15%) limited by gflops 76.52% algo=gemm
4 alexnet bwd-filt b=64 k=3 p=1 s=1 in=384 out=256 g=1 D=13 1459.0 GFlops (43.96%) 2.7 GB/s ( 1.38%) limited by gflops 43.96% algo=gemm
5 resnet forward b=64 k=7 p=3 s=2 in=3 out=64 g=1 D=224 1500.8 GFlops (45.22%) 24.3 GB/s (12.59%) limited by gflops 45.22% algo=gemm
5 resnet bwd-data b=64 k=7 p=3 s=2 in=3 out=64 g=1 D=224 1095.0 GFlops (33.00%) 17.7 GB/s ( 9.18%) limited by gflops 33.00% algo=gemm
5 resnet bwd-filt b=64 k=7 p=3 s=2 in=3 out=64 g=1 D=224 89.0 GFlops ( 2.68%) 1.4 GB/s ( 0.75%) limited by gflops 2.68% algo=gemm
6 resnet forward b=64 k=1 p=0 s=1 in=64 out=256 g=1 D=56 1753.6 GFlops (52.84%) 68.5 GB/s (35.56%) limited by gflops 52.84% algo=gemm
6 resnet bwd-data b=64 k=1 p=0 s=1 in=64 out=256 g=1 D=56 2429.6 GFlops (73.21%) 94.9 GB/s (49.26%) limited by gflops 73.21% algo=gemm
6 resnet bwd-filt b=64 k=1 p=0 s=1 in=64 out=256 g=1 D=56 320.3 GFlops ( 9.65%) 12.5 GB/s ( 6.50%) limited by gflops 9.65% algo=gemm
7 resnet forward b=64 k=1 p=0 s=1 in=64 out=64 g=1 D=56 1768.9 GFlops (53.30%) 110.6 GB/s (57.38%) limited by memory 57.38% algo=gemm
7 resnet bwd-data b=64 k=1 p=0 s=1 in=64 out=64 g=1 D=56 1308.3 GFlops (39.42%) 81.8 GB/s (42.44%) limited by memory 42.44% algo=gemm
7 resnet bwd-filt b=64 k=1 p=0 s=1 in=64 out=64 g=1 D=56 79.5 GFlops ( 2.39%) 5.0 GB/s ( 2.58%) limited by memory 2.58% algo=gemm
8 resnet forward b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=56 1593.1 GFlops (48.01%) 11.1 GB/s ( 5.75%) limited by gflops 48.01% algo=gemm
8 resnet bwd-data b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=56 1387.4 GFlops (41.81%) 9.6 GB/s ( 5.01%) limited by gflops 41.81% algo=gemm
8 resnet bwd-filt b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=56 357.9 GFlops (10.78%) 2.5 GB/s ( 1.29%) limited by gflops 10.78% algo=gemm
9 resnet forward b=64 k=1 p=0 s=2 in=1024 out=2048 g=1 D=14 1816.9 GFlops (54.75%) 11.8 GB/s ( 6.13%) limited by gflops 54.75% algo=gemm
9 resnet bwd-data b=64 k=1 p=0 s=2 in=1024 out=2048 g=1 D=14 2231.4 GFlops (67.24%) 14.5 GB/s ( 7.52%) limited by gflops 67.24% algo=gemm
9 resnet bwd-filt b=64 k=1 p=0 s=2 in=1024 out=2048 g=1 D=14 1888.2 GFlops (56.90%) 13.5 GB/s ( 6.99%) limited by gflops 56.90% algo=gemm
10 resnet forward b=64 k=1 p=0 s=1 in=1024 out=256 g=1 D=14 2527.9 GFlops (76.17%) 25.1 GB/s (13.02%) limited by gflops 76.17% algo=gemm
10 resnet bwd-data b=64 k=1 p=0 s=1 in=1024 out=256 g=1 D=14 2365.8 GFlops (71.29%) 23.5 GB/s (12.18%) limited by gflops 71.29% algo=gemm
10 resnet bwd-filt b=64 k=1 p=0 s=1 in=1024 out=256 g=1 D=14 878.5 GFlops (26.47%) 8.9 GB/s ( 4.60%) limited by gflops 26.47% algo=gemm
11 resnet forward b=64 k=3 p=1 s=1 in=256 out=256 g=1 D=14 2264.9 GFlops (68.25%) 4.3 GB/s ( 2.23%) limited by gflops 68.25% algo=gemm
11 resnet bwd-data b=64 k=3 p=1 s=1 in=256 out=256 g=1 D=14 2451.8 GFlops (73.88%) 4.6 GB/s ( 2.41%) limited by gflops 73.88% algo=gemm
11 resnet bwd-filt b=64 k=3 p=1 s=1 in=256 out=256 g=1 D=14 1604.9 GFlops (48.36%) 3.3 GB/s ( 1.71%) limited by gflops 48.36% algo=gemm
12 vgg forward b=64 k=3 p=1 s=1 in=3 out=64 g=1 D=224 1078.7 GFlops (32.51%) 83.7 GB/s (43.41%) limited by memory 43.41% algo=gemm
12 vgg bwd-data b=64 k=3 p=1 s=1 in=3 out=64 g=1 D=224 463.6 GFlops (13.97%) 36.0 GB/s (18.66%) limited by memory 18.66% algo=gemm
12 vgg bwd-filt b=64 k=3 p=1 s=1 in=3 out=64 g=1 D=224 33.4 GFlops ( 1.01%) 2.6 GB/s ( 1.34%) limited by memory 1.34% algo=gemm
13 vgg forward b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=224 1574.0 GFlops (47.43%) 10.9 GB/s ( 5.67%) limited by gflops 47.43% algo=gemm
13 vgg bwd-data b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=224 1069.4 GFlops (32.22%) 7.4 GB/s ( 3.85%) limited by gflops 32.22% algo=gemm
13 vgg bwd-filt b=64 k=3 p=1 s=1 in=64 out=64 g=1 D=224 342.8 GFlops (10.33%) 2.4 GB/s ( 1.24%) limited by gflops 10.33% algo=gemm
14 vgg forward b=64 k=3 p=1 s=1 in=512 out=512 g=1 D=28 2474.8 GFlops (74.57%) 2.2 GB/s ( 1.17%) limited by gflops 74.57% algo=gemm
14 vgg bwd-data b=64 k=3 p=1 s=1 in=512 out=512 g=1 D=28 2835.9 GFlops (85.46%) 2.6 GB/s ( 1.34%) limited by gflops 85.46% algo=gemm
14 vgg bwd-filt b=64 k=3 p=1 s=1 in=512 out=512 g=1 D=28 1990.9 GFlops (59.99%) 1.9 GB/s ( 0.98%) limited by gflops 59.99% algo=gemm
15 mobile forward b=64 k=3 p=1 s=2 in=3 out=32 g=1 D=224 985.9 GFlops (29.71%) 100.4 GB/s (52.11%) limited by memory 52.11% algo=gemm
15 mobile bwd-data b=64 k=3 p=1 s=2 in=3 out=32 g=1 D=224 343.8 GFlops (10.36%) 35.0 GB/s (18.17%) limited by memory 18.17% algo=gemm
15 mobile bwd-filt b=64 k=3 p=1 s=2 in=3 out=32 g=1 D=224 15.9 GFlops ( 0.48%) 1.6 GB/s ( 0.84%) limited by memory 0.84% algo=gemm
16 mobile forward b=64 k=3 p=1 s=1 in=144 out=144 g=144 D=56 291.5 GFlops ( 8.78%) 129.6 GB/s (67.23%) limited by memory 67.23% algo=depthwise_separable
16 mobile bwd-data b=64 k=3 p=1 s=1 in=144 out=144 g=144 D=56 72.0 GFlops ( 2.17%) 32.0 GB/s (16.60%) limited by memory 16.60% algo=depthwise_separable
16 mobile bwd-filt b=64 k=3 p=1 s=1 in=144 out=144 g=144 D=56 69.2 GFlops ( 2.08%) 30.8 GB/s (15.96%) limited by memory 15.96% algo=depthwise_separable
17 mobile forward b=64 k=3 p=1 s=2 in=144 out=144 g=144 D=56 13.5 GFlops ( 0.41%) 15.0 GB/s ( 7.81%) limited by memory 7.81% algo=gemm
17 mobile bwd-data b=64 k=3 p=1 s=2 in=144 out=144 g=144 D=56 10.6 GFlops ( 0.32%) 11.8 GB/s ( 6.13%) limited by memory 6.13% algo=gemm
17 mobile bwd-filt b=64 k=3 p=1 s=2 in=144 out=144 g=144 D=56 33.1 GFlops ( 1.00%) 36.8 GB/s (19.11%) limited by memory 19.11% algo=gemm
18 mobile forward b=64 k=1 p=0 s=1 in=144 out=24 g=1 D=56 1121.4 GFlops (33.79%) 109.0 GB/s (56.58%) limited by memory 56.58% algo=gemm
18 mobile bwd-data b=64 k=1 p=0 s=1 in=144 out=24 g=1 D=56 270.5 GFlops ( 8.15%) 26.3 GB/s (13.65%) limited by memory 13.65% algo=gemm
18 mobile bwd-filt b=64 k=1 p=0 s=1 in=144 out=24 g=1 D=56 169.3 GFlops ( 5.10%) 16.5 GB/s ( 8.54%) limited by memory 8.54% algo=gemm
19 mobile forward b=64 k=1 p=0 s=1 in=24 out=144 g=1 D=56 1101.7 GFlops (33.20%) 107.1 GB/s (55.59%) limited by memory 55.59% algo=gemm
19 mobile bwd-data b=64 k=1 p=0 s=1 in=24 out=144 g=1 D=56 944.2 GFlops (28.45%) 91.8 GB/s (47.64%) limited by memory 47.64% algo=gemm
19 mobile bwd-filt b=64 k=1 p=0 s=1 in=24 out=144 g=1 D=56 172.4 GFlops ( 5.19%) 16.8 GB/s ( 8.70%) limited by memory 8.70% algo=gemm
20 mobile forward b=64 k=1 p=0 s=1 in=960 out=160 g=1 D=7 1764.3 GFlops (53.16%) 26.9 GB/s (13.94%) limited by gflops 53.16% algo=gemm
20 mobile bwd-data b=64 k=1 p=0 s=1 in=960 out=160 g=1 D=7 1771.5 GFlops (53.38%) 27.0 GB/s (13.99%) limited by gflops 53.38% algo=gemm
20 mobile bwd-filt b=64 k=1 p=0 s=1 in=960 out=160 g=1 D=7 577.7 GFlops (17.41%) 9.2 GB/s ( 4.75%) limited by gflops 17.41% algo=gemm
21 mobile forward b=64 k=1 p=0 s=1 in=960 out=320 g=1 D=7 2157.6 GFlops (65.02%) 19.4 GB/s (10.04%) limited by gflops 65.02% algo=gemm
21 mobile bwd-data b=64 k=1 p=0 s=1 in=960 out=320 g=1 D=7 2135.8 GFlops (64.36%) 19.2 GB/s ( 9.94%) limited by gflops 64.36% algo=gemm
21 mobile bwd-filt b=64 k=1 p=0 s=1 in=960 out=320 g=1 D=7 988.3 GFlops (29.78%) 9.5 GB/s ( 4.93%) limited by gflops 29.78% algo=gemm
22 mobile forward b=64 k=3 p=1 s=1 in=960 out=960 g=960 D=7 156.9 GFlops ( 4.73%) 69.8 GB/s (36.24%) limited by memory 36.24% algo=depthwise_separable
22 mobile bwd-data b=64 k=3 p=1 s=1 in=960 out=960 g=960 D=7 68.4 GFlops ( 2.06%) 30.4 GB/s (15.80%) limited by memory 15.80% algo=depthwise_separable
22 mobile bwd-filt b=64 k=3 p=1 s=1 in=960 out=960 g=960 D=7 31.2 GFlops ( 0.94%) 13.9 GB/s ( 7.22%) limited by memory 7.22% algo=depthwise_separable
23 scale forward b=64 k=1 p=0 s=1 in=256 out=256 g=256 D=56 48.1 GFlops ( 1.45%) 192.4 GB/s (99.86%) limited by memory 99.86% algo=depthwise_separable
23 scale bwd-data b=64 k=1 p=0 s=1 in=256 out=256 g=256 D=56 28.6 GFlops ( 0.86%) 114.2 GB/s (59.28%) limited by memory 59.28% algo=depthwise_separable
23 scale bwd-filt b=64 k=1 p=0 s=1 in=256 out=256 g=256 D=56 49.7 GFlops ( 1.50%) 198.9 GB/s (103.23%) limited by memory 103.23% algo=depthwise_separable
24 scale forward b=64 k=1 p=0 s=1 in=1024 out=1024 g=1024 D=7 46.5 GFlops ( 1.40%) 186.2 GB/s (96.62%) limited by memory 96.62% algo=depthwise_separable
24 scale bwd-data b=64 k=1 p=0 s=1 in=1024 out=1024 g=1024 D=7 21.1 GFlops ( 0.64%) 84.5 GB/s (43.85%) limited by memory 43.85% algo=depthwise_separable
24 scale bwd-filt b=64 k=1 p=0 s=1 in=1024 out=1024 g=1024 D=7 10.2 GFlops ( 0.31%) 40.8 GB/s (21.18%) limited by memory 21.18% algo=depthwise_separable
Broadcast/Reduce
float (64,512,24,24) (64,512,24,24) (64,512,24,24) 50.0 GFlops ( 1.51%) 300.1 GB/s (155.73%) limited by memory 155.73%
float (64,512,24,24) (512,1,1) (64,512,24,24) 62.3 GFlops ( 1.88%) 249.3 GB/s (129.36%) limited by memory 129.36%
float (64,512,24,24) (1,512,1,1) (1,512,1,1) 43.4 GFlops ( 1.31%) 57.9 GB/s (30.03%) limited by memory 30.03%
float (64,512,24,24) (64,512,24,24) (1,512,1,1) 39.2 GFlops ( 1.18%) 104.6 GB/s (54.30%) limited by memory 54.30%
float (64,512,24,24) (64,512,24,24) (64,1,1,1) 93.0 GFlops ( 2.80%) 247.9 GB/s (128.66%) limited by memory 128.66%
float (256,1000) (256,1) (1) 19.7 GFlops ( 0.59%) 26.3 GB/s (13.65%) limited by memory 13.65%
long (64,512,24,24) (64,512,24,24) (64,512,24,24) 26.2 GFlops ( 0.79%) 314.9 GB/s (163.40%) limited by memory 163.40%
long (64,512,24,24) (512,1,1) (64,512,24,24) 37.8 GFlops ( 1.14%) 302.8 GB/s (157.11%) limited by memory 157.11%
long (64,512,24,24) (1,512,1,1) (1,512,1,1) 31.4 GFlops ( 0.94%) 83.6 GB/s (43.39%) limited by memory 43.39%
long (64,512,24,24) (64,512,24,24) (1,512,1,1) 15.7 GFlops ( 0.47%) 83.7 GB/s (43.44%) limited by memory 43.44%
long (64,512,24,24) (64,512,24,24) (64,1,1,1) 39.7 GFlops ( 1.20%) 212.0 GB/s (109.99%) limited by memory 109.99%
long (256,1000) (256,1) (1) 18.1 GFlops ( 0.54%) 48.3 GB/s (25.05%) limited by memory 25.05%
short (64,512,24,24) (64,512,24,24) (64,512,24,24) 76.9 GFlops ( 2.32%) 230.7 GB/s (119.73%) limited by memory 119.73%
short (64,512,24,24) (512,1,1) (64,512,24,24) 66.8 GFlops ( 2.01%) 133.6 GB/s (69.30%) limited by memory 69.30%
short (64,512,24,24) (1,512,1,1) (1,512,1,1) 43.7 GFlops ( 1.32%) 29.1 GB/s (15.12%) limited by memory 15.12%
short (64,512,24,24) (64,512,24,24) (1,512,1,1) 37.7 GFlops ( 1.14%) 50.3 GB/s (26.10%) limited by memory 26.10%
short (64,512,24,24) (64,512,24,24) (64,1,1,1) 114.7 GFlops ( 3.46%) 153.0 GB/s (79.38%) limited by memory 79.38%
short (256,1000) (256,1) (1) 20.5 GFlops ( 0.62%) 13.7 GB/s ( 7.09%) limited by memory 7.09%
davidlaxer@x86_64-apple-darwin13 build %
Can you build dlprimitives (outside of pytorch) and run some tests to see that it works properly?
What other tests would you like me to run?
You are running on device 0:0 instead of 0:1
See:
5/33 Testing: test_test_case_conv2d
5/33 Test: test_test_case_conv2d
Command: "/Users/davidlaxer/pytorch_dlprim/dlprimitives/build/test_from_template" "0:0" "/Users/davidlaxer/pytorch_dlprim/dlprimitives/tests/test_case_conv2d.json"
Directory: /Users/davidlaxer/pytorch_dlprim/dlprimitives/build
"test_test_case_conv2d" start time: Jan 26 10:10 PST
Output:
----------------------------------------------------------
Running tests for operator Convolution2D on Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz on Apple
See what I mentioned before:
(Note: you'll probably need to set a cmake parameter, something like TEST_DEV=1:0 for platform 1, device 0, according to clinfo -l.)
I think you need to rerun cmake .. -DTEST_DEV=0:1
And then run make test
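For context, the TEST_DEV value (and the first argument to the dlprimitives tools) is a platform:device index pair as listed by clinfo -l. A minimal sketch of how such a spec decomposes — parse_device_spec is a hypothetical illustration, not dlprimitives' actual parsing code:

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper illustrating the "P:D" spec used by TEST_DEV
// (e.g. "0:1" -> OpenCL platform 0, device 1). Not the library's
// real implementation.
std::pair<int, int> parse_device_spec(const std::string &s) {
    auto pos = s.find(':');
    if (pos == std::string::npos)
        throw std::invalid_argument("expected P:D, e.g. 0:1");
    return { std::stoi(s.substr(0, pos)), std::stoi(s.substr(pos + 1)) };
}
```

So "0:1" selects the second device on the first platform, which here is the Radeon GPU rather than the CPU at 0:0.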
% dlprim_flops 0:1 0.5
Testing on AMD Radeon Pro 5700 XT Compute Engine on Apple
What I see in the benchmark is that it does not use the Winograd convolution kernel, which is very important for resnet performance.
What I see in the clinfo log:
Device Name       AMD Radeon Pro 5700 XT Compute Engine
Device Vendor     AMD
Device Vendor ID  0x1021e00
When I check for Winograd compatibility I use:
static bool is_winograd_compatible(Context &ctx, Conv2DSettings const &config)
{
    if(!ctx.is_amd() && !ctx.is_nvidia())
        return false;
And to check AMD I do:
bool Context::is_amd()
{
    if(is_cpu_context())
        return false;
    return device().getInfo<CL_DEVICE_VENDOR_ID>() == 0x1002;
    //return device_extensions().find("cl_amd_") != std::string::npos;
}
While the vendor ID is clearly not the same...
Need to think how to fix it.
Can you try to change the line:
return device().getInfo<CL_DEVICE_VENDOR_ID>() == 0x1002;
To something like:
auto vendor_id = device().getInfo<CL_DEVICE_VENDOR_ID>();
return vendor_id == 0x1002 || vendor_id == 0x1021e00;
And then rerun flops to see whether some of the kernels use Winograd convolution and not only gemm.
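The suggested change boils down to widening the vendor-ID predicate. A self-contained sketch of that logic — is_amd_vendor is an illustrative stand-in for the check inside Context::is_amd(), not the actual dlprimitives code:

```cpp
#include <cstdint>

// Illustrative stand-in for the check inside Context::is_amd():
// accept both the canonical AMD PCI vendor ID (0x1002) and the
// 0x1021e00 value Apple's OpenCL driver reports in the clinfo log above.
bool is_amd_vendor(std::uint32_t vendor_id) {
    return vendor_id == 0x1002 || vendor_id == 0x1021e00;
}
```

The caveat is that Apple's vendor IDs are driver-specific rather than PCI IDs, so matching them hard-codes an assumption about Apple's OpenCL stack.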
% make test
Running tests...
/opt/local/bin/ctest --force-new-ctest-process
Test project /Users/davidlaxer/pytorch_dlprim/dlprimitives/build
Start 1: test_test_case_abs
1/33 Test #1: test_test_case_abs ............... Passed 1.03 sec
Start 2: test_test_case_activation
2/33 Test #2: test_test_case_activation ........ Passed 1.36 sec
Start 3: test_test_case_batchnorm
3/33 Test #3: test_test_case_batchnorm ......... Passed 5.06 sec
Start 4: test_test_case_concat
4/33 Test #4: test_test_case_concat ............ Passed 0.11 sec
Start 5: test_test_case_conv2d
5/33 Test #5: test_test_case_conv2d ............ Passed 124.81 sec
Start 6: test_test_case_conv2d_dsc
6/33 Test #6: test_test_case_conv2d_dsc ........ Passed 35.64 sec
Start 7: test_test_case_conv2d_gemm
7/33 Test #7: test_test_case_conv2d_gemm ....... Passed 40.40 sec
Start 8: test_test_case_conv2d_win
8/33 Test #8: test_test_case_conv2d_win ........ Passed 35.59 sec
Start 9: test_test_case_elementwise
9/33 Test #9: test_test_case_elementwise ....... Passed 6.52 sec
Start 10: test_test_case_global_pooling
10/33 Test #10: test_test_case_global_pooling .... Passed 5.56 sec
Start 11: test_test_case_hardtanh
11/33 Test #11: test_test_case_hardtanh .......... Passed 0.58 sec
Start 12: test_test_case_inner_product
12/33 Test #12: test_test_case_inner_product ..... Passed 14.59 sec
Start 13: test_test_case_log_softmax
13/33 Test #13: test_test_case_log_softmax ....... Passed 0.31 sec
Start 14: test_test_case_mse_loss
14/33 Test #14: test_test_case_mse_loss .......... Passed 0.32 sec
Start 15: test_test_case_nll_loss
15/33 Test #15: test_test_case_nll_loss .......... Passed 0.38 sec
Start 16: test_test_case_param
16/33 Test #16: test_test_case_param ............. Passed 0.10 sec
Start 17: test_test_case_pooling2d
17/33 Test #17: test_test_case_pooling2d ......... Passed 69.22 sec
Start 18: test_test_case_reduction
18/33 Test #18: test_test_case_reduction ......... Passed 19.95 sec
Start 19: test_test_case_slice
19/33 Test #19: test_test_case_slice ............. Passed 0.07 sec
Start 20: test_test_case_softmax
20/33 Test #20: test_test_case_softmax ........... Passed 0.44 sec
Start 21: test_test_case_softmax_loss
21/33 Test #21: test_test_case_softmax_loss ...... Passed 0.24 sec
Start 22: test_test_case_threshold
22/33 Test #22: test_test_case_threshold ......... Passed 0.55 sec
Start 23: test_test_case_tr_conv2d
23/33 Test #23: test_test_case_tr_conv2d ......... Passed 40.02 sec
Start 24: test_test_case_tr_conv2d_dsc
24/33 Test #24: test_test_case_tr_conv2d_dsc ..... Passed 2.00 sec
Start 25: test_test_case_tr_conv2d_gemm
25/33 Test #25: test_test_case_tr_conv2d_gemm .... Passed 2.99 sec
Start 26: test_test_case_tr_conv2d_win
26/33 Test #26: test_test_case_tr_conv2d_win ..... Passed 2.00 sec
Start 27: test_net
27/33 Test #27: test_net ......................... Passed 1.69 sec
Start 28: test_net_nonopt
28/33 Test #28: test_net_nonopt .................. Passed 0.09 sec
Start 29: test_json
29/33 Test #29: test_json ........................ Passed 0.13 sec
Start 30: test_random
30/33 Test #30: test_random ...................... Passed 0.16 sec
Start 31: test_context
31/33 Test #31: test_context ..................... Passed 0.20 sec
Start 32: test_util
32/33 Test #32: test_util ........................ Passed 10.22 sec
Start 33: test_broadcast_reduce
33/33 Test #33: test_broadcast_reduce ............ Passed 7.90 sec
100% tests passed, 0 tests failed out of 33
Total Test time (real) = 430.26 sec
davidlaxer@x86_64-apple-darwin13 build %
... and after editing context.cpp:
bool Context::is_amd()
{
    if(is_cpu_context())
        return false;
    auto vendor_id = device().getInfo<CL_DEVICE_VENDOR_ID>();
    return vendor_id == 0x1002 || vendor_id == 0x1021e00;
    //return device_extensions().find("cl_amd_") != std::string::npos;
}
Test
% make test
Running tests...
/opt/local/bin/ctest --force-new-ctest-process
Test project /Users/davidlaxer/pytorch_dlprim/dlprimitives/build
Start 1: test_test_case_abs
1/33 Test #1: test_test_case_abs ............... Passed 0.76 sec
Start 2: test_test_case_activation
2/33 Test #2: test_test_case_activation ........ Passed 1.17 sec
Start 3: test_test_case_batchnorm
3/33 Test #3: test_test_case_batchnorm ......... Passed 4.82 sec
Start 4: test_test_case_concat
4/33 Test #4: test_test_case_concat ............ Passed 0.08 sec
Start 5: test_test_case_conv2d
5/33 Test #5: test_test_case_conv2d ............***Failed 2.18 sec
Start 6: test_test_case_conv2d_dsc
6/33 Test #6: test_test_case_conv2d_dsc ........ Passed 58.13 sec
Start 7: test_test_case_conv2d_gemm
7/33 Test #7: test_test_case_conv2d_gemm ....... Passed 34.97 sec
Start 8: test_test_case_conv2d_win
8/33 Test #8: test_test_case_conv2d_win ........***Failed 0.54 sec
Start 9: test_test_case_elementwise
9/33 Test #9: test_test_case_elementwise ....... Passed 0.63 sec
Start 10: test_test_case_global_pooling
10/33 Test #10: test_test_case_global_pooling .... Passed 4.06 sec
Start 11: test_test_case_hardtanh
11/33 Test #11: test_test_case_hardtanh .......... Passed 0.53 sec
Start 12: test_test_case_inner_product
12/33 Test #12: test_test_case_inner_product ..... Passed 17.31 sec
Start 13: test_test_case_log_softmax
13/33 Test #13: test_test_case_log_softmax ....... Passed 0.16 sec
Start 14: test_test_case_mse_loss
14/33 Test #14: test_test_case_mse_loss .......... Passed 0.08 sec
Start 15: test_test_case_nll_loss
15/33 Test #15: test_test_case_nll_loss .......... Passed 0.24 sec
Start 16: test_test_case_param
16/33 Test #16: test_test_case_param ............. Passed 0.07 sec
Start 17: test_test_case_pooling2d
17/33 Test #17: test_test_case_pooling2d ......... Passed 68.12 sec
Start 18: test_test_case_reduction
18/33 Test #18: test_test_case_reduction ......... Passed 16.88 sec
Start 19: test_test_case_slice
19/33 Test #19: test_test_case_slice ............. Passed 0.10 sec
Start 20: test_test_case_softmax
20/33 Test #20: test_test_case_softmax ........... Passed 0.18 sec
Start 21: test_test_case_softmax_loss
21/33 Test #21: test_test_case_softmax_loss ...... Passed 0.14 sec
Start 22: test_test_case_threshold
22/33 Test #22: test_test_case_threshold ......... Passed 0.49 sec
Start 23: test_test_case_tr_conv2d
23/33 Test #23: test_test_case_tr_conv2d .........***Failed 0.75 sec
Start 24: test_test_case_tr_conv2d_dsc
24/33 Test #24: test_test_case_tr_conv2d_dsc ..... Passed 7.31 sec
Start 25: test_test_case_tr_conv2d_gemm
25/33 Test #25: test_test_case_tr_conv2d_gemm .... Passed 2.36 sec
Start 26: test_test_case_tr_conv2d_win
26/33 Test #26: test_test_case_tr_conv2d_win .....***Failed 0.79 sec
Start 27: test_net
27/33 Test #27: test_net .........................Subprocess aborted***Exception: 0.30 sec
Start 28: test_net_nonopt
28/33 Test #28: test_net_nonopt ..................Subprocess aborted***Exception: 0.06 sec
Start 29: test_json
29/33 Test #29: test_json ........................ Passed 0.13 sec
Start 30: test_random
30/33 Test #30: test_random ...................... Passed 0.17 sec
Start 31: test_context
31/33 Test #31: test_context ..................... Passed 0.16 sec
Start 32: test_util
32/33 Test #32: test_util ........................ Passed 9.08 sec
Start 33: test_broadcast_reduce
33/33 Test #33: test_broadcast_reduce ............ Passed 1.35 sec
82% tests passed, 6 tests failed out of 33
Total Test time (real) = 234.11 sec
The following tests FAILED:
5 - test_test_case_conv2d (Failed)
8 - test_test_case_conv2d_win (Failed)
23 - test_test_case_tr_conv2d (Failed)
26 - test_test_case_tr_conv2d_win (Failed)
27 - test_net (Subprocess aborted)
28 - test_net_nonopt (Subprocess aborted)
Errors while running CTest
Output from these tests are in: /Users/davidlaxer/pytorch_dlprim/dlprimitives/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
make: *** [test] Error 8
davidlaxer@x86_64-apple-darwin13 build %