alibaba / MNN

MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba
http://www.mnn.zone/

is training supported only on the CPU, not yet on the GPU? #1250

Closed theyoungkwon closed 3 years ago

theyoungkwon commented 3 years ago

Platform (include target platform as well if cross-compiling):

Mac (macOS version 10.14.5)
deviceName: Intel(R) Iris(TM) Plus Graphics 640
deviceVersion: OpenCL 1.2

GitHub Version:

81df3a4 (pulled from the master branch; commit message: "Fix compile bug for old gcc version")

Compiling Method

cmake

cd $PATH_TO_MNN_ROOT/build
cmake .. \
    -DMNN_VULKAN:BOOL=OFF \
    -DMNN_OPENCL:BOOL=ON \
    -DMNN_OPENMP:BOOL=ON \
    -DMNN_OPENGL:BOOL=OFF \
    -DMNN_DEBUG:BOOL=ON \
    -DMNN_BUILD_TRAIN:BOOL=ON \
    -DMNN_BUILD_TRAIN_MINI:BOOL=OFF \
    -DMNN_USE_OPENCV:BOOL=OFF \
    -DMNN_BUILD_BENCHMARK:BOOL=ON \
    -DNATIVE_LIBRARY_OUTPUT=.
make -j4 runTrainDemo.out

-- >>>>>>>>>>>>>
-- MNN BUILD INFO:
--  System: Darwin
--  Processor: x86_64
--  Metal: OFF
--  OpenCL: ON
--  OpenGL: OFF
--  Vulkan: OFF
--  ARM82: OFF
--  TensorRT: OFF
--  CUDA: OFF
--  OpenMP: OFF
--  Hidden: TRUE
--  Build Path: /Users/ydkwon/2ndData/self_study/tinyML/ref_proj/MNN_201123/build
-- x86_64: Open SSE
-- Onnx: /Users/ydkwon/2ndData/self_study/tinyML/ref_proj/MNN_201123/build/tools/converter/onnx.pb.h;/Users/ydkwon/2ndData/self_study/tinyML/ref_proj/MNN_201123/build/tools/converter/onnx-operators.pb.h
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/ydkwon/2ndData/self_study/tinyML/ref_proj/MNN_201123/build
[  0%] Built target MNNUtils
[  2%] Built target MNNAVX
[  6%] Built target MNNCore
[ 28%] Built target MNNCPU
[ 30%] Built target MNNSSE
[ 36%] Built target GenVCSHDR
[ 36%] Built target MNNCompute
[ 36%] Built target MNNCV
[ 38%] Built target MNNMath
[ 58%] Built target MNNX8664
[ 70%] Built target MNNTransform
[ 70%] Built target MNN
[ 76%] Built target MNN_Express
[ 86%] Built target MNN_CL
Scanning dependencies of target MNNTrain
[ 88%] Linking CXX shared library libMNNTrain.dylib
[ 96%] Built target MNNTrain
Scanning dependencies of target runTrainDemo.out
[ 96%] Building CXX object tools/train/CMakeFiles/runTrainDemo.out.dir/source/demo/MnistUtils.cpp.o
[ 96%] Linking CXX executable ../../runTrainDemo.out
[100%] Built target runTrainDemo.out


At first, I tested training the provided MnistV2 model in "tools/train/source/demo/mnistTrain.cpp" with the CPU backend, using the command below. It works fine: the loss decreases as expected.

./runTrainDemo.out MnistTrain /path/to/unzipped/mnist/data/

Error for open mnist.snapshot.mnn
Error parameters, empty or parameter size not match 
Error to find creator of 9, set CPU default
epoch: 0  640 / 60000 loss: 2.09259 lr: 0.00999326 time: 354.783 ms / 9 iter
epoch: 0  1280 / 60000 loss: 1.64925 lr: 0.00998577 time: 315.211 ms / 10 iter
epoch: 0  1920 / 60000 loss: 1.02414 lr: 0.0099783 time: 385.17 ms / 10 iter
epoch: 0  2560 / 60000 loss: 1.03541 lr: 0.00997085 time: 463.84 ms / 10 iter
epoch: 0  3200 / 60000 loss: 0.714723 lr: 0.00996341 time: 527.701 ms / 10 iter
epoch: 0  3840 / 60000 loss: 0.497747 lr: 0.00995598 time: 477.038 ms / 10 iter
epoch: 0  4480 / 60000 loss: 0.588594 lr: 0.00994856 time: 467.815 ms / 10 iter
epoch: 0  5120 / 60000 loss: 0.531522 lr: 0.00994116 time: 548.745 ms / 10 iter
...
epoch: 0  59520 / 60000 loss: 0.0205166 lr: 0.00935545 time: 414.481 ms / 10 iter
epoch: 0  60000 / 60000 loss: 0.0804286 lr: 0.00935032 time: 254.758 ms / 8 iter
train, 69, cost time: 49807.082031 ms
test: 2000 / 10000
test: 4000 / 10000
test: 6000 / 10000
test: 8000 / 10000
test: 10000 / 10000
epoch: 0  accuracy: 0.9773
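
As a quick sanity check (a minimal sketch, not part of MNN), the loss values can be pulled out of this log format to confirm they trend downward:

import re

# A few lines copied from the CPU training log above.
log = """epoch: 0  640 / 60000 loss: 2.09259 lr: 0.00999326 time: 354.783 ms / 9 iter
epoch: 0  1280 / 60000 loss: 1.64925 lr: 0.00998577 time: 315.211 ms / 10 iter
epoch: 0  1920 / 60000 loss: 1.02414 lr: 0.0099783 time: 385.17 ms / 10 iter"""

# Each line: "epoch: E  SEEN / TOTAL loss: L lr: R time: T ms / I iter"
loss_re = re.compile(r"loss: ([\d.]+)")
losses = [float(m.group(1)) for m in loss_re.finditer(log)]
print(losses)  # [2.09259, 1.64925, 1.02414]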

However, after I switched the backend from CPU to OpenCL, training fails, as the logs below show. FYI, to switch the backend I changed a single line in "tools/train/source/demo/MnistUtils.cpp" (line 39):

Before (CPU backend):    exe->setGlobalExecutorConfig(MNN_FORWARD_CPU, backendConfig, 4);
After (OpenCL backend):  exe->setGlobalExecutorConfig(MNN_FORWARD_OPENCL, backendConfig, 4);

Don't support type Cast
input n:25, h:1, w:1, c:36864
input n:20, h:1, w:1, c:25
input n:1, h:1, w:1, c:20
output n:36864, h:1, w:1, c:20
beyond cl_image creat size! fallback to cpu backend
input n:64, h:24, w:24, c:20
output n:64, h:24, w:24, c:20
beyond cl_image creat size! fallback to cpu backend
Don't support type Cast
Don't support type OneHot
The Creator Don't support type BinaryOp
Don't support type Cast
The Creator Don't support type BinaryOp
Don't support type Cast
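
For context on the "beyond cl_image creat size" messages: MNN's OpenCL backend stores tensors as cl_image objects, whose per-side dimensions are capped by the device. The packing scheme and limit below are assumptions for illustration (roughly 4 channels per RGBA texel; 16384 is a common value of CL_DEVICE_IMAGE2D_MAX_WIDTH/HEIGHT), not MNN's exact layout:

# Hypothetical sketch of why a tensor can exceed the OpenCL image limit.
# Assumed packing: width = ceil(C / 4) * W, height = N * H
# (4 channels per RGBA texel). MNN's real layout may differ.
MAX_IMAGE_SIDE = 16384  # a common CL_DEVICE_IMAGE2D_MAX_WIDTH/HEIGHT

def fits_in_cl_image(n, c, h, w):
    width = (c + 3) // 4 * w
    height = n * h
    return max(width, height) <= MAX_IMAGE_SIDE

# The output tensor from the log: n:36864, h:1, w:1, c:20
# gives an assumed height of 36864 > 16384, so it cannot be a cl_image.
print(fits_in_cl_image(36864, 20, 1, 1))  # False -> falls back to CPU

Devices with smaller image limits, or a different packing, would trip the check on even more tensors, which may explain the other fallbacks in the log.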

So, is training currently supported only on the CPU, with GPU training still under development? Or have I misconfigured something?

Thank you in advance.

p.s. I tested training on a smartphone as well (Pixel 4, Android 10, OpenCL 2.0, Adreno(TM) 640) and see the same behavior there: training on the CPU works well, but running on the GPU produces the same errors as above.

jxt1234 commented 3 years ago

Currently, the OpenCL and Vulkan backends do not support training because of the image size limit. The CUDA backend is ready for training.