alibaba / MNN

MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba
http://www.mnn.zone/

OpenCL inference error: CL ERROR CODE : -9999, info:commandQueue CL ERROR CODE : -36, info:conv_2d_c4h1w1256_1_1_1_1_1_1 CL ERROR CODE : -58, info:clEvent #2613

Closed fanjing8 closed 9 months ago

fanjing8 commented 1 year ago

Platform (Include target platform as well if cross-compiling):

Ubuntu 20.04, x64

Github Version:

2.7.1

Compiling Method:

```
git checkout -b v2.7.1
mkdir buildv2.7.1
cd buildv2.7.1
cmake .. -DMNN_OPENCL=ON -DMNN_VULKAN=ON -DMNN_SEP_BUILD=OFF
make -j8
make install DESTDIR=./install
```

Build Log:

...
[ 92%] Built target GetMNNInfo
[ 92%] Built target ModuleBasic.out
[ 93%] Built target SequenceModuleTest.out
[ 94%] Built target mergeInplaceForCPU
[ 94%] Built target MNNV2Basic.out
[ 94%] Built target mobilenetTest.out
[ 94%] Built target backendTest.out
[ 95%] Built target testModel.out
[ 96%] Built target testModel_expr.out
[ 96%] Built target testModelWithDescribe.out
[ 96%] Built target getPerformance.out
[ 97%] Built target checkInvalidValue.out
[ 98%] Built target timeProfile.out
[ 98%] Built target testTrain.out
[ 99%] Built target checkDir.out
[ 99%] Built target checkFile.out
[100%] Built target winogradExample.out
Install the project...
-- Install configuration: ""
-- Installing: ./install/usr/local/include/MNN/MNNDefine.h
-- Installing: ./install/usr/local/include/MNN/Interpreter.hpp
-- Installing: ./install/usr/local/include/MNN/HalideRuntime.h
-- Installing: ./install/usr/local/include/MNN/Tensor.hpp
-- Installing: ./install/usr/local/include/MNN/ErrorCode.hpp
-- Installing: ./install/usr/local/include/MNN/ImageProcess.hpp
-- Installing: ./install/usr/local/include/MNN/Matrix.h
-- Installing: ./install/usr/local/include/MNN/Rect.h
-- Installing: ./install/usr/local/include/MNN/MNNForwardType.h
-- Installing: ./install/usr/local/include/MNN/AutoTime.hpp
-- Installing: ./install/usr/local/include/MNN/MNNSharedContext.h
-- Installing: ./install/usr/local/include/MNN/expr/Expr.hpp
-- Installing: ./install/usr/local/include/MNN/expr/ExprCreator.hpp
-- Installing: ./install/usr/local/include/MNN/expr/MathOp.hpp
-- Installing: ./install/usr/local/include/MNN/expr/NeuralNetWorkOp.hpp
-- Installing: ./install/usr/local/include/MNN/expr/Optimizer.hpp
-- Installing: ./install/usr/local/include/MNN/expr/Executor.hpp
-- Installing: ./install/usr/local/include/MNN/expr/Module.hpp
-- Up-to-date: ./install/usr/local/include/MNN/expr/NeuralNetWorkOp.hpp
-- Installing: ./install/usr/local/include/MNN/expr/ExecutorScope.hpp
-- Installing: ./install/usr/local/include/MNN/expr/Scope.hpp
-- Installing: ./install/usr/local/lib/libMNN.so

OpenCL inference error

I run an MNN model (mainly a transformer module) through the C++ Session API with the OpenCL backend. Because every inference uses a dynamic shape, every infer call needs a resizeSession. The first resizeSession usually does not trigger the errors below, but at some later inference's resizeSession the following errors appear:
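The flow described above can be sketched roughly as follows (a minimal sketch of the MNN Session API with placeholder model path and shape, not the poster's actual code):

```cpp
#include <MNN/Interpreter.hpp>
#include <memory>

int main() {
    // Load the model (path is a placeholder)
    std::shared_ptr<MNN::Interpreter> net(
        MNN::Interpreter::createFromFile("transformer.mnn"));

    MNN::ScheduleConfig config;
    config.type = MNN_FORWARD_OPENCL;  // use the OpenCL backend
    auto session = net->createSession(config);

    // Dynamic shapes: resize the input and the session before every inference
    auto input = net->getSessionInput(session, nullptr);
    net->resizeTensor(input, {1, 3, 224, 224});  // new shape each call
    net->resizeSession(session);                 // errors reportedly appear here

    // ... fill the input tensor, then:
    net->runSession(session);
    return 0;
}
```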

```
CL ERROR CODE : -9999, info:commandQueue
CL ERROR CODE : -36, info:conv_2d_c4h1w1256_1_1_1_1_1_1
CL ERROR CODE : -58, info:clEvent
CL ERROR CODE : -36, info:conv_2d_c4h1w2256_1_1_1_1_1_1
CL ERROR CODE : -58, info:clEvent
CL ERROR CODE : -36, info:conv_2d_c4h4w1256_1_1_1_1_1_1
CL ERROR CODE : -58, info:clEvent
CL ERROR CODE : -36, info:conv_2d_c8h2w1256_1_1_1_1_1_1
CL ERROR CODE : -58, info:clEvent
CL ERROR CODE : -36, info:conv_2d_c8h4w1256_1_1_1_1_1_1
CL ERROR CODE : -58, info:clEvent
CL ERROR CODE : -36, info:conv_2d_c4h1w4256_1_1_1_1_1_1
CL ERROR CODE : -58, info:clEvent
...
CL ERROR CODE : -36, info:Raster
CL ERROR CODE : -36, info:Raster
CL ERROR CODE : -36, info:Raster
CL ERROR CODE : -36, info:BinaryOp
CL ERROR CODE : -36, info:run3d
CL ERROR CODE : -36, info:Raster
CL ERROR CODE : -36, info:Raster
CL ERROR CODE : -36, info:Raster
CL ERROR CODE : -36, info:nc4hw4_buffer_to_nchw_buffer
CL ERROR CODE : -36, info:nc4hw4_buffer_to_nhwc_buffer
Segmentation fault
```

Embedded int32_t inference error (both OpenCL and CUDA have this problem)

Debugging shows the OpenCL output is already wrong on the very first inference (note the output tensor is of type int32_t). mnnTensor->print() gives the result below, but it is incorrect: the correct values should be -1 and moderately sized integers (e.g. below 1000), whereas the -2147483648, -2147483648 printed here is visibly wrong.

Dimension: 1, 2525, 
Data: -2147483648, 0, -2147483648, 0, 0, 0, 0, 0, 0, 0, 279, 0, 0, 0, 333, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ....
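One pitfall worth ruling out for the garbage values above (a hedged sketch, not a diagnosis; function and variable names are placeholders): with GPU backends the session's output tensor lives in device memory, and MNN's documented pattern is to copy it into a host tensor before reading it.

```cpp
#include <MNN/Interpreter.hpp>
#include <MNN/Tensor.hpp>

// 'net' and 'session' as created earlier; reads the (unnamed) output tensor
void readInt32Output(MNN::Interpreter* net, MNN::Session* session) {
    auto deviceTensor = net->getSessionOutput(session, nullptr);

    // Host-side tensor with the same shape and layout
    MNN::Tensor hostTensor(deviceTensor, deviceTensor->getDimensionType());
    deviceTensor->copyToHostTensor(&hostTensor);

    // Only after the copy is the int32 data valid on the host
    const int32_t* data = hostTensor.host<int32_t>();
    (void)data;
}
```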

Others

With CPU and Vulkan inference, neither of the above problems occurs, but Vulkan is extremely slow. The Python version also shows the above problems with OpenCL and CUDA. I tried rolling back to version 2.6.2 (for reference) and the problems still exist.

bitxsw93 commented 1 year ago

OpenCL needs to be set to buffer mode (mode=65). The error may be caused by running out of GPU memory; check the memory usage of the running program.

fanjing8 commented 1 year ago

> OpenCL needs to be set to buffer mode (mode=65). The error may be caused by running out of GPU memory; check the memory usage of the running program.

Setting mode to 65 has already been tried; it doesn't work.

GPU memory usage is under 25%, and system memory is also sufficient.

DeepSpace98 commented 1 year ago

Try setting numThread to 1 when using OpenCL.

fanjing8 commented 1 year ago

> Try setting numThread to 1 when using OpenCL.

  1. After initial testing, the OpenCL inference problem looks solved, many thanks! But why does this change work? Could you explain, or point me to some reference material?

  2. The CUDA inference problem still exists.

github-actions[bot] commented 9 months ago

Marking as stale. No activity in 60 days.