very slow on hd graphics 5000

anguoyang commented 6 years ago

Hi @ganyc717, thank you for your project. I tested it on my Intel HD Graphics 5000 and it is very slow: about 8 seconds for a single image, e.g. dog.jpg. I have not modified your source code.

ganyc717 commented 6 years ago

Hi @anguoyang, as far as I can tell, performance depends heavily on the hardware. I tested this project on a GTX 970: detecting dog.jpg with yolo.cfg takes about 0.18 seconds, which is about double the time of the original CUDA project on the same hardware (GTX 970). Thank you for opening this ticket.

AndrewSivrit commented 6 years ago

Hi @anguoyang, which mode did you use, Debug or Release? Make sure Release mode is selected in Visual Studio.

anguoyang commented 6 years ago

Hi @AndrewSivrit, I used Release mode in VS, thank you.

anguoyang commented 6 years ago

Hi @ganyc717, yes, maybe, but I want to use an Intel GPU instead of NVIDIA because it is cost-efficient for production. Thank you for your quick reply.

anguoyang commented 6 years ago

Double the time of CUDA is acceptable and reasonable. However, my test result is really slow, on the order of a hundred times slower than CUDA (on similar hardware), so I suppose there may be something wrong with my program?

anguoyang commented 6 years ago

D:\Darknet-On-OpenCL\x64\Release>darknet_cl detect cfg/yolo.cfg yolo.weights data/dog.jpg
layer filters size input output
0 conv 32 3 x 3 / 1 608 x 608 x 3 -> 608 x 608 x 32
1 blas_kernels_1.cl build log:
1:82:37: warning: double precision constant requires cl_khr_fp64, casting to single precision
1:82:58: warning: double precision constant requires cl_khr_fp64, casting to single precision
fcl build 1 succeeded.
fcl build 2 succeeded.
bcl build succeeded.

max 2 x 2 / 2 608 x 608 x 32 -> 304 x 304 x 32
2 conv 64 3 x 3 / 1 304 x 304 x 32 -> 304 x 304 x 64
3 max 2 x 2 / 2 304 x 304 x 64 -> 152 x 152 x 64
4 conv 128 3 x 3 / 1 152 x 152 x 64 -> 152 x 152 x 128
5 conv 64 1 x 1 / 1 152 x 152 x 128 -> 152 x 152 x 64
6 conv 128 3 x 3 / 1 152 x 152 x 64 -> 152 x 152 x 128
7 max 2 x 2 / 2 152 x 152 x 128 -> 76 x 76 x 128
8 conv 256 3 x 3 / 1 76 x 76 x 128 -> 76 x 76 x 256
9 conv 128 1 x 1 / 1 76 x 76 x 256 -> 76 x 76 x 128
10 conv 256 3 x 3 / 1 76 x 76 x 128 -> 76 x 76 x 256
11 max 2 x 2 / 2 76 x 76 x 256 -> 38 x 38 x 256
12 conv 512 3 x 3 / 1 38 x 38 x 256 -> 38 x 38 x 512
13 conv 256 1 x 1 / 1 38 x 38 x 512 -> 38 x 38 x 256
14 conv 512 3 x 3 / 1 38 x 38 x 256 -> 38 x 38 x 512
15 conv 256 1 x 1 / 1 38 x 38 x 512 -> 38 x 38 x 256
16 conv 512 3 x 3 / 1 38 x 38 x 256 -> 38 x 38 x 512
17 max 2 x 2 / 2 38 x 38 x 512 -> 19 x 19 x 512
18 conv 1024 3 x 3 / 1 19 x 19 x 512 -> 19 x 19 x1024
19 conv 512 1 x 1 / 1 19 x 19 x1024 -> 19 x 19 x 512
20 conv 1024 3 x 3 / 1 19 x 19 x 512 -> 19 x 19 x1024
21 conv 512 1 x 1 / 1 19 x 19 x1024 -> 19 x 19 x 512
22 conv 1024 3 x 3 / 1 19 x 19 x 512 -> 19 x 19 x1024
23 conv 1024 3 x 3 / 1 19 x 19 x1024 -> 19 x 19 x1024
24 conv 1024 3 x 3 / 1 19 x 19 x1024 -> 19 x 19 x1024
25 route 16
26 conv 64 1 x 1 / 1 38 x 38 x 512 -> 38 x 38 x 64
27 reorg / 2 38 x 38 x 64 -> 19 x 19 x 256
28 route 27 24
29 conv 1024 3 x 3 / 1 19 x 19 x1280 -> 19 x 19 x1024
30 conv 425 1 x 1 / 1 19 x 19 x1024 -> 19 x 19 x 425
31 detection
mask_scale: Using default '1.000000'
Loading weights from yolo.weights...Done!
im2col_kernels.cl build log:
2:36:18: warning: '/*' within block comment
fcl build 1 succeeded.
fcl build 2 succeeded.
bcl build succeeded.

activation_kernels.cl build log:
4:21:12: warning: double precision constant requires cl_khr_fp64, casting to single precision
fcl build 1 succeeded.
fcl build 2 succeeded.
bcl build succeeded.

maxpool_layer_kernels.cl build log:
fcl build 1 succeeded.
fcl build 2 succeeded.
bcl build succeeded.

blas_kernels_2.cl build log:
fcl build 1 succeeded.
fcl build 2 succeeded.
bcl build succeeded.

data/dog.jpg: Predicted in 7.806060 seconds.
dog: 82%
car: 28%
truck: 64%
bicycle: 85%
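
As a rough sanity check on the timing above, the convolution workload can be estimated directly from the layer table. The sketch below is not part of the project. It assumes the usual count of two floating-point operations per multiply-accumulate, ignores the maxpool, route, reorg and detection layers, and uses function names invented for illustration.

```python
# Each tuple is (filters, kernel_size, in_channels, out_h, out_w),
# transcribed from the conv rows of the layer table printed by darknet_cl.
CONV_LAYERS = [
    (32, 3, 3, 608, 608),      # layer 0
    (64, 3, 32, 304, 304),     # layer 2
    (128, 3, 64, 152, 152),    # layer 4
    (64, 1, 128, 152, 152),    # layer 5
    (128, 3, 64, 152, 152),    # layer 6
    (256, 3, 128, 76, 76),     # layer 8
    (128, 1, 256, 76, 76),     # layer 9
    (256, 3, 128, 76, 76),     # layer 10
    (512, 3, 256, 38, 38),     # layer 12
    (256, 1, 512, 38, 38),     # layer 13
    (512, 3, 256, 38, 38),     # layer 14
    (256, 1, 512, 38, 38),     # layer 15
    (512, 3, 256, 38, 38),     # layer 16
    (1024, 3, 512, 19, 19),    # layer 18
    (512, 1, 1024, 19, 19),    # layer 19
    (1024, 3, 512, 19, 19),    # layer 20
    (512, 1, 1024, 19, 19),    # layer 21
    (1024, 3, 512, 19, 19),    # layer 22
    (1024, 3, 1024, 19, 19),   # layer 23
    (1024, 3, 1024, 19, 19),   # layer 24
    (64, 1, 512, 38, 38),      # layer 26
    (1024, 3, 1280, 19, 19),   # layer 29
    (425, 1, 1024, 19, 19),    # layer 30
]


def conv_flops(filters, kernel_size, in_channels, out_h, out_w):
    """FLOPs for one conv layer, counting a multiply-accumulate as 2 FLOPs."""
    macs = filters * kernel_size * kernel_size * in_channels * out_h * out_w
    return 2 * macs


def total_conv_flops(layers):
    """Sum of conv_flops over all listed layers."""
    return sum(conv_flops(*layer) for layer in layers)


def achieved_gflops(flops, seconds):
    """Average throughput in GFLOP/s implied by a wall-clock time."""
    return flops / seconds / 1e9


if __name__ == "__main__":
    total = total_conv_flops(CONV_LAYERS)
    print(f"conv work per image: {total / 1e9:.1f} GFLOP")
    # 7.806060 s is the Intel HD 5000 run; 0.18 s is the GTX 970 figure
    # quoted earlier in this thread.
    for label, secs in (("HD 5000", 7.806060), ("GTX 970", 0.18)):
        print(f"{label}: {achieved_gflops(total, secs):.1f} GFLOP/s")
```

Under these assumptions the network needs roughly 63 GFLOP per image, so 7.8 seconds corresponds to about 8 GFLOP/s, against about 350 GFLOP/s for the 0.18 seconds reported on the GTX 970. That is a gap of roughly 43x between the two OpenCL runs. Note that the measured time may also include kernel compilation and data transfer, which this estimate does not separate out.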

ganyc717 commented 6 years ago

Hi @anguoyang, I tested on my laptop with an Intel HD 4600. It seems the majority of kernel time is spent in the sgemm function. This is a BLAS function, and I suggest not modifying it. I noticed that clBLAS has special optimizations for AMD GPUs, but I didn't include them in this repo, so you could switch to another GPU and try again. Or just choose a smaller network such as tiny-yolo. Best regards!
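
To illustrate why sgemm dominates, here is a minimal sketch of the matrix sizes involved. It assumes this port keeps Darknet's usual formulation, where each convolution is lowered with im2col to a single GEMM with M = filters, K = size * size * input channels and N = output height * output width. The helper names are hypothetical and do not come from the repository.

```python
def gemm_shape(filters, kernel_size, in_channels, out_h, out_w):
    """Return (M, N, K) of the GEMM that one conv layer is lowered to.

    C[M x N] = A[M x K] * B[K x N], where A holds the weights and
    B is the im2col-expanded input.
    """
    m = filters
    k = kernel_size * kernel_size * in_channels
    n = out_h * out_w
    return m, n, k


def im2col_bytes(kernel_size, in_channels, out_h, out_w, bytes_per_float=4):
    """Size of the im2col buffer (the K x N matrix B) in bytes."""
    k = kernel_size * kernel_size * in_channels
    n = out_h * out_w
    return k * n * bytes_per_float


if __name__ == "__main__":
    # (filters, kernel_size, in_channels, out_h, out_w) for three layers
    # taken from the yolo.cfg layer table earlier in this thread.
    examples = {
        0: (32, 3, 3, 608, 608),
        23: (1024, 3, 1024, 19, 19),
        29: (1024, 3, 1280, 19, 19),
    }
    for idx, layer in examples.items():
        m, n, k = gemm_shape(*layer)
        buf_mib = im2col_bytes(*layer[1:]) / 2**20
        print(f"layer {idx}: M={m} N={n} K={k}, im2col buffer {buf_mib:.1f} MiB")
```

The shapes vary a lot across the network: early layers give a short, very wide GEMM (small M and K, huge N), while the late layers give a large K with a small N. A GEMM kernel tuned for one GPU family will not necessarily handle both extremes well on another, which is consistent with the suggestion to try a smaller network.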

victorv commented 6 years ago

OpenCL performance is not platform independent, so you would need to tune any CL code for the target platform to avoid register spilling, local memory overflow, etc.
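
A minimal sketch of the kind of check this tuning involves, assuming a simple tiled GEMM kernel that keeps one tile of A and one tile of B in local memory and uses one work-item per output element of the tile. The limit values below are placeholders, not measurements of any particular GPU. On a real device they would be read with clGetDeviceInfo using CL_DEVICE_LOCAL_MEM_SIZE and CL_DEVICE_MAX_WORK_GROUP_SIZE. All class and function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class DeviceLimits:
    """Per-device limits; query the real values from the OpenCL runtime."""
    local_mem_bytes: int
    max_work_group_size: int


def tile_local_bytes(tile_m, tile_n, tile_k, bytes_per_float=4):
    """Local memory for one A tile (tile_m x tile_k) plus one B tile (tile_k x tile_n)."""
    return (tile_m * tile_k + tile_k * tile_n) * bytes_per_float


def tile_fits(limits, tile_m, tile_n, tile_k):
    """True if the tile respects both the work-group and local-memory limits."""
    work_items = tile_m * tile_n  # assumption: one work-item per output element
    return (work_items <= limits.max_work_group_size
            and tile_local_bytes(tile_m, tile_n, tile_k) <= limits.local_mem_bytes)


def feasible_tiles(limits, candidates):
    """Filter a list of (tile_m, tile_n, tile_k) candidates down to those that fit."""
    return [c for c in candidates if tile_fits(limits, *c)]


if __name__ == "__main__":
    # Placeholder limits for two hypothetical devices.
    small = DeviceLimits(local_mem_bytes=32 * 1024, max_work_group_size=256)
    large = DeviceLimits(local_mem_bytes=48 * 1024, max_work_group_size=1024)
    candidates = [(8, 8, 16), (16, 16, 16), (32, 32, 16), (32, 32, 256)]
    print("small device:", feasible_tiles(small, candidates))
    print("large device:", feasible_tiles(large, candidates))
```

A check like this only covers the hard limits. Register spilling depends on the compiler and the kernel body, so in practice the remaining candidates still have to be benchmarked on the target GPU.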

ganyc717 / Darknet-On-OpenCL

very slow on hd graphics 5000 #1