NVIDIA / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Slow in deploying SE-ResNeXt-100 #569

Closed whria78 closed 5 years ago

whria78 commented 5 years ago

Environment

- Ubuntu 16.04
- 1050ti and 1080ti
- CUDA 10.1 / cudnn 7.5
- Test code and pretrained model: https://1drv.ms/u/s!Al4S1efzlWvXh7hQ6eD4B530c8UCKA?e=2cOvxV

Problem

- FP16 training: nvcaffe 0.17.2 is faster than FP32 BVLC caffe (with the axpy layer patch).
- FP16 testing (deploying): BVLC caffe is much faster than nvcaffe 0.17.2 (0.2 sec per forward pass) -- nvcaffe is about 10 times slower.

I also tested with /nvcaffe/python/classifier.py, but it is also slow.

```python
import os
import sys
import time

import caffe
from caffe.proto import caffe_pb2
import cv2
import numpy as np

# train_dataset = 'senext100FP16'
# train_dataset = 'senext100FP32'
train_dataset = 'vgg'

test_path = os.path.join(os.getcwd(), 'test')

IMAGE_WIDTH = 224
IMAGE_HEIGHT = 224


def transform_img(img, img_width=IMAGE_WIDTH, img_height=IMAGE_HEIGHT):
    # Histogram equalization, per channel
    img[:, :, 0] = cv2.equalizeHist(img[:, :, 0])
    img[:, :, 1] = cv2.equalizeHist(img[:, :, 1])
    img[:, :, 2] = cv2.equalizeHist(img[:, :, 2])
    return transform_img2(img, img_width, img_height)


def transform_img2(img, img_width=IMAGE_WIDTH, img_height=IMAGE_HEIGHT):
    # Image resizing
    img = cv2.resize(img, (img_width, img_height), interpolation=cv2.INTER_CUBIC)
    return img


caffe.set_mode_gpu()

mean_blob = caffe_pb2.BlobProto()
with open(os.path.join(os.getcwd(), 'model', train_dataset, 'mean224x224.binaryproto'), 'rb') as f:
    mean_blob.ParseFromString(f.read())
mean_array = np.asarray(mean_blob.data, dtype=np.float32).reshape(
    (mean_blob.channels, mean_blob.height, mean_blob.width))

net = caffe.Net(os.path.join(os.getcwd(), 'model', train_dataset, 'deploy.prototxt'),
                os.path.join(os.getcwd(), 'model', train_dataset, 'model.caffemodel'),
                caffe.TEST)
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_mean('data', mean_array)
transformer.set_transpose('data', (2, 0, 1))

for root, dirs, files in os.walk(test_path):
    for fname in files:
        ext = os.path.splitext(fname)[-1].lower()
        if ext in ('.jpg', '.jpeg', '.gif', '.png'):
            img_path = os.path.join(root, fname)
            img_org = cv2.imread(img_path, cv2.IMREAD_COLOR)
            img = transform_img(img_org, img_width=224, img_height=224)
            starttime = time.time()
            net.blobs['data'].data[...] = transformer.preprocess('data', img)
            out = net.forward()
            print(time.time() - starttime)
```
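A side note on the measurement itself: the first forward pass on a GPU framework often includes one-off costs (memory allocation, cuDNN algorithm selection), so timing single calls in a loop can overstate the steady-state latency. A minimal, framework-agnostic timing helper (the `time_forward` name and `warmup`/`runs` parameters are illustrative, not part of Caffe) might look like:

```python
import time

def time_forward(forward_fn, warmup=3, runs=10):
    """Time a callable, discarding warm-up iterations.

    The first calls typically include one-off costs such as workspace
    allocation and cuDNN autotuning, so a few warm-up iterations are
    run before measuring. Returns (best, average) in seconds.
    """
    for _ in range(warmup):
        forward_fn()
    timings = []
    for _ in range(runs):
        start = time.time()
        forward_fn()
        timings.append(time.time() - start)
    return min(timings), sum(timings) / len(timings)

# With the script above, usage would be roughly:
#   best, avg = time_forward(lambda: net.forward())
```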

whria78 commented 5 years ago

FP16 model: always slow. FP32 model: if I change default_backward_math from FLOAT to FLOAT16, it shows the same slowness.

whria78 commented 5 years ago

If I change FLOAT16 to FLOAT in the deploy.prototxt of the FP16 model, the FP16 model recovers its speed. But using FLOAT on an FP16 model does not look appropriate.

default_forward_math: FLOAT
default_backward_math: FLOAT16
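For anyone reproducing this: per the comments above, these settings sit at the net level of the NVCaffe deploy.prototxt. A sketch of the two configurations being compared (the net name is illustrative):

```
# deploy.prototxt header, NVCaffe math-mode settings
name: "SE-ResNeXt-100"

# FP16 configuration as shipped with the model (slow to deploy on Pascal):
default_forward_math: FLOAT16
default_backward_math: FLOAT16

# Workaround reported above -- forcing FLOAT restores deploy speed:
# default_forward_math: FLOAT
# default_backward_math: FLOAT
```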

drnikolaev commented 5 years ago

Hi @whria78, it seems that not all of the layers you use are implemented for the GPU. Unfortunately, FLOAT16 runs very slowly on the CPU; this is expected.

whria78 commented 5 years ago

@drnikolaev Thank you for the reply.

As you suggested, I changed "engine: CAFFE" to "engine: CUDNN", but the problem persists.

And today I found that this problem occurs only with Pascal GPUs (1080ti, 1050ti, 1070).

When I tested on my 2080 GPU, there was no problem.

Is there any chance there is a bug in the GPU-type check that enables the special acceleration for Volta GPUs?

drnikolaev commented 5 years ago

@whria78 engine: CAFFE actually stands for GPU if implemented, CPU otherwise, whereas engine: CUDNN stands for GPU + cuDNN. As for different nets: try increasing the batch size up to the limit and notice some of them getting hot (depending on the data rate consumed). For example, good old AlexNet is the most data-hungry.