dmlc / keras

Deep Learning library for Python. Convnets, recurrent neural networks, and more. Runs on MXNet, Theano or TensorFlow.
http://keras.io/

No training speed improvement when using multiple GPUs with MXNet as the backend #79

Open Wendison opened 7 years ago

Wendison commented 7 years ago

Hi, I have some questions about training speed when using multiple GPUs with MXNet as the backend for Keras. According to https://mxnet.incubator.apache.org/how_to/multi_devices.html: "By default, MXNet partitions a data batch evenly among the available GPUs. Assume a batch size b and assume there are k GPUs, then in one iteration each GPU will perform forward and backward on b/k examples. The gradients are then summed over all GPUs before updating the model." My understanding is that with a fixed batch size b, each GPU computes gradients on only b/k examples, which should take less time than computing gradients on all b examples with a single GPU. As a result, for the same batch size, each weight update should be faster with multiple GPUs than with a single GPU. In my experiments, however, training with multiple GPUs is slower than training with a single GPU.
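For reference, a minimal NumPy sketch (purely illustrative, not keras-mxnet code) of the data-parallel arithmetic described in the quoted documentation: a batch of b examples is split into k shards, each shard's gradient is computed independently, and the sum of the shard gradients equals the gradient over the full batch, so the update itself is mathematically unchanged and only the wall-clock cost of the per-shard compute and the gradient exchange differs.

import numpy as np

np.random.seed(0)
b, k, d = 128, 4, 10  # batch size, number of "devices", feature dimension
X = np.random.randn(b, d)
y = np.random.randn(b)
w = np.random.randn(d)

def grad_sse(X, y, w):
    # gradient of the sum-of-squared-errors loss 0.5*||Xw - y||^2 w.r.t. w
    return X.T.dot(X.dot(w) - y)

# single-device gradient over the full batch
g_single = grad_sse(X, y, w)

# "multi-device" gradient: each shard of b/k examples computes its own
# gradient, and the shard gradients are summed before the weight update
shards = zip(np.array_split(X, k), np.array_split(y, k))
g_multi = sum(grad_sse(Xs, ys, w) for Xs, ys in shards)

print(np.allclose(g_single, g_multi))  # True: same update as a single device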

Below is part of my code, which uses a fully-connected network:

model = Sequential()
model.add(Dropout(0.1, input_shape=(2056,)))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(257))
model.summary()

opt = SGD()

NUM_GPU = 4
gpu_list = []
for i in range(NUM_GPU):
    gpu_list.append('gpu(%d)' % i)

batch_size = 128
model.compile(loss=my_loss, optimizer=opt, context=gpu_list)

I don't know whether my understanding is correct. Why is no speed improvement obtained with multiple GPUs? Can anyone help with these questions? Thanks!

Wendison commented 7 years ago

Below is the training process with 1 GPU and with 4 GPUs, respectively:

[screenshot: training log with 1 GPU]

[screenshot: training log with 4 GPUs]

It seems that training with 4 GPUs converges faster, but each epoch takes more time.

kevinthesun commented 7 years ago

Can you provide the full code for your experiment? Sometimes multi-GPU training doesn't give any speedup, and can even slow training down, because of the overhead of communication between devices.
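As a rough back-of-the-envelope sketch of that overhead (my own estimate, assuming float32 gradients and the layer sizes in the full code posted below, with a 607-unit output): every iteration the gradients of all weights have to be exchanged between the GPUs, which for this model is roughly 90 MB per step, while each GPU only does the forward/backward work of 128/4 = 32 examples.

# rough estimate of gradient traffic per iteration for the posted model
# (2056 -> 2800 -> 2800 -> 2800 -> 607, float32 weights); illustrative only
layers = [(2056, 2800), (2800, 2800), (2800, 2800), (2800, 607)]
params = sum(fan_in * fan_out + fan_out for fan_in, fan_out in layers)
print("parameters: %.1fM" % (params / 1e6))                      # ~23.1M
print("gradient bytes per sync: %.1f MB" % (params * 4 / 1e6))   # ~92.6 MB

With so little compute per GPU per step, this synchronization cost can easily dominate the iteration time.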

Wendison commented 7 years ago

Ok, my code is shown as follows:

import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.optimizers import SGD
import numpy as np
from sklearn import preprocessing
import random
from keras import backend as K 

def my_loss(y_true,y_pred):
    term1=K.sum(K.square(y_pred[:,:257]-y_true[:,:257]),axis=-1)
    term2=K.sum(K.square(y_pred[:,257:350]-y_true[:,257:350]),axis=-1)
    term3=K.sum(K.square(y_pred[:,350:]-y_true[:,350:]),axis=-1)
    return 0.5*term1+0.3*term2+0.2*term3

data_dir='/work/Wendison/training_data/'
NameX=[]
NameY=[]
Numxy=[]
## As the training data is too large (>100 GB), I divided it into 20 file pairs (input + label)
for j in range(1,21):  
    NameX.append(data_dir+'Xtrain'+str(j)+'.npy') # the path for input data of DNN
    NameY.append(data_dir+'Ytrain'+str(j)+'.npy') # the path for label data of DNN
    Numxy.append(data_dir+'Num'+str(j)+'.npy') # the path for number of samples for each file

meanx=np.load('meanx.npy')
stdx=np.load('stdx.npy')
meany=np.load('meany.npy')
stdy=np.load('stdy.npy')

scalerx=preprocessing.StandardScaler()
scalery=preprocessing.StandardScaler()
scalerx.mean_=meanx
scalerx.scale_=stdx
scalery.mean_=meany
scalery.scale_=stdy

##use the last data pair as the validation data
tempx=np.load(NameX[-1])
tempy=np.load(NameY[-1])
X_val=scalerx.transform(tempx)
Y_val=scalery.transform(tempy)
NameX.pop()
NameY.pop()
Numxy.pop()  # drop the validation pair from all three training-file lists

batch_size=128
Num=len(Numxy)
numall=0
for i in range(len(Numxy)):
    nn=np.load(Numxy[i])
    numall+=sum(nn) # compute the number of overall training samples

##define a data generator to read training data
def mygenerator(batch_size=batch_size):
    while 1:  # loop forever so the generator never runs out of batches
        file_order = list(range(Num))
        random.shuffle(file_order)  # shuffle the order of training files
        for i in file_order:
            tempx = np.load(NameX[i])
            tempy = np.load(NameY[i])
            X_train = scalerx.transform(tempx)
            Y_train = scalery.transform(tempy)
            orde = list(range(X_train.shape[0]))
            random.shuffle(orde)  # shuffle the order of samples in each data file
            X_train = X_train[orde, :]
            Y_train = Y_train[orde, :]
            numb = X_train.shape[0] // batch_size  # number of batches in this file
            for ii in range(numb):
                if ii < numb - 1:
                    yield X_train[ii*batch_size:(ii+1)*batch_size, :], Y_train[ii*batch_size:(ii+1)*batch_size, :]
                else:
                    yield X_train[ii*batch_size:, :], Y_train[ii*batch_size:, :]

##model definition
model = Sequential()
model.add(Dropout(0.1,input_shape=(2056,)))
model.add(Dense(2800,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(607))
model.summary()

opt=SGD()

NUM_GPU = 4
gpu_list = []
for i in range(NUM_GPU):
    gpu_list.append('gpu(%d)' % i)

model.compile(loss=my_loss,optimizer=opt, context=gpu_list)

mygen=mygenerator()
for i in range(1,101):
    model.fit_generator(mygen,samples_per_epoch=numall, nb_epoch=1, verbose=1, 
                        validation_data=(X_val, Y_val))

The training data is very large (>100 GB), so I divided it into 20 file pairs and load them in turn during each epoch using a Keras generator. Is that related to the training speed with multiple GPUs? Thanks! @kevinthesun
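One way to check the data side (a minimal sketch built on the code above; the number of sampled batches is an arbitrary choice) is to time how long the generator alone takes to produce batches, independent of the model:

import time

gen = mygenerator()
n_batches = 20  # arbitrary number of batches to sample
start = time.time()
for _ in range(n_batches):
    next(gen)
elapsed = time.time() - start
print("data pipeline: %.3f s per batch on average" % (elapsed / n_batches))

Note that the first batch includes loading and scaling an entire .npy file, so this only gives a rough amortized figure.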

kevinthesun commented 7 years ago

@Wendison You can benchmark pure training time without data IO to see if data IO is the bottleneck.
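A minimal sketch of such a benchmark, assuming the same dmlc/keras fork in which compile() accepts a context list of GPUs: rebuild the model, feed it a synthetic in-memory array, and time one epoch with 1 GPU versus 4 GPUs. The array sizes and the built-in 'mse' loss (used instead of my_loss just to keep the sketch self-contained) are arbitrary; the point is to isolate compute plus gradient synchronization from disk I/O.

import time
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.optimizers import SGD

def build_model(context):
    model = Sequential()
    model.add(Dropout(0.1, input_shape=(2056,)))
    model.add(Dense(2800, activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(2800, activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(2800, activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(607))
    model.compile(loss='mse', optimizer=SGD(), context=context)
    return model

# synthetic in-memory data: no disk reads, no scaling, no generator
X = np.random.randn(12800, 2056).astype('float32')
Y = np.random.randn(12800, 607).astype('float32')

for context in (['gpu(0)'], ['gpu(%d)' % i for i in range(4)]):
    model = build_model(context)
    start = time.time()
    model.fit(X, Y, batch_size=128, nb_epoch=1, verbose=0)
    print('%d GPU(s): %.1f s per epoch' % (len(context), time.time() - start))

If the 1-GPU and 4-GPU times are still similar here, the per-step gradient synchronization is the cost, and a common remedy is to scale batch_size with the number of GPUs so that each device still processes 128 examples per step; if the 4-GPU run is clearly faster here, the data pipeline in the full script is the bottleneck.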