Hi @dbsxdbsx, thanks for posting demo code. Looking quickly at your code, I would guess you're correct: it's probably data-iterator related. If you run htop while training, do you see one thread with near 100% CPU usage?
@KellenSunderland, thanks for your answer. This time I switched to another GPU EC2 instance, a p2.xlarge. Anyway, this change shouldn't make a big difference.
As you can see, there are 6 threads running this script, and one shows 101, which I guess means over 100% usage? And without exception, the GPU usage is still low.
On the contrary, running the MNIST example below gives over 50% GPU usage, and it really depends on batch_size:
import mxnet as mx
import argparse


def parse_args(description):
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument('--batch_size', dest='batch_size', type=int, default=8)
    parser.add_argument('--train_exp_num', dest='train_exp_num', type=int, default=2000)
    parser.add_argument('--epoch_num', dest='epoch_num', type=int, default=50)
    parser.add_argument('--gpu_num', dest='gpu_num', type=int, default=1)
    parser.add_argument('--lr', dest='lr', type=float, default=0.00075)
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_args('for mnist')
    mnist = mx.test_utils.get_mnist()
    batch_size = args.batch_size  # 100
    train_iter = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], batch_size, shuffle=True)
    val_iter = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)

    data = mx.sym.var('data')
    # first conv layer
    conv1 = mx.sym.Convolution(data=data, kernel=(5, 5), num_filter=20)
    tanh1 = mx.sym.Activation(data=conv1, act_type="tanh")
    pool1 = mx.sym.Pooling(data=tanh1, pool_type="max", kernel=(2, 2), stride=(2, 2))
    # second conv layer
    conv2 = mx.sym.Convolution(data=pool1, kernel=(5, 5), num_filter=50)
    tanh2 = mx.sym.Activation(data=conv2, act_type="tanh")
    pool2 = mx.sym.Pooling(data=tanh2, pool_type="max", kernel=(2, 2), stride=(2, 2))
    # first fully connected layer
    flatten = mx.sym.flatten(data=pool2)
    fc1 = mx.symbol.FullyConnected(data=flatten, num_hidden=500)
    tanh3 = mx.sym.Activation(data=fc1, act_type="tanh")
    # second fully connected layer
    fc2 = mx.sym.FullyConnected(data=tanh3, num_hidden=10)
    # softmax loss
    lenet = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

    import logging
    logging.getLogger().setLevel(logging.DEBUG)  # logging to stdout

    # create a trainable module on GPU 0
    lenet_model = mx.mod.Module(symbol=lenet, context=mx.gpu())
    # train with the same settings as before
    lenet_model.fit(train_iter,
                    eval_data=val_iter,
                    optimizer='sgd',
                    optimizer_params={'learning_rate': 0.1},
                    eval_metric='acc',
                    batch_end_callback=mx.callback.Speedometer(batch_size, 100),
                    num_epoch=10)
So I wonder how to make full use of the GPU. As far as I know, increasing batch_size helps, but not for my script, and the GPU usage often fluctuates between 0 and 19 in no time. I guess it may be because my DataIter creates data for every batch in real time? If this is the problem, how do I fix it?
Could we switch to some faster way of feeding data? I didn't see any way to do multi-threading with a DataIter, and I don't think PrefetchingIter works here because the data is already in memory.
Any suggestions are welcome. Thank you!
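For reference, this is a minimal sketch of what wrapping a generate-on-the-fly iterator in mx.io.PrefetchingIter would look like; OnlineIter and its shapes here are made-up stand-ins for an iterator that synthesizes batches on the fly, not my actual captcha iterator:

import numpy as np
import mxnet as mx

class OnlineIter(mx.io.DataIter):
    # toy iterator that synthesizes every batch on the CPU (stand-in for captcha generation)
    def __init__(self, batch_size=8, num_batches=100):
        super(OnlineIter, self).__init__()
        self.batch_size = batch_size
        self.num_batches = num_batches
        self.cur = 0
        self.provide_data = [('data', (batch_size, 1, 28, 28))]
        self.provide_label = [('softmax_label', (batch_size,))]

    def reset(self):
        self.cur = 0

    def next(self):
        if self.cur >= self.num_batches:
            raise StopIteration
        self.cur += 1
        # the expensive per-batch generation would happen here
        data = mx.nd.array(np.random.rand(*self.provide_data[0][1]))
        label = mx.nd.array(np.random.randint(0, 10, self.batch_size))
        return mx.io.DataBatch(data=[data], label=[label], pad=0)

# PrefetchingIter runs the wrapped iterator's next() in a background thread,
# overlapping CPU-side generation with GPU compute
prefetch_iter = mx.io.PrefetchingIter(OnlineIter(batch_size=8))
for batch in prefetch_iter:
    # in a real script this iterator would be passed to Module.fit() instead
    pass

Even with this, a single prefetch thread may not hide the generation cost.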
@dbsxdbsx I suggest that you switch to Gluon and use a DataLoader with num_workers > 1.
The bottleneck seems to be your image generation, which is currently done synchronously rather than asynchronously. With Gluon you could simply subclass the Dataset class to generate your captcha asynchronously using a DataLoader. That should solve your I/O issue and you should witness an increase in GPU utilization.
@ThomasDelteil, thanks for your answer. I think the problem here is that the captcha dataset is generated online while training, so the GPU has to wait until the CPU generates each new batch of data. I guess this is the main issue, and I guess this is what you mean, right? Another problem is that I didn't see any example of using the Gluon DataLoader with data produced online. Could you show one? I DO WANT to make everything work with DataLoader.
@dbsxdbsx It's quite easy to use DataLoader with a custom Dataset object. Your custom Dataset class needs to implement only two functions: __len__() and __getitem__(). You can then easily use DataLoader with num_workers > 1. Here is a dummy example of a custom dataset class that contains 1000 elements, each one being its index plus some random noise:
from mxnet import nd

class MyRandomDataset(object):
    def __getitem__(self, idx):
        return nd.array([idx]) + nd.random.normal()

    def __len__(self):
        return 1000
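To make the num_workers > 1 part concrete, here is a short usage sketch (the batch size and worker count are just illustrative); your expensive captcha generation would go inside __getitem__, and the DataLoader worker processes then build batches in parallel with training:

from mxnet import gluon

dataset = MyRandomDataset()
# num_workers > 1 spawns worker processes that call __getitem__ in parallel,
# so per-sample generation no longer blocks the training loop
loader = gluon.data.DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)

for batch in loader:
    # the default batchify stacks the 8 samples into one NDArray of shape (8, 1)
    pass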
@safrooze Thanks
I tested my .py on an AWS EC2 p3.2xlarge (GPU: V100), AMI: ami-77eb3a0f, Python version: 2.7. The .py is as follows:
On my own host (Win10, MXNet 0.12, GPU: 940M) I get nearly 110 samples/second with the default params, but surprisingly, on the p3.2xlarge I get only 170 samples/second. In detail, with watch -n 1 nvidia-smi, I found the volatile GPU util is always near 0%, up to 4%. WHY??? Is that just because I use a custom DataIter?