apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

request I/O multiprocessing error #7593

Open L1aoXingyu opened 6 years ago

L1aoXingyu commented 6 years ago

For bugs or installation issues, please provide the following information. The more information you provide, the more likely people will be able to help you.

Environment info

Operating System: 16.04.2 LTS

Compiler:

Package used (Python/R/Scala/Julia): python

MXNet version: mxnet-cu80 0.11

Error Message:

I expected Gluon to be faster than PyTorch, or at least as fast. I wrote a small LeNet network in both Gluon and PyTorch with the same hyperparameters and ran each for 20 epochs. The total time for PyTorch is 69.515576 s, but for Gluon it is 175.097399 s. Gluon seems much slower than PyTorch. I don't know whether I wrote the Gluon code incorrectly.

Here is my code for both versions.

PyTorch

import torch
import torchvision
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms
from torch.autograd import Variable
from torch import optim
import torch.nn as nn
import torch.nn.functional as F
import time

learning_rate = 1e-3
batch_size = 64
epoches = 20

trans_img = transforms.ToTensor()

trainset = MNIST('./data', train=True, transform=trans_img)
testset = MNIST('./data', train=False, transform=trans_img)

trainloader = DataLoader(
    trainset, batch_size=batch_size, shuffle=True, num_workers=4)
testloader = DataLoader(
    testset, batch_size=batch_size, shuffle=False, num_workers=4)

# build network
class Lenet(nn.Module):
    def __init__(self):
        super(Lenet, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 6, 3, stride=1, padding=1),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5, stride=1, padding=0), nn.MaxPool2d(2, 2))

        self.fc = nn.Sequential(
            nn.Linear(400, 120), nn.Linear(120, 84), nn.Linear(84, 10))

    def forward(self, x):
        out = self.conv(x)
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out

lenet = Lenet()
lenet.cuda()

criterian = nn.CrossEntropyLoss(size_average=False)
optimizer = optim.SGD(lenet.parameters(), lr=learning_rate)

# train
start = time.time()
for i in range(epoches):
    running_loss = 0.
    running_acc = 0.
    for (img, label) in trainloader:
        img = Variable(img).cuda()
        label = Variable(label).cuda()

        optimizer.zero_grad()
        output = lenet(img)
        loss = criterian(output, label)
        # backward
        loss.backward()
        optimizer.step()

        running_loss += loss.data[0]
        _, predict = torch.max(output, 1)
        correct_num = (predict == label).sum()
        running_acc += correct_num.data[0]

    running_loss /= len(trainset)
    running_acc /= len(trainset)
    print("[%d/%d] Loss: %.5f, Acc: %.2f" % (i + 1, epoches, running_loss,
                                             100 * running_acc))

print('Time {:.6f}'.format(time.time() - start))

Gluon

import time

import mxnet as mx
import mxnet.gluon as g
import numpy as np

# define hyperparameters
batch_size = 64
learning_rate = 1e-3
epochs = 20
step = 300
ctx = mx.gpu()

# define data transform
def data_transform(data, label):
    return mx.nd.transpose(data.astype(np.float32) / 255,
                           (2, 0, 1)), label.astype(np.float32)

# define dataset and dataloader
train_dataset = g.data.vision.MNIST(transform=data_transform)
test_dataset = g.data.vision.MNIST(train=False, transform=data_transform)

train_loader = g.data.DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True)
test_loader = g.data.DataLoader(
    test_dataset, batch_size=batch_size, shuffle=False)

# define model
lenet = g.nn.Sequential(prefix='lenet_')
with lenet.name_scope():
    lenet.add(g.nn.Conv2D(6, 3, strides=1, padding=1))
    lenet.add(g.nn.MaxPool2D(2, 2))
    lenet.add(g.nn.Conv2D(16, 5, strides=1))
    lenet.add(g.nn.MaxPool2D(2, 2))
    lenet.add(g.nn.Flatten())
    lenet.add(g.nn.Dense(120))
    lenet.add(g.nn.Dense(84))
    lenet.add(g.nn.Dense(10))

lenet.collect_params().initialize(mx.init.Xavier(), ctx=ctx)

criterion = g.loss.SoftmaxCrossEntropyLoss()
optimizer = g.Trainer(lenet.collect_params(), 'sgd',
                      {'learning_rate': learning_rate})

# start train
start = time.time()
for e in range(epochs):
    print('*' * 10)
    print('epoch {}'.format(e + 1))
    moving_loss = 0.0
    moving_acc = 0.0
    for i, (img, label) in enumerate(train_loader, 1):
        img = img.as_in_context(ctx)
        label = label.as_in_context(ctx)
        with g.autograd.record():
            output = lenet(img)
            loss = criterion(output, label)
        loss.backward()
        optimizer.step(img.shape[0])
        # =========== keep average loss and accuracy ==============
        moving_loss += mx.nd.mean(loss).asscalar()
        predict = mx.nd.argmax(output, axis=1)
        acc = mx.nd.mean(predict == label).asscalar()
        moving_acc += acc

        if i % step == 0:
            print('[{}/{}] Loss: {:.6f}, Acc: {:.6f}'.format(
                i, len(train_loader), moving_loss / step, moving_acc / step))
            moving_loss = 0.0
            moving_acc = 0.0
print('Time {:.6f} s'.format(time.time() - start))

szha commented 6 years ago

Try using hybrid layers (i.e. lenet = g.nn.HybridSequential(prefix='lenet_')) and hybridizing the network (i.e. lenet.hybridize()).

L1aoXingyu commented 6 years ago

I will try this and give you feedback. But I am still confused: why is Gluon's imperative graph so much slower than PyTorch?

L1aoXingyu commented 6 years ago

@szha I changed Sequential to HybridSequential. It is a little faster than Sequential, going from 175 s to 168 s, but it's still much slower than PyTorch. I also checked that the model runs on the GPU. If there is nothing wrong in my code, I think there must be some problem in Gluon. Could you please tell me the reason? As far as I know, Gluon is supposed to be faster than PyTorch even with Sequential rather than HybridSequential.

L1aoXingyu commented 6 years ago

I noticed that the PyTorch DataLoader has a parameter named num_workers, but the Gluon DataLoader has no such parameter. This parameter enables multiprocess data loading. If I set num_workers = 0, PyTorch needs about 100 s. So I think this is one of the reasons why Gluon is much slower than PyTorch. But even at 100 s, PyTorch is still faster than Gluon, so I think there may be other problems.

szha commented 6 years ago

Yes, I'm guessing it's the I/O that can be improved. @piiswrong @zhreshold

zhreshold commented 6 years ago

@SherlockLiao How many gpus do you have? And what model is it? I could try to debug it.

L1aoXingyu commented 6 years ago

@zhreshold Just one GPU. It's a simple model: 2 convolution layers, 2 max pooling layers, and 3 dense layers for MNIST classification. I wrote the same code using mx.sym, and it is very fast, about 20 s. I think there must be a problem in Gluon.

zhreshold commented 6 years ago

@piiswrong I've tested the Gluon code; the problem is the data transform. Forward/backward/optimizer takes 70 s on a p2, while pure I/O without any network inference takes 200 s. If I use dummy data,

train_dataset = g.data.ArrayDataset(mx.nd.zeros((50000, 1, 28, 28)), mx.nd.zeros((50000,1)))
test_dataset = g.data.ArrayDataset(mx.nd.zeros((10000, 1, 28, 28)), mx.nd.zeros((10000, 1)))

20 epochs finish in 80 s. I guess @SherlockLiao can get better results on his machine.

The transform is executed on the Python main thread, which is why it's slow.
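The "pure I/O without any network inference" measurement can be reproduced with a small timing loop. The following is only a minimal sketch: `time_loader` is a hypothetical helper name, and a plain list stands in for the real DataLoader. With MXNet, operations are queued asynchronously, so a real measurement would probably also need to wait on each batch (for example with `wait_to_read()`) before stopping the clock.

```python
import time


def time_loader(loader, epochs=1):
    """Iterate over `loader` without running any network.

    Returns (elapsed_seconds, number_of_batches_seen).
    """
    batches = 0
    start = time.perf_counter()
    for _ in range(epochs):
        for _batch in loader:
            batches += 1
    return time.perf_counter() - start, batches


# Stand-in for a real DataLoader: any re-iterable object works here.
fake_loader = [([0.0] * 784, 0) for _ in range(100)]

elapsed, batches = time_loader(fake_loader, epochs=2)
print('I/O only: {:.6f} s for {} batches'.format(elapsed, batches))
```

Comparing this number with the time of the full training loop gives a rough idea of how much of the total is spent in loading and transforming data rather than in forward/backward/optimizer steps.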

L1aoXingyu commented 6 years ago

@zhreshold Can you give me some suggestions about how to do data transform?
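One possible direction, not confirmed by anyone in this thread, is to apply the transform once to the whole dataset instead of once per sample inside the loader. The sketch below uses NumPy only. `transform_all` is a hypothetical name, and the (N, 28, 28, 1) input layout is an assumption based on the per-sample `transpose(..., (2, 0, 1))` in the Gluon code above. Random data stands in for MNIST.

```python
import numpy as np


def transform_all(images, labels):
    """Convert a whole (N, H, W, C) uint8 image array in one pass.

    Mirrors the per-sample data_transform: scale to [0, 1] as float32
    and move channels first, giving shape (N, C, H, W).
    """
    data = images.astype(np.float32) / 255
    data = np.transpose(data, (0, 3, 1, 2))
    return data, labels.astype(np.float32)


# Random stand-in for MNIST-shaped raw data (assumed layout: N x 28 x 28 x 1).
raw_images = np.random.randint(0, 256, size=(8, 28, 28, 1), dtype=np.uint8)
raw_labels = np.random.randint(0, 10, size=(8,))

data, labels = transform_all(raw_images, raw_labels)
print(data.shape, data.dtype, labels.dtype)
```

The resulting arrays could then be wrapped in `g.data.ArrayDataset`, as in the dummy-data snippet above, so that the loader no longer calls a Python transform for every sample. How to obtain the raw arrays from the dataset object depends on the MXNet version and is left out here.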

ZhichengHuang commented 6 years ago

@piiswrong This issue hasn't been solved. Can you give me some suggestions on how to make the transform faster, or how to solve this issue? Thank you.

SCP-173-cool commented 6 years ago

I found the same issue. Gluon is slower than the traditional MXNet API.