apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Training convolutional style MLP slower on GPU than on CPU #2011

Closed Piyush3dB closed 8 years ago

Piyush3dB commented 8 years ago

Hello,

I've set up a simple MLP network using convolutional layers (mx.symbol.Convolution) instead of fully connected layers (mx.symbol.FullyConnected). The network definition is as follows:

import mxnet as mx

def get_mlpcn():
    """
    multi-layer perceptron using 1x1 convolutions
    """
    data = mx.symbol.Variable('data')
    # a 28x28 kernel collapses the MNIST image to a 1x1 spatial map, so this acts like an FC layer
    fc1  = mx.symbol.Convolution(data=data, kernel=(28, 28), num_filter=128)
    act1 = mx.symbol.Activation(data=fc1, name='relu1', act_type="relu")
    # subsequent 1x1 convolutions on a 1x1 map are also FC-equivalent
    fc2  = mx.symbol.Convolution(data=act1, kernel=(1, 1), num_filter=64)
    act2 = mx.symbol.Activation(data=fc2, name='relu2', act_type="relu")
    fc3  = mx.symbol.Convolution(data=act2, kernel=(1, 1), num_filter=10)
    flt  = mx.symbol.Flatten(data=fc3)
    mlp  = mx.symbol.SoftmaxOutput(data=flt, name='softmax')

    group = mx.symbol.Group([data, fc1, act1, fc2, act2, fc3, flt, mlp])
    return mlp, group

I've done the shape inference and verified everything is correct.
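(For reference, the check I mean is along these lines; the batch size of 64 is just an example:)

    mlp, group = get_mlpcn()
    arg_shapes, out_shapes, aux_shapes = mlp.infer_shape(data=(64, 1, 28, 28))
    print(dict(zip(mlp.list_arguments(), arg_shapes)))   # weight/bias shapes per layer
    print(out_shapes)                                    # softmax output: (64, 10)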

The problem I have is that this takes much longer to train on the GPU (GTX 980Ti) than on the CPU (i5-6500).

For one epoch of MNIST training, the GPU performance is:

2016-05-02 08:53:27,731 Node[0] Epoch[0] Resetting Data Iterator
2016-05-02 08:53:27,732 Node[0] Epoch[0] Train-accuracy=0.911508
2016-05-02 08:53:27,732 Node[0] Epoch[0] Train-top_k_accuracy_5=0.990986
2016-05-02 08:53:27,732 Node[0] Epoch[0] Train-top_k_accuracy_10=1.000000
2016-05-02 08:53:27,732 Node[0] Epoch[0] Train-top_k_accuracy_20=1.000000
2016-05-02 08:53:27,732 Node[0] Epoch[0] Time cost=144.460
2016-05-02 08:53:28,175 Node[0] Epoch[0] Validation-accuracy=0.958734
2016-05-02 08:53:28,175 Node[0] Epoch[0] Validation-top_k_accuracy_5=0.999099
2016-05-02 08:53:28,175 Node[0] Epoch[0] Validation-top_k_accuracy_10=1.000000
2016-05-02 08:53:28,175 Node[0] Epoch[0] Validation-top_k_accuracy_20=1.000000

Whereas for the CPU it is:

2016-05-02 08:54:47,443 Node[0] Epoch[0] Resetting Data Iterator
2016-05-02 08:54:47,444 Node[0] Epoch[0] Train-accuracy=0.911508
2016-05-02 08:54:47,444 Node[0] Epoch[0] Train-top_k_accuracy_5=0.990986
2016-05-02 08:54:47,444 Node[0] Epoch[0] Train-top_k_accuracy_10=1.000000
2016-05-02 08:54:47,444 Node[0] Epoch[0] Train-top_k_accuracy_20=1.000000
2016-05-02 08:54:47,444 Node[0] Epoch[0] Time cost=66.001
2016-05-02 08:54:54,612 Node[0] Epoch[0] Validation-accuracy=0.958734
2016-05-02 08:54:54,612 Node[0] Epoch[0] Validation-top_k_accuracy_5=0.999099
2016-05-02 08:54:54,612 Node[0] Epoch[0] Validation-top_k_accuracy_10=1.000000
2016-05-02 08:54:54,612 Node[0] Epoch[0] Validation-top_k_accuracy_20=1.000000

The time cost on the GPU is more than twice that on the CPU!

Does anyone know why this is happening, and how I can debug it to find where the bottlenecks are?

Many thanks!

pluskid commented 8 years ago

There is communication overhead between GPU and CPU (memory). If your model is too tiny, it is not worth running on GPUs. If you try at least CIFAR-10 (or maybe even LeNet on MNIST) you will see a difference.

Piyush3dB commented 8 years ago

@pluskid thanks for your reply.

I've performed another experiment training the same MLP network on the GPU, but this time formulating it with fully connected layers (mx.symbol.FullyConnected) instead of convolutional layers (mx.symbol.Convolution), as follows:

def get_mlp():
    """
    multi-layer perceptron using fully connected layers
    """
    data = mx.symbol.Variable('data')
    fc1  = mx.symbol.FullyConnected(data=data, name='fc1', num_hidden=128)
    act1 = mx.symbol.Activation(data=fc1, name='relu1', act_type="relu")
    fc2  = mx.symbol.FullyConnected(data=act1, name='fc2', num_hidden=64)
    act2 = mx.symbol.Activation(data=fc2, name='relu2', act_type="relu")
    fc3  = mx.symbol.FullyConnected(data=act2, name='fc3', num_hidden=10)
    mlp  = mx.symbol.SoftmaxOutput(data=fc3, name='softmax')

    group = mx.symbol.Group([data, fc1, act1, fc2, act2, fc3, mlp])
    return mlp, group

This formulation trains much faster (a time cost of 0.8 instead of 144):

2016-05-02 17:47:56,896 Node[0] Epoch[0] Resetting Data Iterator
2016-05-02 17:47:56,896 Node[0] Epoch[0] Train-accuracy=0.908437
2016-05-02 17:47:56,896 Node[0] Epoch[0] Train-top_k_accuracy_5=0.990485
2016-05-02 17:47:56,897 Node[0] Epoch[0] Train-top_k_accuracy_10=1.000000
2016-05-02 17:47:56,897 Node[0] Epoch[0] Train-top_k_accuracy_20=1.000000
2016-05-02 17:47:56,897 Node[0] Epoch[0] Time cost=0.817
2016-05-02 17:47:56,975 Node[0] Epoch[0] Validation-accuracy=0.958233
2016-05-02 17:47:56,975 Node[0] Epoch[0] Validation-top_k_accuracy_5=0.998998
2016-05-02 17:47:56,975 Node[0] Epoch[0] Validation-top_k_accuracy_10=1.000000
2016-05-02 17:47:56,975 Node[0] Epoch[0] Validation-top_k_accuracy_20=1.000000

So I'm curious why an MLP built from 1x1 convolutions is so much slower than the fully connected variant, when the two are mathematically the same!
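
(As a quick sanity check of that claim, here is a numpy comparison with made-up toy shapes, showing that a 1x1 convolution over a 1x1 spatial map reduces to the same matrix multiply an FC layer would do:)

    import numpy as np

    # toy shapes: a batch of 4 activations with 64 channels on a 1x1 spatial map
    batch, channels, num_out = 4, 64, 10
    x = np.random.randn(batch, channels, 1, 1)        # input to the 1x1 conv / FC layer
    w = np.random.randn(num_out, channels, 1, 1)      # conv weights: one 1x1 kernel per output
    b = np.random.randn(num_out)

    # 1x1 convolution over a 1x1 map: per-sample dot product with each filter
    conv_out = np.einsum('nchw,kchw->nk', x, w) + b

    # fully connected layer with the same weights, after flattening
    fc_out = x.reshape(batch, channels).dot(w.reshape(num_out, channels).T) + b

    print(np.allclose(conv_out, fc_out))              # True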

Two questions:

  1. Do you agree both networks (get_mlp() and get_mlpcn()) are mathematically equivalent?
  2. If yes, why does the convolutional one take so much longer to train with the same parameters? Is there something different about the convolutional implementation? FYI, LeNet has Time cost=2.099 for the same training parameters.

Many thanks!

pluskid commented 8 years ago

Similarly, solving Ax=b via Gaussian elimination and via computing inv(A)*b are mathematically equivalent, but one is much slower than the other. Hope this helps clarify the puzzle.
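
(A rough illustration of the point; the matrix size is arbitrary:)

    import time
    import numpy as np

    A = np.random.randn(2000, 2000)
    b = np.random.randn(2000)

    t0 = time.time(); x1 = np.linalg.solve(A, b); t1 = time.time()      # elimination-style solve
    t2 = time.time(); x2 = np.linalg.inv(A).dot(b); t3 = time.time()    # explicit inverse route

    print(np.allclose(x1, x2))       # same answer: mathematically equivalent
    print(t1 - t0, t3 - t2)          # but typically very different running times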

Piyush3dB commented 8 years ago

@pluskid I see what you mean, thanks. Hopefully I'll start to understand the implementation differences as I get more familiar with the code, and be able to profile to see exactly where the bottlenecks are in this experimental setup.
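
(For anyone else looking at this, a crude first pass at profiling, assuming the two symbols defined above and MNIST-shaped input; the batch size and iteration count are arbitrary. It binds each symbol on a given context and times raw forward/backward passes:)

    import time
    import mxnet as mx

    def time_symbol(sym, ctx, batch_size=64, n_iter=100):
        # bind with a dummy MNIST-shaped batch; simple_bind allocates the argument arrays
        exe = sym.simple_bind(ctx=ctx, data=(batch_size, 1, 28, 28))
        exe.forward(is_train=True); exe.backward(); mx.nd.waitall()   # warm-up pass
        start = time.time()
        for _ in range(n_iter):
            exe.forward(is_train=True)
            exe.backward()
        mx.nd.waitall()   # the engine is asynchronous, so wait before stopping the clock
        return (time.time() - start) / n_iter

    mlp_fc, _ = get_mlp()
    mlp_cn, _ = get_mlpcn()
    for name, sym in [('fc', mlp_fc), ('1x1 conv', mlp_cn)]:
        print(name, 'cpu:', time_symbol(sym, mx.cpu()), 'gpu:', time_symbol(sym, mx.gpu(0)))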