NervanaSystems / neon

Intel® Nervana™ reference deep learning framework committed to best performance on all hardware
http://neon.nervanasys.com/docs/latest
Apache License 2.0

CPU backend on Version 1.1 is slower than 0.9?! #125

Closed: fariasfc closed this issue 8 years ago

fariasfc commented 8 years ago

Hello! First of all, congratulations to the whole team developing neon!

I was using version 0.9 and it was really fast on the CPU. I updated neon to version 1.1 and, with the same model architecture, each batch now takes ~5 seconds. On 0.9, hundreds of batches were completed in less than 1 s.

I am using Convolutional Layers in my model.

Did the CPU backend become slower in order to support autodiff?

Is there anything I can do, or would only a downgrade give me back the prior speed?

Thank you!

apark263 commented 8 years ago

So, conservatively speaking, you are noticing a slowdown of about three orders of magnitude? (200+ batches per second before vs. about 0.2 batches per second now.)

Could you post your model architecture? It is possible that we slowed CPU performance in the course of updating the backends (0.9 to 1.0 was a complete rewrite), but I didn't think it was that bad.


fariasfc commented 8 years ago

Yes, @apark263! I am also surprised...

It's a very simple model...

init = GlorotUniform()
layers = [Conv(fshape=(1, 3, self.nb_channels * 2), init=init, activation=Rectlin()),
          Pooling(op='max', fshape={'str_h': 1, 'str_w': 2}),
          Affine(nout=self.nb_classes, init=init, activation=Logistic())]

self.cost = GeneralizedCost(costfunc=CrossEntropyMulti())
self.optimizer = Adadelta(decay=0.9, epsilon=1e-10)
self.model = Model(layers=layers, optimizer=self.optimizer)

I've printed some infos during each batch run:

CPUTensor(base 0x7f49ad4e9f80) name:None shape:(2048, 128) dtype:<type 'numpy.float32'> strides:(512, 4) is_c_contiguous:True
batch: (2048, 128)  fprop in 2.05480408669  bprop in 10.1289958954  optimize in 0.0104749202728
Epoch 0 [Train | | 1/5909 batches, 0.03 cost, 12.34s]
CPUTensor(base 0x7f49ad4e9f80) name:None shape:(2048, 128) dtype:<type 'numpy.float32'> strides:(512, 4) is_c_contiguous:True
batch: (2048, 128)  fprop in 2.08032488823  bprop in 9.82120990753  optimize in 0.0126368999481
Epoch 0 [Train | | 2/5909 batches, 0.00 cost, 24.30s]
CPUTensor(base 0x7f49ad4e9f80) name:None shape:(2048, 128) dtype:<type 'numpy.float32'> strides:(512, 4) is_c_contiguous:True
batch: (2048, 128)  fprop in 2.13239097595  bprop in 9.75532984734  optimize in 0.0100581645966
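(For reference, per-batch numbers like the ones above can come from a simple wall-clock wrapper around the three phases; the fprop_fn, bprop_fn and optimize_fn callables below are hypothetical stand-ins for whatever the training loop actually calls, not neon API.)

import time

def time_batch(fprop_fn, bprop_fn, optimize_fn):
    # Wall-clock each phase of one minibatch; the three callables are
    # hypothetical stand-ins for the corresponding calls in the training loop.
    t0 = time.time()
    fprop_fn()
    t1 = time.time()
    bprop_fn()
    t2 = time.time()
    optimize_fn()
    t3 = time.time()
    print("fprop in %s bprop in %s optimize in %s" % (t1 - t0, t2 - t1, t3 - t2))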

scott-gray commented 8 years ago

Here's an example of some much faster CPU conv code. It eliminates the fancy indexing, which requires a copy of the tensor to be made prior to the dot; this code does the dots in place. We'll update all the conv/pooling operations with this code at some point soon, but you may want to take a crack at it sooner. The new elementwise/autodiff code might also be making extra tensor copies for various operations; we'll investigate that as well.

import numpy as np


def fconv_slice(q, S, X, padding, strides):
    # For output position q, return the slice of filter taps (over S) and the
    # matching slice of input positions (over X) that actually overlap the
    # input, accounting for padding and stride.
    qs = q * strides - padding
    firstF = None
    for s in range(S):
        x = qs + s
        if x >= 0 and x < X:
            if firstF is None:
                firstF = s
                firstI = x
            lastF = s
            lastI = x
    return (slice(firstF, lastF + 1), slice(firstI, lastI + 1))


def fprop_direct(I, F, O, padding, strides):
    # Direct convolution in the CHWN / CRSK / KPQN layouts implied by the shape
    # unpacking below: for each output location (p, q), slice out the
    # overlapping filter and input windows and compute the result with a single
    # dot, avoiding the copies that fancy indexing would make.
    C, Y, X, N = I.shape
    C, R, S, K = F.shape
    K, P, Q, N = O.shape

    qSlice = [fconv_slice(q, S, X, padding, strides) for q in range(Q)]

    for p in range(P):
        sliceR, sliceY = fconv_slice(p, R, Y, padding, strides)

        for q in range(Q):
            sliceS, sliceX = qSlice[q]

            slicedF = F[:, sliceR, sliceS, :].reshape((-1, K))
            slicedI = I[:, sliceY, sliceX, :].reshape((-1, N))

            O[:, p, q, :] = np.dot(slicedF.T, slicedI)
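For anyone who wants to try the snippet on its own, a minimal standalone check might look like this (not from the thread; it assumes the two functions above are defined, and the shapes, padding of 0 and stride of 1 are arbitrary choices for illustration):

import numpy as np

# Arbitrary example shapes: C channels, Y x X input, R x S filter,
# K output feature maps, N images in the batch.
C, Y, X, N = 3, 8, 8, 4
R, S, K = 3, 3, 16
padding, strides = 0, 1
P = (Y + 2 * padding - R) // strides + 1  # output height
Q = (X + 2 * padding - S) // strides + 1  # output width

I = np.random.rand(C, Y, X, N).astype(np.float32)
F = np.random.rand(C, R, S, K).astype(np.float32)
O = np.empty((K, P, Q, N), dtype=np.float32)

fprop_direct(I, F, O, padding, strides)
print(O.shape)  # expect (16, 6, 6, 4)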
fariasfc commented 8 years ago

Thank you for the code, @scott-gray.

I started with neon two days ago and am not yet aware of the right places to modify, so I will wait. Do you know how soon the change will land? Days, weeks, or months?

If I may contribute a little: I think the bprop phase should also be reviewed, since it takes about 10 s per batch. I hope the problems are related.

Thanks!

yxlao commented 8 years ago

Hi @felipefariax, it might also be helpful to double-check that numpy's BLAS/ATLAS is properly configured, since virtualenv sometimes does not pick up these libraries:

import numpy as np
np.__config__.show()
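(Not from the thread, but as an additional rough sanity check you can time one large float32 matrix multiply; with a properly linked MKL or OpenBLAS this should finish in a small fraction of a second on a modern multi-core CPU.)

import time
import numpy as np

a = np.random.rand(2048, 2048).astype(np.float32)
t0 = time.time()
np.dot(a, a)  # roughly 17 GFLOPs of work; fast only if a tuned BLAS is linked
print("2048x2048 float32 matmul took %.3fs" % (time.time() - t0))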
fariasfc commented 8 years ago

@linuxthink, I think that's ok...

Here's the output:

lapack_opt_info:
    libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/home/fcf/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/home/fcf/anaconda/include']
blas_opt_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/home/fcf/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/home/fcf/anaconda/include']
openblas_lapack_info:
    NOT AVAILABLE
lapack_mkl_info:
    libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/home/fcf/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/home/fcf/anaconda/include']
blas_mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/home/fcf/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/home/fcf/anaconda/include']
mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/home/fcf/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/home/fcf/anaconda/include']
scttl commented 8 years ago

Hi,

I tried comparing a couple of equivalent convolutional nets on our CPU backend (running on a Late 2013 MacBook Pro with a 2.3 GHz i7 processor), and I also see a slowdown with the newer release, but nothing quite as dramatic as what you report:

v1.1.1 (examples/cifar10_conv.py adjusted to turn off batch norm):

(.venv)scottl@scottnrvlap:~/repo/neon> ./examples/cifar10_conv.py -b cpu
Epoch 0   [Train |████████████████████|  391/391  batches, 1.56 cost, 246.31s]
Epoch 1   [Train |████████████████████|  391/391  batches, 1.33 cost, 243.92s]
Epoch 2   [Train |████████████████████|  390/390  batches, 1.25 cost, 237.78s]

v0.9.0 (cifar10-small.yaml adjusted to use uniform init, and the full dataset instead of the 10% sample that comes by default):

(v0.9.0)scottl@scottnrvlap:~/repo/neon> neon cifar10-small.yaml 
WARNING:neon.util.persist:deserializing object from:  cifar10-small.yaml
2015-11-11 15:43:52,371 WARNING:neon - setting log level to: 20
2015-11-11 15:43:52,371 INFO:cpu - Seeding random number generator with: None
2015-11-11 15:43:52,375 INFO:__init__ - CPU backend, RNG seed: None, numerr: None
2015-11-11 15:43:52,378 INFO:mlp - Layers:
    DataLayer d0: 3 x (32 x 32) nodes
    ConvLayer layer1: 3 x (32 x 32) inputs, 16 x (28 x 28) nodes, Linear act_fn
    PoolingLayer layer2: 16 x (28 x 28) inputs, 16 x (14 x 14) nodes, Linear act_fn
    ConvLayer layer4: 16 x (14 x 14) inputs, 32 x (10 x 10) nodes, Linear act_fn
    PoolingLayer layer5: 32 x (10 x 10) inputs, 32 x (5 x 5) nodes, Linear act_fn
    FCLayer layer6: 800 inputs, 500 nodes, RectLin act_fn
    FCLayer output: 500 inputs, 10 nodes, Logistic act_fn
    CostLayer cost: 10 nodes, CrossEntropy cost_fn

2015-11-11 15:43:52,416 INFO:val_init - Generating UniformValGen values of shape (75, 16)
2015-11-11 15:43:52,435 INFO:val_init - Generating UniformValGen values of shape (400, 32)
2015-11-11 15:43:52,436 INFO:val_init - Generating UniformValGen values of shape (500, 800)
2015-11-11 15:43:52,445 INFO:val_init - Generating UniformValGen values of shape (10, 500)
2015-11-11 15:43:52,447 INFO:cifar10 - loading: /Users/scottl/data/CIFAR10/cifar-10-batches-py/data_batch_1
2015-11-11 15:43:52,447 WARNING:persist - deserializing object from:  /Users/scottl/data/CIFAR10/cifar-10-batches-py/data_batch_1
2015-11-11 15:43:52,691 INFO:cifar10 - loading: /Users/scottl/data/CIFAR10/cifar-10-batches-py/data_batch_2
2015-11-11 15:43:52,691 WARNING:persist - deserializing object from:  /Users/scottl/data/CIFAR10/cifar-10-batches-py/data_batch_2
2015-11-11 15:43:52,931 INFO:cifar10 - loading: /Users/scottl/data/CIFAR10/cifar-10-batches-py/data_batch_3
2015-11-11 15:43:52,931 WARNING:persist - deserializing object from:  /Users/scottl/data/CIFAR10/cifar-10-batches-py/data_batch_3
2015-11-11 15:43:53,131 INFO:cifar10 - loading: /Users/scottl/data/CIFAR10/cifar-10-batches-py/data_batch_4
2015-11-11 15:43:53,132 WARNING:persist - deserializing object from:  /Users/scottl/data/CIFAR10/cifar-10-batches-py/data_batch_4
2015-11-11 15:43:53,326 INFO:cifar10 - loading: /Users/scottl/data/CIFAR10/cifar-10-batches-py/data_batch_5
2015-11-11 15:43:53,326 WARNING:persist - deserializing object from:  /Users/scottl/data/CIFAR10/cifar-10-batches-py/data_batch_5
2015-11-11 15:43:53,528 INFO:cifar10 - loading: /Users/scottl/data/CIFAR10/cifar-10-batches-py/test_batch
2015-11-11 15:43:53,528 WARNING:persist - deserializing object from:  /Users/scottl/data/CIFAR10/cifar-10-batches-py/test_batch
2015-11-11 15:43:53,698 WARNING:dataset - Incompatible batch size. Discarding 16 samples...
2015-11-11 15:43:53,842 WARNING:dataset - Incompatible batch size. Discarding 80 samples...
2015-11-11 15:43:54,521 WARNING:dataset - Incompatible batch size. Discarding 16 samples...
2015-11-11 15:43:54,521 WARNING:dataset - Incompatible batch size. Discarding 80 samples...
2015-11-11 15:43:54,535 INFO:mlp - commencing model fitting
2015-11-11 15:45:22,090 INFO:mlp - epoch: 1, training error: 2.67356
2015-11-11 15:46:47,932 INFO:mlp - epoch: 2, training error: 2.17991
2015-11-11 15:48:14,780 INFO:mlp - epoch: 3, training error: 1.95347

So on this architecture we're looking at a slowdown of about 2.8x going from v0.9.0 to v1.1.1. Not insignificant but certainly not the ~1,000x you were seeing.

The differences I see between this network and yours are that you are using Adadelta instead of GradientDescentMomentum, and you have non-square filters in your conv and pooling layers.

Swapping in Adadelta for the v1.1.1 cifar10_conv showed no real difference in run time:

(.venv)scottl@scottnrvlap:~/repo/neon> ./examples/cifar10_conv.py -b cpu -e 3
Epoch 0   [Train |████████████████████|  391/391  batches, 2.03 cost, 239.54s]
Epoch 1   [Train |████████████████████|  391/391  batches, 1.94 cost, 240.31s]

So then I also tried changing the first layer conv filter to (1, 5, 16) and pooling to (1, 2):

(.venv)scottl@scottnrvlap:~/repo/neon> ./examples/cifar10_conv.py -b cpu -e 3
Epoch 0   [Train |████████████████████|  391/391  batches, 2.03 cost, 486.60s]
Epoch 1   [Train |████████████████████|  391/391  batches, 1.95 cost, 491.00s]

Definite performance hit with non-square filters, but we're still only talking about a 5x difference or so from v0.9.0 (with square filters).

One final thing I noticed is that you are using the CrossEntropyMulti cost with a Logistic activation on the output. If you have more than 2 output classes, you should switch the Logistic for Softmax so that you can take advantage of shortcut derivatives. If you just have two outputs, switch CrossEntropyMulti for CrossEntropyBinary instead (a sketch of both pairings follows). I'd recommend trying that, and also running the cifar10 conv examples I described across the two versions to ensure you see similar timing differences.
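For concreteness, the suggested pairings would look something like this, reusing the init and self.nb_classes names from the model posted earlier in the thread (a sketch, not a drop-in patch):

from neon.layers import Affine, GeneralizedCost
from neon.transforms import Softmax, Logistic, CrossEntropyMulti, CrossEntropyBinary

# More than two classes: Softmax output paired with CrossEntropyMulti,
# which lets the shortcut derivative apply.
output = Affine(nout=self.nb_classes, init=init, activation=Softmax())
cost = GeneralizedCost(costfunc=CrossEntropyMulti())

# Exactly two classes: keep the Logistic output but pair it with
# CrossEntropyBinary instead.
output = Affine(nout=self.nb_classes, init=init, activation=Logistic())
cost = GeneralizedCost(costfunc=CrossEntropyBinary())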

Try that and let us know how it goes.