DeepMark / deepmark

THE Deep Learning Benchmarks
Apache License 2.0

DeepSpeech2 benchmark technical details #1

Open soumith opened 8 years ago

soumith commented 8 years ago

Hey @shubho, can you give some technical details on the DeepSpeech2 benchmark so that others can implement it to your exact spec?

Some details:

  • Exact architecture
  • Criterion
  • The synthetic dataset: sample length, dimensionality, etc.
  • Any other detail that would be important

cc: @seannaren @delta2323

shubho commented 8 years ago

Hi Soumith,

I am traveling till June 12th and will be on the internet intermittently - Erich and David can fill in the details.

Thanks

Shubho

soumith commented 8 years ago

Awesome, thanks.

ekelsen commented 8 years ago

The network specs are as follows:

{
    "connectivity": [
        "conv2d_1",
        "conv2d_2",
        "bd",
        "bd",
        "bd",
        "bd",
        "bd",
        "bd",
        "bd",
        "fc",
        "ctc"
    ],
    "layers": {
        "bd": {
            "batch_norm": true,
            "dim": 1760,
            "type": "RecurrentLinear"
        },
        "conv2d_1": {
            "batch_norm": true,
            "channels": 1,
            "context_h": 5,
            "context_w": 20,
            "filters": 32,
            "is_same_w": true,
            "stride_h": 2,
            "stride_w": 2,
            "type": "Conv2DPackage"
        },
        "conv2d_2": {
            "batch_norm": true,
            "channels": 32,
            "context_h": 5,
            "context_w": 10,
            "filters": 32,
            "is_same_w": true,
            "stride_h": 1,
            "stride_w": 2,
            "type": "Conv2DPackage"
        },
        "ctc": {
            "type": "CTCCostLinear"
        },
        "fc": {
            "batch_norm": true,
            "dim": 1760,
            "type": "FullyConnected"
        }
    }
}
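
One way to read this spec, as a sketch: the connectivity list names the layer stack (so bd appears seven times), and each name is looked up in layers. The filename ds2_spec.json here is an assumption for illustration.

import json

# Hypothetical: assumes the JSON above has been saved to ds2_spec.json
with open("ds2_spec.json") as f:
    spec = json.load(f)

for name in spec["connectivity"]:
    layer = spec["layers"][name]
    print(name, layer["type"], layer.get("dim", layer.get("filters", "")))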

The raw input is a spectrogram that is 161 x (minibatch x time).

The bd layers are bidirectional vanilla RNNs.

The CTCCostLinear layer includes a linear transform to the alphabet size followed by a softmax. In English the alphabet size is 29. The criterion is a CTC loss done in logspace.

All non-linearities are clipped ReLU units (max of 20).
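
For concreteness, here is a minimal numpy sketch of the clipped ReLU and of the output stage just described (linear transform to the alphabet size followed by a log-softmax, which would then feed a logspace CTC criterion such as warp-ctc). The shapes are hypothetical, chosen only for illustration.

import numpy as np

# Hypothetical shapes: T timesteps, N minibatch, H hidden dim, 29-char alphabet
T, N, H, ALPHABET = 100, 8, 1760, 29

def clipped_relu(x, clip=20.0):
    # all non-linearities in the network: min(max(x, 0), 20)
    return np.minimum(np.maximum(x, 0.0), clip)

rnn_out = clipped_relu(np.random.randn(T, N, H).astype(np.float32))

# linear map to the alphabet size, then a numerically stable log-softmax
W = np.random.randn(H, ALPHABET).astype(np.float32)
b = np.zeros(ALPHABET, dtype=np.float32)
logits = rnn_out.dot(W) + b

m = logits.max(axis=-1, keepdims=True)
log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True)))
# log_probs would be handed to the logspace CTC loss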

I will update this with the dataset information soon.

ekelsen commented 8 years ago

The dataset should be drawn from the following distribution:

Length (sec)   Frequency (percent)   Label Length
     1                  3.0                  7
     2                 10.0                 17
     3                 11.0                 35
     4                 13.0                 48
     5                 14.0                 62
     6                 13.0                 78
     7                  9.0                 93
     8                  8.0                107
     9                  5.0                120
    10                  4.0                134
    11                  3.0                148
    12                  2.0                163
    13                  2.0                178
    14                  2.0                193
    15                  1.0                209

Each second corresponds to 100 input timesteps, as we use a 10ms step.
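
To make the distribution concrete, here is a small numpy sketch that samples utterance lengths from the table above and converts them to input timesteps (the sample size of 8 is arbitrary; the official generator below enumerates the distribution deterministically instead of sampling):

import numpy as np

lengths_sec = np.arange(1, 16)
freqs = np.array([3, 10, 11, 13, 14, 13, 9, 8, 5, 4, 3, 2, 2, 2, 1]) / 100.0

sampled = np.random.choice(lengths_sec, size=8, p=freqs)
timesteps = sampled * 100   # 10ms step -> 100 spectrogram frames per second
print(sampled, timesteps)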

SeanNaren commented 8 years ago

@ekelsen thanks for the specs! Could we get some information on how you chose the dataset specification?

ekelsen commented 8 years ago

It is similar to the distribution of one of our training sets.

nervetumer commented 8 years ago

What is the proper procedure for this benchmark? Are we to generate benchmarks for different input data lengths (1s, 2s, 3s, ..., 15s) and then take a weighted average of the runtimes using the distribution above?

shubho commented 8 years ago

One could generate a training sample of different data lengths using that distribution, form minibatches so that each minibatch has utterances of equal length (which gets around the zero-padding problem), and go from there. Minibatches should be as large as possible, but anything above 128 per GPU will either hit the memory limits of the GPUs first or be unusable in practice due to convergence issues (assuming multi-GPU training with 8 GPUs).
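
A minimal sketch of that bucketing scheme (the utterance tuple layout and the default minibatch size here are assumptions, not part of the spec):

from collections import defaultdict

def make_minibatches(utterances, mb_size=128):
    # group utterances by length so each minibatch holds equal-length
    # sequences and no zero padding is needed
    buckets = defaultdict(list)
    for utt in utterances:            # utt = (length, spectrogram, labels)
        buckets[utt[0]].append(utt)
    for length, utts in sorted(buckets.items()):
        for i in range(0, len(utts), mb_size):
            yield length, utts[i:i + mb_size]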

Shubho

shubho commented 8 years ago

Just wanted to clarify that the benchmark can't test convergence at all - so maybe the minibatch should be as large as will fit in GPU memory.

nervetumer commented 8 years ago

I agree we could do that, but then everyone will be benchmarking a different dataset. It may not matter much over a large-dataset epoch, but it seems like we should try to minimize the differences between all the benchmarks. So if we go this route, maybe we should have a small Python script here, with a fixed random seed and a platform-independent random number generator, that generates the sequence lengths? Or we should choose a publicly available dataset instead of using statistics from a private one.

shubho commented 8 years ago

I think we should have a script with a fixed seed to nail down the dataset.

Shubho

ekelsen commented 8 years ago

I think that having to use real data for performance benchmarks is somewhat annoying and best avoided if possible.

@shubho will provide a Python script to generate the dataset (input spectrograms and labels) so that the expected behavior is clear. I really don't think the exact floating-point numbers and labels that are chosen will have any impact on the performance (we can test by changing the seed in Python), so if it is easier to generate the dataset in a different language, that should be fine as long as the distribution is the same.

The minibatch is generally chosen to be the largest possible for the longest sequence length, as we tend to keep the minibatch size constant during optimization. (There is work on variable minibatches, but that isn't common, so I don't think it makes sense for the benchmark.)

There are some odd performance cliffs when using cuBLAS, like going from a minibatch of 8 to 9, and a minibatch of 96 significantly underperforms a minibatch of 64, so choosing the minibatch that is fastest overall might require some tuning. The Nervana kernels mostly don't have these problems.

I doubt any of the frameworks will be able to exceed a global minibatch of 1024, even on 8 GPUs. It is true that in practice we notice degraded optimization performance beyond this minibatch size, but for benchmarking purposes I don't think we need to worry about that.

ekelsen commented 8 years ago

The following script should be a reasonable generator of random data for this benchmark. The distribution of utterance lengths is fixed and does not depend on a random number generator, and generation itself should be fast enough not to affect overall benchmark timing.

If the chosen minibatch size is not a power of two, the last minibatch of a given utterance length may be smaller than usual (the per-length utterance counts are multiples of 1280, so any power of two up to 256 divides them evenly). This is not exactly the same behavior as a real training system, where we would lump together sequences of different lengths. If people would prefer that behavior, let us know.

import numpy as np

class DataGenerator:

    """Generates DS2 test data for DeepMark benchmark.

       Returns utterance length in number of 10ms slices. So utt_length
       is set to 1000 for a 10s utterance.

       Returns spectrogram filled with random input. This is a
       two-dimensional Numpy array with dimensions
       161 x (utt_length * mb_size) where mb_size is the user supplied
       minibatch size.

       If mb_size is not a multiple of two, then the last minibatch
       for a particular utt_length may be less than mb_size.

       Returns label data filled with random input. This is a
       one-dimensional Numpy array with dimensions
       label length corresponding to the utterance length.

    """

    ### Set up initial state
    # Utterance lengths are in number of non-overlapping 10ms slices
    _utt_lengths = [100, 200, 300, 400, 500, 600, 700,
                    800, 900, 1000, 1100, 1200, 1300, 1400, 1500]
    _counts = [3, 10, 11, 13, 14, 13, 9,
               8, 5, 4, 3, 2, 2, 2, 1]
    _label_lengths = [7, 17, 35, 48, 62, 78, 93, 107,
                      120, 134, 148, 163, 178, 193, 209]
    _freq_bins = 161

    # 29 characters in the English dataset - all equally likely to be
    # selected for now
    _prob_chars = [1 / 29.] * 29
    _chars = range(29)

    # minimum number of utterances to generate for a count of 1
    _scale_factor = 10 * 128

    # extra space to allow for different minibatch data even though
    # we only generate one set of random numbers for speed
    _extra = 1000

    def __init__(self, minibatch_size):
        self._current = 0
        self._mb_size = minibatch_size

        # Total number of utterances to generate at each length
        self._utt_counts = [self._scale_factor * x for x in self._counts]

        # only generate random data once so that the data generation
        # is as fast as possible and doesn't interfere with benchmark
        # timing
        self._randomness = np.random.randn(self._freq_bins,
                                           minibatch_size *
                                           (self._utt_lengths[-1]) +
                                           self._extra
                                           ).astype(np.float32)

    def __iter__(self):
        return self

    def next(self):
        if self._current >= len(self._utt_counts):
            raise StopIteration
        else:
            # Generate an utterance length
            if (self._utt_counts[self._current] > self._mb_size):
                mb_size = self._mb_size
                self._utt_counts[self._current] -= self._mb_size
                inc = 0
            else:
                mb_size = self._utt_counts[self._current]
                self._utt_counts[self._current] = 0
                inc = 1

            utt_length = self._utt_lengths[self._current]

            # Create random label data
            label_length = self._label_lengths[self._current]

            start = np.random.randint(0, self._extra +
                                         self._mb_size *
                                             (self._utt_lengths[-1] -
                                              self._utt_lengths[self._current])
                                     )
            end = start + utt_length * mb_size

            self._current += inc

            # probabilities must be passed via the keyword `p`; the third
            # positional argument of np.random.choice is `replace`
            return utt_length, \
                   self._randomness[:, start:end], \
                   np.random.choice(self._chars, label_length,
                                    p=self._prob_chars)

    # Python 3 iterator protocol (the original targeted Python 2's next())
    __next__ = next
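
A minimal usage sketch of the generator above (the minibatch size of 32 is arbitrary; with it, every per-length utterance count divides evenly):

gen = DataGenerator(32)
n_minibatches = 0
for utt_length, spectrogram, labels in gen:
    n_minibatches += 1   # spectrogram: 161 x (utt_length * 32), labels: 1-D
print(n_minibatches)     # 128000 utterances / 32 per minibatch = 4000
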
SeanNaren commented 8 years ago

Sounds great, thanks @ekelsen! Not sure what would be a better fit for the Torch benchmark: should I use a library to access the above Python code from Lua, or rewrite the class in Lua? I personally prefer to rewrite, but whatever is more appropriate!

shubho commented 8 years ago

I feel rewriting is fine - the important parts are the distribution, the total number of samples, and the way they are divided into minibatches.

Shubho

SeanNaren commented 8 years ago

@shubho thanks, I'm a bit confused as to how the generator is to be used. Do the steps below cover what the benchmark using the generator is supposed to be?

  1. generator:next()
  2. Forward pass, record forward time
  3. Backward pass, record backward time
  4. loop from 1. until iterator finished
  5. Average for each loop, sum times

shubho commented 8 years ago

Yeah, and you can choose the appropriate minibatch that gives you the fastest time.

Shubho

SeanNaren commented 8 years ago

Awesome, so just to summarise:

The benchmark is the average forward/backward/forward+backward time taken to run through the entire synthetic dataset using the DS2 architecture (with the fastest batch size).

Steps are:

  1. generator:next()
  2. Forward pass, record forward time
  3. Backward pass, record backward time
  4. loop from 1. until iterator finished
  5. Report average forward time / backward time / forward+backward time

Sorry if this is obvious, just trying to nail down the details :)
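
A hypothetical timing harness for these steps; the Model stub below is a placeholder for the real DS2 network and criterion, not part of the spec, and a real GPU run would also synchronize (e.g. cutorch.synchronize() in Torch) before reading the clock:

import time

class Model(object):
    # stand-in for the DS2 network; replace with the framework's model
    def forward(self, spectrogram):
        return spectrogram
    def backward(self, output, labels):
        pass

model = Model()
fwd_times, bwd_times = [], []
for utt_length, spectrogram, labels in DataGenerator(32):
    t0 = time.time()
    output = model.forward(spectrogram)
    fwd_times.append(time.time() - t0)

    t0 = time.time()
    model.backward(output, labels)
    bwd_times.append(time.time() - t0)

print(sum(fwd_times) / len(fwd_times),                     # average forward
      sum(bwd_times) / len(bwd_times),                     # average backward
      (sum(fwd_times) + sum(bwd_times)) / len(fwd_times))  # average fwd+bwd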

shubho commented 8 years ago

I usually report the total time for forward and backward prop, the number of minibatches, and also the average, but I haven't checked DeepMark's requirements. What is reported should be consistent across all networks.

Shubho

SeanNaren commented 8 years ago

@shubho, I was going off the convnet benchmark structure of reporting average forward/backward/forward+backward times, but if you think the total is a better measurement, that could be an alternative. @soumith, what would you suggest?

shubho commented 8 years ago

Consistency is more important than what I suggested.

Shubho

SeanNaren commented 7 years ago

Interested to see how the other DS2 benchmarks are progressing - any news, guys? cc @shubho @ekelsen @soumith @nervetumer

shubho commented 7 years ago

I am planning to get to the internal one this weekend.

Shubho

SeanNaren commented 7 years ago

Shall we start benchmarking numbers? Just to confirm, are we measuring the time it takes to run through the entire dataset iterator?

soumith commented 7 years ago

@SeanNaren yea benchmarking through the dataset iterator sounds right. Time to benchmark!

SeanNaren commented 7 years ago

Some preliminary results to get the ball rolling: I benchmarked 1x Titan and 4x Titan setups of the Torch implementation (shout-out to @digitalreasoning for letting me use their servers!) with 5 epochs of the dataset:

Hardware    Time (ms)   Forward (ms)   Backward (ms)   Samples processed   Samples/second   Audio seconds/second   Epoch time (s)
1x Titans      154          83              72              128000               32                 189                 4013
4x Titans      180          99              81              128000              106                 632                 1204

I could only fit a batch size of 32 per GPU in memory for the epoch.

pooyadavoodi commented 7 years ago

I suggest adding a column for multi-GPU scaling to the final presentation.

ngimel commented 7 years ago

@SeanNaren, any particular reason you are not using cudnn.BatchNormalization in BatchBRNN.lua, and are using nn.BatchNormalization instead? Thanks for your work on this!

seed93 commented 7 years ago

@ngimel cudnn.BatchNormalization only supports batch sizes < 1024 in inference mode. I think that is the reason.

SeanNaren commented 7 years ago

@ngimel What @seed93 said :) Thanks for the changes to the benchmark, I'll find time to re-run these!

EDIT: though if you manage to find a way we could use cuDNN batch norm, please let me know (we could add cudnn.BatchNorm since we are only training; I'd welcome opinions on this)!

ngimel commented 7 years ago

@SeanNaren, there are a few ways:

  1. You can use cudnn batchnorm for training and switch to nn for inference, since the modules are compatible.
  2. The cudnn bindings can be modified to call cudnn batchnorm a few times with smaller batch sizes at inference - batch norm at inference time is a pointwise operation, so the results should not be affected.
  3. I'll also check whether this limitation can be removed in future cudnn versions.

We've seen speedups using cudnn.BatchNorm instead of nn.BatchNorm, so I see no reason not to use it for the training benchmark.

SeanNaren commented 7 years ago

@ngimel sounds fair, I'll modify the benchmark to use cuDNN BatchNorm! I don't want to spam this issue, but I want to address a few things with the Torch implementation, so I'll open a separate issue.