Open soumith opened 8 years ago
Hi Soumith,
I am traveling until June 12th and will only be online
intermittently - Erich and David can fill in the details.
Thanks
Shubho
On Friday, June 3, 2016, Soumith Chintala notifications@github.com wrote:
Hey @shubho, can you give some technical details on the DeepSpeech2 benchmark so that others can implement it to your exact spec?
Some details:
- Exact architecture
- Criterion
- The synthetic dataset: sample length, dimensionality, etc.
- Any other detail that would be important
cc: @seannaren @delta2323
awesome thanks.
The network specs are as follows:
{
  "connectivity": [
    "conv2d_1",
    "conv2d_2",
    "bd",
    "bd",
    "bd",
    "bd",
    "bd",
    "bd",
    "bd",
    "fc",
    "ctc"
  ],
  "layers": {
    "bd": {
      "batch_norm": true,
      "dim": 1760,
      "type": "RecurrentLinear"
    },
    "conv2d_1": {
      "batch_norm": true,
      "channels": 1,
      "context_h": 5,
      "context_w": 20,
      "filters": 32,
      "is_same_w": true,
      "stride_h": 2,
      "stride_w": 2,
      "type": "Conv2DPackage"
    },
    "conv2d_2": {
      "batch_norm": true,
      "channels": 32,
      "context_h": 5,
      "context_w": 10,
      "filters": 32,
      "is_same_w": true,
      "stride_h": 1,
      "stride_w": 2,
      "type": "Conv2DPackage"
    },
    "ctc": {
      "type": "CTCCostLinear"
    },
    "fc": {
      "batch_norm": true,
      "dim": 1760,
      "type": "FullyConnected"
    }
  }
}
The raw input is a spectrogram that is 161 x (minibatch x time).
The bd layers are bidirectional vanilla RNNs.
The CTCCostLinear layer includes a linear transform to the alphabet size followed by a softmax. In English the alphabet size is 29. The criterion is a CTC loss done in logspace.
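For anyone re-implementing the output layer, here is a minimal NumPy sketch of the linear transform to the alphabet size followed by a softmax computed in log space. The weight initialisation, the `(time, minibatch, features)` layout and the function name are assumptions made for illustration; the CTC loss itself is not shown.

```python
import numpy as np

ALPHABET_SIZE = 29   # English alphabet size quoted above
HIDDEN_DIM = 1760    # width of the "fc" layer in the spec


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


rng = np.random.RandomState(0)
# Hypothetical parameters; a real implementation would learn these.
weights = (rng.randn(HIDDEN_DIM, ALPHABET_SIZE) * 0.01).astype(np.float32)
bias = np.zeros(ALPHABET_SIZE, dtype=np.float32)

# Assumed layout: (time, minibatch, features).
activations = rng.randn(100, 4, HIDDEN_DIM).astype(np.float32)
log_probs = log_softmax(activations.dot(weights) + bias)
```

The resulting `log_probs` would then be fed to whatever log-space CTC implementation the framework provides.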
All non-linearities are clipped ReLU units (max of 20).
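As a quick illustration, a clipped ReLU with a maximum of 20 can be sketched in NumPy as below. The function name and the `ceiling` parameter are illustrative, not taken from any framework.

```python
import numpy as np


def clipped_relu(x, ceiling=20.0):
    """min(max(x, 0), ceiling), applied element-wise."""
    return np.minimum(np.maximum(x, 0.0), ceiling)


example = clipped_relu(np.array([-5.0, 3.0, 25.0]))
```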
I will update this with the dataset information soon.
The dataset should be drawn from the following distribution:
Length (sec) | Frequency (percent) | Label Length |
---|---|---|
1 | 3.0 | 7 |
2 | 10.0 | 17 |
3 | 11.0 | 35 |
4 | 13.0 | 48 |
5 | 14.0 | 62 |
6 | 13.0 | 78 |
7 | 9.0 | 93 |
8 | 8.0 | 107 |
9 | 5.0 | 120 |
10 | 4.0 | 134 |
11 | 3.0 | 148 |
12 | 2.0 | 163 |
13 | 2.0 | 178 |
14 | 2.0 | 193 |
15 | 1.0 | 209 |
Each second corresponds to 100 input timesteps as we use a 10ms step.
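To make the table easier to sanity-check, the sketch below derives a few summary numbers from it: the frequencies sum to 100 percent, and the weighted means follow directly. The variable names are illustrative only.

```python
LENGTH_SEC = list(range(1, 16))
FREQ_PERCENT = [3, 10, 11, 13, 14, 13, 9, 8, 5, 4, 3, 2, 2, 2, 1]
LABEL_LENGTH = [7, 17, 35, 48, 62, 78, 93, 107,
                120, 134, 148, 163, 178, 193, 209]
STEPS_PER_SEC = 100  # 10 ms step

total_percent = sum(FREQ_PERCENT)

# Frequency-weighted mean utterance length, in seconds and in timesteps.
mean_sec = sum(s * f for s, f in zip(LENGTH_SEC, FREQ_PERCENT)) / float(total_percent)
mean_steps = mean_sec * STEPS_PER_SEC

# Frequency-weighted mean label length.
mean_label = sum(l * f for l, f in zip(LABEL_LENGTH, FREQ_PERCENT)) / float(total_percent)

print(total_percent, mean_sec, mean_steps, mean_label)
```

With these numbers the average utterance is 5.94 seconds (594 timesteps) with an average label length of about 76 characters.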
@ekelsen thanks for the specs! Could we get some information on how you chose the dataset specification?
It is similar to the distribution of one of our training sets.
What is the proper procedure for this benchmark? Are we to generate benchmarks for different input data lengths (1s, 2s, 3s, ..., 15s) and then take a weighted average of the runtimes using the distribution above?
One could generate a training set of different utterance lengths using that distribution, form minibatches so that each minibatch has utterances of equal length (that gets around the zero-padding problem), and go from there. Minibatches should be as large as possible, but anything above 128 per GPU will either hit GPU memory limits first or be unusable in practice (assuming multi-GPU training with 8 GPUs) due to convergence issues.
Shubho
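A rough sketch of the equal-length minibatching idea described above, assuming utterances are represented only by their length in timesteps. The function name and the `(length, size)` output format are hypothetical.

```python
from collections import Counter


def bucket_into_minibatches(utt_lengths, mb_size):
    """Group utterances of identical length into minibatches.

    Returns a list of (utterance_length, minibatch_size) pairs, so no
    minibatch ever needs zero padding.
    """
    counts = Counter(utt_lengths)
    batches = []
    for length in sorted(counts):
        remaining = counts[length]
        while remaining > 0:
            size = min(mb_size, remaining)
            batches.append((length, size))
            remaining -= size
    return batches


sample = [100] * 5 + [200] * 3 + [300] * 4
batches = bucket_into_minibatches(sample, 4)
```

Here the five 100-step utterances split into one full minibatch of 4 and one leftover minibatch of 1.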
Just wanted to clarify that the benchmark can't test convergence at all - so maybe the minibatch should simply be as large as will fit in GPU memory.
I agree we could do that, but then everyone will be benchmarking a different data set. It may not matter much for a large data set epoch, but it seems like we should try to minimize the differences between all the benchmarks. So if we go this route, maybe we should have a small Python script here, with a fixed seed and a platform-independent random number generator, that generates the sequence lengths? Or we should choose a publicly available dataset instead of using statistics from a private dataset.
I think we should have a script with a fixed seed to nail down the dataset.
Shubho
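One way the fixed-seed idea could look, as a sketch only: it assumes NumPy's legacy `RandomState` generator and an arbitrary seed of 1234, and whether the stream is bit-identical on every platform and NumPy version is something each implementer would need to verify.

```python
import numpy as np

UTT_LENGTHS = [100 * s for s in range(1, 16)]  # timesteps for 1s .. 15s
FREQ_PERCENT = [3, 10, 11, 13, 14, 13, 9, 8, 5, 4, 3, 2, 2, 2, 1]


def sample_utterance_lengths(n_utts, seed=1234):
    """Draw n_utts utterance lengths from the distribution with a fixed seed."""
    rng = np.random.RandomState(seed)
    probs = np.array(FREQ_PERCENT, dtype=np.float64) / sum(FREQ_PERCENT)
    return rng.choice(UTT_LENGTHS, size=n_utts, p=probs)


lengths_a = sample_utterance_lengths(1000)
lengths_b = sample_utterance_lengths(1000)  # same seed, same sequence
```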
I think that having to use real data for performance benchmarks is somewhat annoying and best avoided if possible.
@shubho will provide a Python script to generate the dataset (input spectrograms and labels) so that the expected behavior is clear. I really don't think the exact floating point numbers and labels that are chosen will have any impact on the performance (we can test by changing the seed in Python), so if it is easier to generate the dataset in a different language that should be fine as long as the distribution is the same.
The minibatch is generally chosen to be the largest possible for the longest sequence length as we tend to keep the minibatch size constant during optimization. (There is work on variable mini-batches, but that isn't common, so I don't think that makes sense for the benchmark).
There are some odd performance cliffs when using cuBLAS, like going from a minibatch of 8 -> 9, and a minibatch of 96 significantly underperforms a minibatch of 64, so choosing the minibatch that is fastest overall might require some tuning. The Nervana kernels mostly don't have these problems.
I doubt any of the frameworks will be able to exceed a global mini-batch of 1024, even on 8 GPUs. But it is true that in practice we notice degraded optimization performance beyond this mini-batch size. For benchmarking purposes I don't think we need to worry about that though.
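Since the fastest minibatch size may need tuning, a small sweep along the following lines could help. This is only a sketch: `dummy_step` is a hypothetical CPU stand-in for one forward+backward pass, and on a GPU the step would have to synchronize the device before the timer is read, because kernel launches are asynchronous.

```python
import time
import numpy as np


def time_minibatch(run_step, mb_size, repeats=3):
    """Best-of-`repeats` throughput (utterances per second) for one minibatch size."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_step(mb_size)
        best = min(best, time.perf_counter() - start)
    return mb_size / max(best, 1e-9)  # guard against a zero-length interval


def dummy_step(mb_size):
    # Hypothetical stand-in for a real forward+backward pass.
    x = np.ones((mb_size, 256), dtype=np.float32)
    w = np.ones((256, 256), dtype=np.float32)
    return x.dot(w)


candidates = [8, 9, 64, 96, 128]
throughput = {mb: time_minibatch(dummy_step, mb) for mb in candidates}
fastest = max(throughput, key=throughput.get)
```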
The following script should be a reasonable generator for random data for this benchmark. The distribution of utterance lengths is fixed and does not depend on a random number generator and the generation itself should be quite fast and not affect overall benchmark timing.
If the chosen minibatch size does not evenly divide the number of utterances of a given length (always a multiple of 1280 in this script, so powers of two up to 256 are safe), then the last minibatch of that utterance length will be smaller than usual. This is not exactly the same behavior as a real training system where we would lump together different length sequences. If people would prefer that behavior, let us know.
import numpy as np


class DataGenerator(object):
    """Generates DS2 test data for DeepMark benchmark.

    Returns utterance length in number of 10ms slices. So utt_length
    is set to 1000 for a 10s utterance.

    Returns spectrogram filled with random input. This is a
    two-dimensional Numpy array with dimensions
    161 x (utt_length * mb_size) where mb_size is the user supplied
    minibatch size.

    If mb_size does not evenly divide the number of utterances of a
    particular utt_length, then the last minibatch for that
    utt_length will be smaller than mb_size.

    Returns label data filled with random input. This is a
    one-dimensional Numpy array with dimensions
    label length corresponding to the utterance length.
    """

    ### Set up initial state
    # Utterance lengths are in number of non-overlapping 10ms slices
    _utt_lengths = [100, 200, 300, 400, 500, 600, 700,
                    800, 900, 1000, 1100, 1200, 1300, 1400, 1500]
    _counts = [3, 10, 11, 13, 14, 13, 9,
               8, 5, 4, 3, 2, 2, 2, 1]
    _label_lengths = [7, 17, 35, 48, 62, 78, 93, 107,
                      120, 134, 148, 163, 178, 193, 209]
    _freq_bins = 161
    # 29 characters in english dataset - all equally likely to be
    # selected for now
    _prob_chars = [1 / 29.] * 29
    _chars = list(range(29))
    # minimum number of utterances to generate for a count of 1
    _scale_factor = 10 * 128
    # extra space to allow for different minibatch data even though
    # we only generate one set of random numbers for speed
    _extra = 1000

    def __init__(self, minibatch_size):
        self._current = 0
        self._mb_size = minibatch_size
        # Generate all the utterance counts that we need
        self._utt_counts = [self._scale_factor * x for x in self._counts]
        # only generate random data once so that the data generation
        # is as fast as possible and doesn't interfere with benchmark
        # timing
        self._randomness = np.random.randn(
            self._freq_bins,
            minibatch_size * self._utt_lengths[-1] + self._extra
        ).astype(np.float32)

    def __iter__(self):
        return self

    def __next__(self):
        if self._current >= len(self._utt_counts):
            raise StopIteration
        # Take a full minibatch if more than one minibatch of this
        # utterance length remains; otherwise take what is left and
        # move on to the next utterance length.
        if self._utt_counts[self._current] > self._mb_size:
            mb_size = self._mb_size
            self._utt_counts[self._current] -= self._mb_size
            inc = 0
        else:
            mb_size = self._utt_counts[self._current]
            self._utt_counts[self._current] = 0
            inc = 1
        utt_length = self._utt_lengths[self._current]
        # Create random label data
        label_length = self._label_lengths[self._current]
        # Pick a random window into the pre-generated random data
        start = np.random.randint(0, self._extra +
                                  self._mb_size *
                                  (self._utt_lengths[-1] -
                                   self._utt_lengths[self._current]))
        end = start + utt_length * mb_size
        self._current += inc
        # p must be passed by keyword: the third positional argument
        # of np.random.choice is `replace`, not the probabilities.
        return (utt_length,
                self._randomness[:, start:end],
                np.random.choice(self._chars, label_length,
                                 p=self._prob_chars))

    # Python 2 iterator protocol
    next = __next__
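To see how the choice of minibatch size plays out with the generator above, the sketch below reproduces only its counting logic: each utterance length has `count * 1280` utterances, split into minibatches with a smaller final one when the size does not divide evenly. The helper names are illustrative and are not part of the script.

```python
COUNTS = [3, 10, 11, 13, 14, 13, 9, 8, 5, 4, 3, 2, 2, 2, 1]
SCALE_FACTOR = 10 * 128  # utterances generated for a count of 1


def utterances_per_epoch():
    return SCALE_FACTOR * sum(COUNTS)


def minibatches_per_epoch(mb_size):
    """Number of next() calls before the generator is exhausted."""
    # -(-a // b) is integer ceiling division.
    return sum(-(-(c * SCALE_FACTOR) // mb_size) for c in COUNTS)


def short_minibatches(mb_size):
    """How many utterance lengths end with a smaller-than-usual minibatch."""
    return sum(1 for c in COUNTS if (c * SCALE_FACTOR) % mb_size != 0)


print(utterances_per_epoch())      # 128000
print(minibatches_per_epoch(32))   # 4000
print(short_minibatches(64))       # 0
print(short_minibatches(96))       # 12
```

So one pass over the generator covers 128,000 utterances, and a minibatch of 96 leaves a short final minibatch for 12 of the 15 utterance lengths.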
Sounds great, thanks @ekelsen! Not sure what would be more fit for the torch benchmark; should I use a library to access the above python code in lua, or rewrite the class in lua? I personally prefer to rewrite, but whatever is more appropriate!
I feel rewriting is fine - the important parts are the distribution, total number of samples and the way they are divided into minibatches.
Shubho
@shubho thanks, bit confused as to how the generator is to be used. Do the below steps cover what the benchmark using the generator is supposed to be?
1. generator:next()
2. Forward pass, record forward time
3. Backward pass, record backward time
4. Loop from 1. until iterator finished
5. Sum times
Yeah and you can choose the appropriate minibatch that gives you the fastest time.
Shubho
Awesome so just to summarise:
Benchmark is the average forward/backward/forward+backward time taken to run through the entire synthetic dataset using the DS2 architecture (using fastest batch size).
Steps are:
1. generator:next()
2. Forward pass, record forward time
3. Backward pass, record backward time
4. Loop from 1. until iterator finished
5. Report average forward time/backward time/forward+backward time
Sorry if this is obvious, just trying to nail the details :)
I usually report total time for forward and back prop, number of minibatches and also average but I haven't checked DeepMark's requirements. What is reported should be consistent across all networks.
Shubho
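The steps discussed above could be sketched roughly as follows. `forward` and `backward` are hypothetical callables standing in for the framework's passes, and the report includes both the totals and the per-minibatch averages mentioned in the thread. On a GPU, each pass would need to synchronize the device before the timer is read.

```python
import time


def run_benchmark(batches, forward, backward):
    """Time forward and backward passes over every minibatch."""
    fwd_total = 0.0
    bwd_total = 0.0
    n = 0
    for batch in batches:
        t0 = time.perf_counter()
        out = forward(batch)
        t1 = time.perf_counter()
        backward(out)
        t2 = time.perf_counter()
        fwd_total += t1 - t0
        bwd_total += t2 - t1
        n += 1
    denom = max(n, 1)  # avoid dividing by zero on an empty iterator
    return {
        "minibatches": n,
        "forward_total": fwd_total,
        "backward_total": bwd_total,
        "total": fwd_total + bwd_total,
        "forward_avg": fwd_total / denom,
        "backward_avg": bwd_total / denom,
        "avg": (fwd_total + bwd_total) / denom,
    }


# Hypothetical stand-ins so the sketch runs on its own.
demo_batches = [(100, 4), (200, 4), (300, 2)]
report = run_benchmark(demo_batches,
                       forward=lambda b: b[0] * b[1],
                       backward=lambda out: None)
```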
@shubho, I was going off the convnet benchmark structure of forward/backward/forward+backward averages, but if you think the total is a better measurement, that could be an alternative. @soumith, what would you suggest?
Consistency is more important than what I suggested.
Shubho
Interested to see how the other DS2 benchmarks are progressing, any news guys? cc @shubho @ekelsen @soumith @nervetumer
I am planning to get to the internal one this weekend.
Shubho
Shall we start benchmarking numbers? Just to confirm we are measuring the time it takes to run through the entire dataset iterator?
@SeanNaren yea benchmarking through the dataset iterator sounds right. Time to benchmark!
Some preliminary results to get the ball rolling: I benchmarked a 1x Titan and a 4x Titan setup of the Torch implementation (shoutout to @digitalreasoning for letting me use their servers!) with 5 epochs of the dataset:
Hardware | Time (ms) | forward (ms) | backward (ms) | Samples processed | Samples processed per second | Seconds of audio processed per second | Epoch time (s) |
---|---|---|---|---|---|---|---|
1x Titans | 154 | 83 | 72 | 128000 | 32 | 189 | 4013 |
4x Titans | 180 | 99 | 81 | 128000 | 106 | 632 | 1204 |
I could only fit a batch size of 32 per GPU in memory for the epoch.
I suggest adding a column for multi-GPU scaling to the final presentation.
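As a sketch of what such a scaling column might contain, the snippet below derives speedup and parallel efficiency from the samples-per-second figures in the table above. Defining scaling as throughput relative to the single-GPU run is an assumption.

```python
# GPUs -> samples processed per second, taken from the table above.
samples_per_sec = {1: 32.0, 4: 106.0}

baseline = samples_per_sec[1]
scaling = {}
for gpus, rate in sorted(samples_per_sec.items()):
    speedup = rate / baseline
    scaling[gpus] = {"speedup": speedup, "efficiency": speedup / gpus}

for gpus in sorted(scaling):
    print(gpus, round(scaling[gpus]["speedup"], 2),
          round(scaling[gpus]["efficiency"], 2))
```

For the 4x Titan run this gives a speedup of about 3.31x, or roughly 83% scaling efficiency.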
@SeanNaren, any particular reason you are not using cudnn.BatchNormalization in BatchBRNN.lua and use nn.BatchNormalization instead? Thanks for your work on this!
@ngimel cudnn.BatchNormalization only supports batchsize < 1024 in inference mode. I think this is the point.
@ngimel What @seed93 said :) Thanks for the changes on the benchmark, I'll find time to re-run these!
EDIT: though if you manage to find a way we could use cuDNN batch norm please let me know (we could add cudnn.BatchNorm since we are only training, would be up for opinions on this)!
@SeanNaren, there are a few ways: 1) you can use cudnn batchnorm for training, and switch to nn for inference, since the modules are compatible. 2) the cudnn bindings can be modified for inference to call cudnn batchnorm a few times with smaller batch sizes - bn at inference time is a pointwise operation, so results should not be affected. 3) I'll also check if this limitation can be removed for future cudnn versions. We've seen speedups using cudnn.BatchNorm instead of nn.BatchNorm, so I see no reason not to use it for the training benchmark.
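To illustrate why option 2 should not change results, here is a NumPy sketch (not the cuDNN or Torch API) showing that inference-mode batch norm uses fixed running statistics, so applying it in smaller chunks gives the same output as applying it to the whole batch. The function names and the chunk size of 512 are illustrative.

```python
import numpy as np


def batchnorm_inference(x, mean, var, gamma, beta, eps=1e-5):
    """Inference-mode batch norm: element-wise, using fixed running statistics."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def batchnorm_inference_chunked(x, mean, var, gamma, beta, chunk=512, eps=1e-5):
    """Same computation, applied to slices of at most `chunk` rows at a time."""
    out = np.empty_like(x)
    for start in range(0, x.shape[0], chunk):
        stop = start + chunk
        out[start:stop] = batchnorm_inference(x[start:stop], mean, var,
                                              gamma, beta, eps)
    return out


rng = np.random.RandomState(0)
features = 16
x = rng.randn(3000, features)             # batch larger than the chunk size
mean = rng.randn(features)                # stand-in running mean
var = rng.rand(features) + 0.5            # stand-in running variance
gamma = rng.randn(features)
beta = rng.randn(features)

whole = batchnorm_inference(x, mean, var, gamma, beta)
chunked = batchnorm_inference_chunked(x, mean, var, gamma, beta)
```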
@ngimel sounds fair will modify the benchmark to use cuDNN BatchNorm! Don't want to spam this issue but want to address a few things with the torch implementation, I'll open a separate issue.