NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Speech recognition V100 not faster than 1080Ti #340

Closed byan23 closed 5 years ago

byan23 commented 5 years ago

I was trying to measure how much speedup I can get by using Volta (V100-PCIE) but found none; it even seems a bit slower. On both machines I used the nvcr.io/nvidia/tensorflow:18.12-py3 docker image. I tried both DeepSpeech2 and w2lplus running with Horovod on 2 GPUs (2x V100 vs. 2x 1080 Ti), with batch_size > 8 (since I've heard Volta doesn't play well with small batch sizes).

Since starting the training is fairly simple and straightforward, I currently have no clue what might go wrong.

Something that I've verified:

  1. GPU-util pattern during training: it stayed at 70-100%, dropped to 0% for a few seconds, then went back to 70-100%.
  2. I captured a TensorFlow timeline during OS2S training (see the sketch after this list) and found that even the Conv2D ops are slower on the V100 cards.
  3. I ran the tf_cnn_benchmarks in the same container on the same cards (2x 1080 Ti vs. 2x V100). V100 gives a 1.5x speedup compared with 1080 Ti, so the environment seems fine.
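
For reference, a minimal sketch of capturing such a TensorFlow timeline with the TF 1.x Session API; this is the generic pattern rather than the exact code added to OS2S, and the toy matmul graph below stands in for a real training step:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Toy graph standing in for one real training step.
a = tf.random_normal([1024, 1024])
b = tf.random_normal([1024, 1024])
train_op = tf.matmul(a, b)

# Ask the runtime to record per-op timing for this step.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Convert the step stats into a Chrome trace; the resulting JSON
# can be opened in chrome://tracing to inspect per-op GPU time.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())
```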
borisgin commented 5 years ago

Can you send the system config and the log file for w2lplus, please?

byan23 commented 5 years ago

Attached is a w2lplus run on the V100 node using 2 GPUs (Horovod, so the num_gpus value in the log doesn't count): stdout_w2lplus.log

Here is my system config:

  * CPU: Intel Xeon E5-2680 v4 x2
  * Mem: 256GB
  * GPU: Tesla V100-PCIE-16GB x8
  * Disk: 1TB SSD
  * OS: Ubuntu 16.04
  * NVIDIA driver version: 410.72
  * Docker image: nvcr.io/nvidia/tensorflow:18.12-py3

byan23 commented 5 years ago

Also, check out the timelines I mentioned. These were from runs with w2l_large_8gpus.py but with batch_size=16.

V100: [timeline screenshot]

1080 Ti: [timeline screenshot]

borisgin commented 5 years ago

For benchmarking of W2L+ on GV100, I would suggest:

  1. removing eval and summaries, so you measure pure training throughput;
  2. using "mixed" precision instead of float32;
  3. ignoring the warm-up at the start of training when timing;
  4. keeping in mind that multi-GPU scaling depends on the interconnect (NVLink vs. PCIE).

byan23 commented 5 years ago

Thank you for the advice. I'll try removing eval and summaries.

"mixed" is my next step. I'd like to first understand why V100 is not giving me speedup. It's just bizarre... I have noticed the warm-up effect and I've had runs training for hours but still don't see the speedup.

My cards are V100-PCIE, so no NVLink.

borisgin commented 5 years ago

This is strange; we optimized the net for mixed precision, but even with float32 a V100 is usually much faster than a 1080 Ti. What about one V100 vs. one 1080 Ti?

mehmedes commented 5 years ago

Just wondering what the expected speedup should actually be? In fairseq they report a speedup by a factor of 3 from using mixed precision: https://arxiv.org/abs/1806.00187

In T2T, with FP16 activation + weight = float16 and Adafactor, the max batch size can be increased by a factor of 4 vs. FP32! The throughput increases by a factor of 2, but it is still not as fast as fairseq.

PROBLEM=summarize_cnn_dailymail32k
MODEL=transformer
HPARAMS=transformer_tpu (adafactor, activation_dtype + weight_dtype=float16)
|  | adafactor + FP32 | adafactor + FP16 activation | adafactor + FP16 activation + weight = float16 |
| --- | --- | --- | --- |
| max. batch size | 8,192 * 2 Volta = 16,384 | 16,384 * 2 Volta 32GB SXM = 32,768 | 32,768 * 2 Volta 32GB SXM = 65,536 |
| steps/sec | 2.6 | 1.95 | 1.3 |
| batch size * steps/sec | 42,598 | 63,898 | 85,197 |

https://github.com/tensorflow/tensor2tensor/issues/1221#issuecomment-455760941

borisgin commented 5 years ago

For Jasper the speedup is 2-3x. DS2 is more tricky; it requires new cuDNN kernels for the strided conv layer. The speedup heavily depends on the batch size (~GPU DRAM), how the framework scales for multi-GPU (NVLink, Horovod, ...), the efficiency of the host data layer (IO and CPU speed, parallel data layer, caching, queue size), and how the framework selects the cuDNN algorithm for a specific layer. For example, the current Transformer in OS2S has a 2-3x speedup. One could speed up the Transformer further through a faster CUDA implementation of layer norm, etc.

borisgin commented 5 years ago

For DS2: I added a new data_format to DeepSpeech2: "BCFT" (Batch x Ch x Freq x Time). It gives fast training starting from cuDNN 7.4.1. Please:

  1. update to CUDA 10 and cuDNN 7.4.1;
  2. pull the latest OpenSeq2Seq;
  3. change ds2_large_8gpus_mp.py as follows (sketch below):
     - "num_gpus": 1, # to eliminate multi-GPU sync overhead
     - "data_format": "BCFT", # instead of "channel_first"
     - "batch_size_per_gpu": 64 # for V100-32GB; use 32 for V100-16GB

On my machine I see a speedup of ~1.2x vs. float32 for 1 GPU.
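
A minimal sketch of those ds2_large_8gpus_mp.py edits, showing only the keys named above; the exact placement of "data_format" (top level vs. the encoder section) may depend on the OpenSeq2Seq version:

```python
# ds2_large_8gpus_mp.py (fragment, sketch only): keys mentioned in the comment
# above; every other parameter is assumed to stay as in the shipped config.
base_params = {
    # ... optimizer, learning rate, data-layer parameters unchanged ...
    "num_gpus": 1,             # eliminate multi-GPU synchronization overhead
    "batch_size_per_gpu": 32,  # 64 for V100-32GB, 32 for V100-16GB

    "encoder_params": {
        # ... conv/RNN layer definitions unchanged ...
        "data_format": "BCFT",  # Batch x Ch x Freq x Time, instead of channels-first
    },
}
```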

mrgloom commented 5 years ago

@byan23 What did you use to create the plot with the ops timeline?