What type of audio file do you use? Is it already 16 kHz .flac?
I have encountered large audio files, and each epoch took a long time to run.
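In case it helps, this is roughly the conversion I have in mind (a sox sketch; filenames are placeholders):
```
# Resample to 16 kHz, mono, 16-bit FLAC before training (illustrative filenames)
sox input.wav -b 16 output.flac rate 16k channels 1
```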
Update:
1 machine, 8 GPUs: around 2000 sec/sec
2 machines: 1000 sec/sec
4 machines: around 2000 sec/sec
The same kind of scaling regression happened when going from 1x 4-GPU machine to 2.
A similar issue happened when scaling to multiple GPUs on a single machine: 2 V100s gave 800 sec/sec, and so did 4, but 8 gave 2300 sec/sec.
I tried tuning a bunch of NCCL and MPI flags and only ever managed to make multi-machine training slower.
This is Google Cloud, with NVLink 16 GB SXM V100s and a 32 Gbit network between machines that is not saturated (I see up to about 24 Gbit peak).
Any idea what is going on to cause nonlinear scaling with this model?
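For reference, these are the kinds of NCCL / Open MPI knobs I experimented with (values are illustrative, $TRAIN_BIN and the host list are placeholders for my setup, and none of this helped):
```
export NCCL_DEBUG=INFO          # print NCCL transport/topology selection
export NCCL_SOCKET_IFNAME=eth0  # pin NCCL traffic to the fast interface

mpirun -n 16 -H host1:8,host2:8 \
    -x NCCL_DEBUG -x NCCL_SOCKET_IFNAME \
    --mca btl_tcp_if_include eth0 \
    "$TRAIN_BIN" train --flagsfile train.cfg
```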
Hi,
hrs in the log file denotes the number of hours processed. It should be 16 times more for 32 GPUs than for 2 GPUs, so something seems wrong with the setup. Can you send the complete training logs for the 2-GPU and 32-GPU runs?
I'm assuming you kept the batchsize the same in both settings.
I'm not sure what you're expecting to get out of the full log. It's the same as the log for a single machine, but with some mpirun output and a bunch of overlapping prints (e.g. 32 processes printing at the same time).
The hrs is correct, it's for an entire epoch. The two test machines have slightly different datasets, and the hrs I posted represent one epoch on their respective datasets.
Batchsize is the same for all tests (8). I copied the base flags exactly from your streaming convnets recipe, and I used the same mpirun -n N command for each test, where N controls the number of GPUs: N = 2, 4, 8 for the single-machine test, and N = 8, 16, 32 for the 1-, 2-, and 4-machine tests.
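Concretely, the invocations were along these lines ($TRAIN_BIN stands in for the training binary, and the host list is a placeholder):
```
# N local GPUs on one machine (N = 2, 4, 8)
mpirun -n 8 "$TRAIN_BIN" train --flagsfile train.cfg

# 1, 2, or 4 machines with 8 GPUs each (N = 8, 16, 32)
mpirun -n 32 -H host1:8,host2:8,host3:8,host4:8 "$TRAIN_BIN" train --flagsfile train.cfg
```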
I don't have logs from these tests at this time, so I'd need to spin up machines and run them all again to collect logs, and waiting for the output can be pretty time-consuming.
I am experiencing similar scaling issues as lunixbochs.
I did some more investigation, and for me the issue does not happen when using a smaller reportiters. (I am training on 6x RTX 8000.)
The first interval of each epoch is relatively slow (1000 sec/sec); all other intervals in the same epoch are over 6000 sec/sec.
Increasing reportiters increases the time the first interval takes. If I set reportiters to 0, the runtime is several hours for 1 epoch; when I use reportiters 1000, the same epoch takes ~30 minutes. I notice a lot of reloads in nvidia-smi during the first interval and barely any in the later intervals.
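For reference, the relevant lines in my flagsfile look like this (values as described below; everything else follows the recipe):
```
--reportiters=500   # log/validate every 500 updates instead of once per epoch (0)
--batchsize=48
```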
This is the output with reportiters=500. If I set reportiters=0, it processes the full epoch at the same slow speed as the first interval with reportiters=500. Batchsize = 48 in this case (it seems to be too large only for the first interval).
```
I0327 10:58:53.854261 28948 Train.cpp:359] epoch: 35 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:06:30 | bch(ms): 782.00 | smp(ms): 9.62 | fwd(ms): 451.04 | crit-fwd(ms): 26.74 | bwd(ms): 296.97 | optim(ms): 22.03 | loss: 10.43004 | train-TER: 44.17 | train-WER: 63.59 | a/2/dev.lst-loss: 4.81893 | a/2/dev.lst-TER: 12.49 | a/2/dev.lst-WER: 39.79 | a/1/dev.lst-loss: 4.09435 | a/1/dev.lst-TER: 11.79 | a/1/dev.lst-WER: 30.87 | avg-isz: 449 | avg-tsz: 024 | max-tsz: 120 | hrs: 179.88 | thrpt(sec/sec): 1656.20
I0327 11:00:59.869719 28948 Train.cpp:359] epoch: 35 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:01:51 | bch(ms): 222.75 | smp(ms): 9.40 | fwd(ms): 140.94 | crit-fwd(ms): 8.88 | bwd(ms): 65.57 | optim(ms): 2.59 | loss: 6.82596 | train-TER: 21.48 | train-WER: 41.96 | a/2/dev.lst-loss: 4.80092 | a/2/dev.lst-TER: 13.00 | a/2/dev.lst-WER: 40.19 | a/1/dev.lst-loss: 4.24256 | a/1/dev.lst-TER: 13.47 | a/1/dev.lst-WER: 33.86 | avg-isz: 500 | avg-tsz: 026 | max-tsz: 117 | hrs: 200.07 | thrpt(sec/sec): 6467.00
```
@lunixbochs Is there any tutorial for training on multiple machines using the CPU on each machine? Can you explain what you did? Thank you.
Multi-machine training wasn't faster for me on this model anyway.
I would also be curious to know the training time. I have been using a custom dataset and getting 8h per epoch (35 days for 100 epochs) with 2x 2080 Ti. For reference, in February 2019 the 100M-parameter architecture in the original wav2letter++ paper took 5 days to converge with the same dataset and hardware. Seems a bit odd.
How many hours of data @AVSuni? I get around 2600 sec/sec with 8 V100s on Google Cloud. What are your avg/min/max isz and tsz? That can affect perf too.
As per this: https://github.com/facebookresearch/wav2letter/issues/577, you should definitely fork Facebook's librivox model with transfer learning if you're training sconv; it will save you a lot of time and converge much better than training from scratch (even if you're not training English).
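A rough sketch of what I mean by forking ($TRAIN_BIN and the paths are placeholders for your setup; you would swap the output layer in the arch file for your own token set first):
```
# Start training from the pretrained acoustic model instead of from scratch
"$TRAIN_BIN" fork /path/to/pretrained_librivox_am.bin \
    --flagsfile train_transfer.cfg
```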
Thanks @lunixbochs. This is with only 3000h of data. I don't filter out any data at the training stage, but I have cleaned the dataset earlier. I got fantastic results with this dataset last year (validation WER around 5%). The audio files (wav) are mostly about 15s in duration. During training of streaming convnets, I get about 925 sec/sec. I will try to change the final layer on the pretrained model. Thanks for the tip.
Update: I changed the final layer on the pretrained model. It helped very little, so I continued to train a new model that started to converge quickly after 5 epochs. With 16k wordpiece tokens I get 925 sec/sec, with 10k tokens 1050 sec/sec and with 31 tokens (char level) 1145 sec/sec.
Hi @AVSuni,
Did your WER decrease soon after training started with your 3000h dataset? Could you share your config file?
I tried training a streaming convnet with 1000h of French data, real-world audio recordings (meeting conversations, conferences, phone interviews) in wav format, using the librispeech streaming convnet config file and architecture file, but it wouldn't learn after 10 epochs. Train-WER/TER remain around 99-100%. I created lexicon/token files following the librispeech data prep recipe. I've also tried different configs with various batch sizes (8 to 64), learning rates (0.1 to 2), warmup/no warmup, momentum/no momentum, but nothing changes.
I've trained Kaldi and ESPnet models that work fine with this dataset, so maybe some parameter in the config is not correctly set. Any idea about this behaviour, or known difficulties in making streaming models converge?
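For reference, the kind of settings I varied look like this in the flagsfile (values are illustrative, not an exact config; flag names follow the librispeech recipe as I remember it):
```
--batchsize=16    # tried 8 to 64
--lr=0.4          # tried 0.1 to 2
--momentum=0.6    # also tried without momentum
--warmup=8000     # also tried without warmup
```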
cc @vineelpratap
@lunixbochs: Did you make any progress in speeding up training in a multi-GPU setup? Care to post any stats here - # GPUs, hours/days/weeks trained, dataset size, etc.?
It would help a bunch of us in making decisions around getting our own hardware/paying for cloud usage.
Thanks!
@vineelpratap:
"hrs in the log file denotes the number of hours processed"
This is from your previous reply in this thread. Can you please clarify it further? I am not sure what you meant by "hours processed" here. Thanks!
My last info about multi-GPU is here. wav2letter seems to have non-GPU bottlenecks when used with V100s that I haven't identified. I don't currently recommend distributed training unless you use DGX-class hardware, and I'm not generally interested in answering further questions about distributed training.
I trained the example librispeech model following the librispeech streaming convnet config file and architecture file. The log on 1 GPU is:
```
epoch: 1 | nupdates: 1000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:03:59 | bch(ms): 239.84 | smp(ms): 15.52 | fwd(ms): 75.68 | crit-fwd(ms): 9.04 | bwd(ms): 128.83 | optim(ms): 18.31 | loss: 51.60863 | train-TER: 89.33 | train-WER: 92.41 | dev-clean-loss: 45.90081 | dev-clean-TER: 91.36 | dev-clean-WER: 95.52 | dev-other-loss: 45.33203 | dev-other-TER: 90.39 | dev-other-WER: 95.39 | avg-isz: 1203 | avg-tsz: 036 | max-tsz: 076 | avr-batchsz: 8.00 | hrs: 26.75 | thrpt(sec/sec): 401.51 | timestamp: 2020-12-31 01:39:19
Memory Manager Stats
Type: CachingMemoryManager
Device: 0, Capacity: 11.75 GiB, Allocated: 7.26 GiB, Cached: 6.82 GiB
Total native calls: 345(mallocs), 0(frees)
I1231 01:43:56.902114 400780 Train.cpp:573] epoch: 1 | nupdates: 2000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:03:55 | bch(ms): 235.30 | smp(ms): 12.93 | fwd(ms): 73.17 | crit-fwd(ms): 9.35 | bwd(ms): 132.03 | optim(ms): 16.90 | loss: 44.52647 | train-TER: 82.72 | train-WER: 92.20 | dev-clean-loss: 34.29783 | dev-clean-TER: 90.51 | dev-clean-WER: 93.77 | dev-other-loss: 32.26878 | dev-other-TER: 90.55 | dev-other-WER: 94.77 | avg-isz: 1216 | avg-tsz: 036 | max-tsz: 072 | avr-batchsz: 8.00 | hrs: 27.03 | thrpt(sec/sec): 413.53 | timestamp: 2020-12-31 01:43:56
```
On 8 GPUs it is:
```
epoch: 1 | nupdates: 1000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:09:21 | bch(ms): 561.62 | smp(ms): 54.10 | fwd(ms): 78.23 | crit-fwd(ms): 9.23 | bwd(ms): 409.94 | optim(ms): 20.91 | loss: 52.50099 | train-TER: 90.49 | train-WER: 94.64 | dev-clean-loss: 36.04512 | dev-clean-TER: 87.67 | dev-clean-WER: 93.39 | dev-other-loss: 34.13421 | dev-other-TER: 87.61 | dev-other-WER: 94.34 | avg-isz: 1238 | avg-tsz: 037 | max-tsz: 074 | avr-batchsz: 8.00 | hrs: 220.14 | thrpt(sec/sec): 1411.11 | timestamp: 2020-12-31 02:17:43
Memory Manager Stats
Type: CachingMemoryManager
Device: 0, Capacity: 11.75 GiB, Allocated: 7.26 GiB, Cached: 6.81 GiB
Total native calls: 307(mallocs), 0(frees)
I1231 02:26:58.803328 400900 Train.cpp:573] epoch: 1 | nupdates: 2000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:08:58 | bch(ms): 538.58 | smp(ms): 36.44 | fwd(ms): 72.44 | crit-fwd(ms): 9.30 | bwd(ms): 409.67 | optim(ms): 19.96 | loss: 43.93249 | train-TER: 81.77 | train-WER: 91.79 | dev-clean-loss: 35.11845 | dev-clean-TER: 91.56 | dev-clean-WER: 94.40 | dev-other-loss: 33.08693 | dev-other-TER: 92.18 | dev-other-WER: 95.44 | avg-isz: 1234 | avg-tsz: 037 | max-tsz: 076 | avr-batchsz: 8.00 | hrs: 219.38 | thrpt(sec/sec): 1466.41 | timestamp: 2020-12-31 02:26:58
```
Why is it slower?
@jinggaizi could you provide details on the flashlight/wav2letter commits with which you got these runtimes?
wav2letter is at commit 9747f428f918282f25ea1c3093e71a5a3e8c611d and flashlight is at 2cb61a73f28775d35cf561bd8b0772b4accf5025. I ran the task on Titan Vs.
1 GPU:
```
fl_asr_train train -flagsfile librispeech/train_am_500ms_future_context.cfg --minloglevel=0 --logtostderr=1
```
8 GPUs:
```
mpirun -n 8 fl_asr_train train -flagsfile librispeech/train_am_500ms_future_context.cfg --minloglevel=0 --logtostderr=1
```
What is the CUDA version?
cc @jacobkahn @vineelpratap @avidov
The CUDA version is 10.0.
Could you tell us what -DCMAKE_BUILD_TYPE you set? Please use either Release or RelWithDebInfo (not Debug).
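For example, when configuring the build (a generic CMake sketch, not a full build recipe):
```
# Configure an optimized build; Debug builds are much slower
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j"$(nproc)"
```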
I'm training streaming convnets on only 5,500 hours with 32x V100s, and I'm only getting an epoch every 2 hours. So training to 110 epochs as you did would take me 9 days, or 90 days if I trained on 60k hours as you did. I get the feeling that you didn't actually spend 90 days training it and that something is wrong with my training setup.
What sort of sec/sec and epoch times did you get when training streaming convnets on 32 V100s?
I'm training this on two setups right now:
32x V100s, 4 machines with 8x 16GB SXM + nvlink, 5.5k hours:
2x V100s, 1 machine with 2x 32GB PCIe + nvlink, 2.8k hours:
I'm using an almost identical arch and flagsfile to this recipe: https://github.com/facebookresearch/wav2letter/tree/master/recipes/models/streaming_convnets/librispeech
Edit: With just one of the 8x machines, I got 2154.59 sec/sec for 1k iters, so something is really not working well here :/