flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

CPU training can't complete epoch 1 #210

Closed: duytruong closed this issue 5 years ago

duytruong commented 5 years ago

Hello,

I built wav2letter with the CPU option and ran it on an AWS EC2 instance (c5.2xlarge: 8 vCPU / 16 GB RAM / Ubuntu 16.04 LTS). I followed the 1-librispeech_lean tutorial, but the training job has been running for ~18 hours and is still stuck in epoch 1.

...
epoch:        1 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:00 | bch(ms): 982.11 | smp(ms): 0.71 | fwd(ms): 206.30 | crit-fwd(ms): 31.82 | bwd(ms): 747.02 | optim(ms): 26.07 | loss:   45.20162 | train-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1591 | avg-tsz: 277 | max-tsz: 277 | hrs:    0.02 | thrpt(sec/sec): 64.80
epoch:        1 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:00 | bch(ms): 590.51 | smp(ms): 0.62 | fwd(ms): 147.60 | crit-fwd(ms): 18.88 | bwd(ms): 414.77 | optim(ms): 25.94 | loss:   37.17485 | train-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1214 | avg-tsz: 218 | max-tsz: 218 | hrs:    0.01 | thrpt(sec/sec): 82.23
epoch:        1 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:01 | bch(ms): 1076.54 | smp(ms): 0.60 | fwd(ms): 209.85 | crit-fwd(ms): 34.41 | bwd(ms): 837.83 | optim(ms): 26.20 | loss:   45.21467 | train-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1642 | avg-tsz: 290 | max-tsz: 290 | hrs:    0.02 | thrpt(sec/sec): 61.01
epoch:        1 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:00 | bch(ms): 510.07 | smp(ms): 0.63 | fwd(ms): 167.87 | crit-fwd(ms): 24.19 | bwd(ms): 314.16 | optim(ms): 25.76 | loss:   40.35663 | train-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1309 | avg-tsz: 244 | max-tsz: 244 | hrs:    0.01 | thrpt(sec/sec): 102.65
epoch:        1 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:00 | bch(ms): 789.98 | smp(ms): 0.56 | fwd(ms): 188.92 | crit-fwd(ms): 29.86 | bwd(ms): 572.59 | optim(ms): 26.05 | loss:   43.93547 | train-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1491 | avg-tsz: 249 | max-tsz: 249 | hrs:    0.02 | thrpt(sec/sec): 75.50
...

Is this normal for CPU training, or did something go wrong? Thanks for your help!

jacobkahn commented 5 years ago

@duytruong — the CPU backend is a lot slower than the CUDA backend, so training will be slow, especially on Librispeech. A few questions:

duytruong commented 5 years ago

@jacobkahn

jacobkahn commented 5 years ago

@duytruong — if operations are completing almost immediately, this sounds like an issue reading in the data. Can you add some logging to the training pipeline to confirm that data is actually being loaded? Check here, perhaps: Train.cpp#L490.
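For illustration, here is a minimal sketch of the kind of logging that could verify the data pipeline. It assumes the per-batch loop in Train.cpp iterates over a dataset yielding a `std::vector<af::array>` indexed by `kInputIdx` / `kTargetIdx`; the variable names (`curDataset`, `batch`) are assumptions and may differ in your checkout:

```cpp
// Hypothetical debug logging for the per-batch training loop in Train.cpp.
// curDataset, batch, kInputIdx, and kTargetIdx are assumed names; adapt them
// to whatever the surrounding loop actually uses.
#include <chrono>
#include <glog/logging.h>

auto loadStart = std::chrono::steady_clock::now();
for (auto& batch : *curDataset) {
  auto loadEnd = std::chrono::steady_clock::now();
  double loadMs =
      std::chrono::duration<double, std::milli>(loadEnd - loadStart).count();

  // If the input/target element counts are zero, or every batch loads in
  // ~0 ms, the dataset is probably not being read correctly.
  LOG(INFO) << "batch load: " << loadMs << " ms"
            << " | input elements: " << batch[kInputIdx].elements()
            << " | target elements: " << batch[kTargetIdx].elements();

  // ... existing forward / criterion / backward / optimizer step ...

  loadStart = std::chrono::steady_clock::now();
}
```

If the element counts look sane and the per-batch load times are non-trivial, the slowness is more likely the CPU backend itself rather than the data pipeline.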

duytruong commented 5 years ago

@jacobkahn Thanks, I'll give it a try.

jacobkahn commented 5 years ago

Closing due to inactivity — feel free to reopen if you're still having trouble.