flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

CPU training can't complete epoch 1 #210

Closed: duytruong closed this issue 5 years ago

duytruong commented 5 years ago

Hello,

I built wav2letter with the CPU option and ran it on an AWS EC2 instance (c5.2xlarge: 8 vCPU / 16 GB RAM / Ubuntu 16.04 LTS). I followed the 1-librispeech_lean tutorial, but the training job has been running for ~18 hours and is still stuck in epoch 1.

...
epoch:        1 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:00 | bch(ms): 982.11 | smp(ms): 0.71 | fwd(ms): 206.30 | crit-fwd(ms): 31.82 | bwd(ms): 747.02 | optim(ms): 26.07 | loss:   45.20162 | train-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1591 | avg-tsz: 277 | max-tsz: 277 | hrs:    0.02 | thrpt(sec/sec): 64.80
epoch:        1 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:00 | bch(ms): 590.51 | smp(ms): 0.62 | fwd(ms): 147.60 | crit-fwd(ms): 18.88 | bwd(ms): 414.77 | optim(ms): 25.94 | loss:   37.17485 | train-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1214 | avg-tsz: 218 | max-tsz: 218 | hrs:    0.01 | thrpt(sec/sec): 82.23
epoch:        1 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:01 | bch(ms): 1076.54 | smp(ms): 0.60 | fwd(ms): 209.85 | crit-fwd(ms): 34.41 | bwd(ms): 837.83 | optim(ms): 26.20 | loss:   45.21467 | train-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1642 | avg-tsz: 290 | max-tsz: 290 | hrs:    0.02 | thrpt(sec/sec): 61.01
epoch:        1 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:00 | bch(ms): 510.07 | smp(ms): 0.63 | fwd(ms): 167.87 | crit-fwd(ms): 24.19 | bwd(ms): 314.16 | optim(ms): 25.76 | loss:   40.35663 | train-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1309 | avg-tsz: 244 | max-tsz: 244 | hrs:    0.01 | thrpt(sec/sec): 102.65
epoch:        1 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:00:00 | bch(ms): 789.98 | smp(ms): 0.56 | fwd(ms): 188.92 | crit-fwd(ms): 29.86 | bwd(ms): 572.59 | optim(ms): 26.05 | loss:   43.93547 | train-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1491 | avg-tsz: 249 | max-tsz: 249 | hrs:    0.02 | thrpt(sec/sec): 75.50
...

Is this normal for CPU training, or did something go wrong? Thanks for your help!

jacobkahn commented 5 years ago

@duytruong — the CPU backend is a lot slower than the CUDA backend, so training will be slow, especially on Librispeech. A few questions:

duytruong commented 5 years ago

@jacobkahn

jacobkahn commented 5 years ago

@duytruong — if operations are completing almost immediately, this sounds like an issue reading in the data. Can you add some logging to the training pipeline to confirm that data is actually being loaded? Check here, perhaps: Train.cpp#L490.
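For illustration, here is a minimal sketch of the kind of logging that could verify the data pipeline. It assumes the per-batch loop in Train.cpp iterates over a dataset yielding a `std::vector<af::array>` indexed by `kInputIdx` / `kTargetIdx`; the variable names (`curDataset`, `batch`) are assumptions and may differ in your checkout:

```cpp
// Hypothetical debug logging for the per-batch training loop in Train.cpp.
// curDataset, batch, kInputIdx, and kTargetIdx are assumed names; adapt them
// to whatever the surrounding loop actually uses.
#include <chrono>
#include <glog/logging.h>

auto loadStart = std::chrono::steady_clock::now();
for (auto& batch : *curDataset) {
  auto loadEnd = std::chrono::steady_clock::now();
  double loadMs =
      std::chrono::duration<double, std::milli>(loadEnd - loadStart).count();

  // If the input/target element counts are zero, or every batch loads in
  // ~0 ms, the dataset is probably not being read correctly.
  LOG(INFO) << "batch load: " << loadMs << " ms"
            << " | input elements: " << batch[kInputIdx].elements()
            << " | target elements: " << batch[kTargetIdx].elements();

  // ... existing forward / criterion / backward / optimizer step ...

  loadStart = std::chrono::steady_clock::now();
}
```

If the element counts look sane and the per-batch load times are non-trivial, the slowness is more likely the CPU backend itself rather than the data pipeline.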

duytruong commented 5 years ago

@jacobkahn Thanks, I'll give it a try.

jacobkahn commented 5 years ago

Closing due to inactivity — feel free to reopen if you're still having trouble.