flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Training WSJ #225

Closed hhadian closed 5 years ago

hhadian commented 5 years ago

This isn't really an issue report; I just want to ask a few questions to make sure everything is going well. Please let me know if I should ask this on some other forum.

I am training WSJ using a single Tesla K80 GPU with the default configs. I didn't see an option related to the number of GPUs in the configs.

I also did not see an option to set the number of epochs. So far it has trained for almost 36 hours. Here are the last few lines of 001_perf:

2019-03-01 12:08:34       32 5.600000 0.004000 01:12:35 465.64 0.31 184.36 18.72 247.35 7.06    1.92669  6.95  9.82 782 122 255   81.29 67.19
2019-03-01 13:21:23       33 5.600000 0.004000 01:12:34 465.53 0.31 184.24 18.62 247.34 7.07    1.88890  6.81  9.53 782 122 255   81.29 67.20
2019-03-01 14:34:18       34 5.600000 0.004000 01:12:40 466.14 0.31 184.66 18.97 247.51 7.08    1.85114  6.67  9.75 782 122 255   81.29 67.12
2019-03-01 15:47:07       35 5.600000 0.004000 01:12:34 465.48 0.31 184.30 18.65 247.25 7.06    1.82326  6.58  9.62 782 122 255   81.29 67.21

My questions:

  1. Can I assume that W2L can detect the number of available GPUs automatically and will adjust the learning rate accordingly?
  2. When will the training finish? How many more epochs will it run?
  3. What is TER and is the training going well?

Thanks in advance

jacobkahn commented 5 years ago

@hhadian:

  1. You need to explicitly specify the number of GPUs involved in training: flashlight uses MPI to spawn one process per GPU on the host (see the launch sketch after this list). The flashlight distributed docs should be helpful here. Gradients synchronized across all GPUs are scaled by the number of GPUs.
  2. Training continues until stopped. The iter flag lets you control how many epochs training continues over, but we don't have the ability to condition this on WER (this should be pretty easy to do via a simple modification to Train.cpp)
  3. If you're not seeing any logs in stdout/stderr, check out https://github.com/facebookresearch/wav2letter/issues/141; those logs should surface TER. Based on your log file, the train and dev set TERs should be in the last two columns on the right.
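
To make (1) and (2) concrete, a single-host launch might look roughly like the sketch below. The iter flag is the one mentioned above; the binary path and the other flag names (flagsfile, enable_distributed) are from memory of the distributed docs, so treat them as assumptions and check them against your build:

# Sketch: one MPI process per GPU on a 4-GPU host.
# Verify the binary path and flag names against your wav2letter build and
# the flashlight distributed docs; apart from iter, they are assumptions here.
mpirun -n 4 ./Train train \
  --flagsfile=train_wsj.cfg \
  --enable_distributed=true \
  --iter=100

With a launch like this, each process consumes one GPU and the gradients are synchronized across all four processes before each update.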
hhadian commented 5 years ago

Thanks for the answers. Do you mean there is no stopping criterion? How can I achieve the reported WER (3.5) on WSJ? I can see the TERs. I'm just not sure what TER means.

jacobkahn commented 5 years ago

@hhadian — TER is the "token error rate" (similar to LER/letter error rate). The acoustic model's emissions are a probability distribution over a set of tokens for a given frame: the AM doesn't emit words.

In order to turn those per-frame emissions into words (and compute WER/word error rate), the w2l decoder combines the emissions with a lexicon and language-model scores, and performs a beam search over candidate words for each frame. The decoder generates the final transcripts given the emissions from the acoustic model. Take a look at the docs for more.
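
If it helps to see the idea concretely: TER and WER are the same computation at different granularities, a length-normalized Levenshtein edit distance over tokens (letters) versus words. A minimal sketch in plain Python (illustrative only, not wav2letter code):

def edit_distance(ref, hyp):
    # Minimum number of substitutions, insertions and deletions needed to
    # turn ref into hyp (classic one-row dynamic program).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,         # deletion
                      d[j - 1] + 1,     # insertion
                      prev + (r != h))  # substitution (0 if the units match)
            prev, d[j] = d[j], cur
    return d[-1]

def error_rate(ref, hyp):
    return 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)

ref, hyp = "the cat sat", "the cat sad"
print(error_rate(list(ref), list(hyp)))      # letter level, analogous to TER
print(error_rate(ref.split(), hyp.split()))  # word level, i.e. WER

So a low TER only says the acoustic model predicts the token sequence well; getting a competitive WER is what the decoder (lexicon + LM + beam search) is for.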

hhadian commented 5 years ago

OK, cool. I guess you missed my other question. It seems a bit strange not to have any stopping criteria. When should the training be stopped?

jacobkahn commented 5 years ago

When should the training be stopped?

There's no single right answer to this; it's also an open research question. In general, with a good model, convergence is relatively easy to recognize from a simple per-epoch dev-LER plot: it's fairly obvious when the model stops improving (there will typically be some oscillation around a minimum, with LER no longer dropping much over time).
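
For instance, a quick way to get such a plot from your run is to parse 001_perf and graph a per-epoch dev error-rate column. This is just a sketch: the epoch and dev-TER column positions below are assumptions based on the lines you pasted, so adjust them to your log layout:

import matplotlib.pyplot as plt

DEV_TER_COL = -1   # assumption: dev TER is the last column, per the layout above
epochs, dev_ter = [], []
with open("001_perf") as f:
    for line in f:
        parts = line.split()
        if len(parts) < 5 or not parts[2].isdigit():
            continue                   # skip any header or malformed lines
        epochs.append(int(parts[2]))   # third column is the epoch number
        dev_ter.append(float(parts[DEV_TER_COL]))

plt.plot(epochs, dev_ter)
plt.xlabel("epoch")
plt.ylabel("dev TER (%)")
plt.show()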

I recently trained a baseline model with the librispeech architecture from the open source tutorial on another dataset, and it produced a curve like this (this is LER on the dev set):

[screenshot: per-epoch LER on the dev set]