NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

speech recognition training time #397

Closed: inchpunch closed this issue 5 years ago

inchpunch commented 5 years ago

I am using 4 GPUs (Tesla V100-SXM2-32GB). Apart from changing the number of GPUs, I used the example configuration files as-is, so all other parameters are the same as the originals. For ds2_large_mp, Jasper_10x3_mp, and w2l_plus_large_mp, the "time per step" is around 1~2 seconds in all cases. Is that expected?

borisgin commented 5 years ago

Do you use the NVIDIA container?

inchpunch commented 5 years ago

Yes, I used version 18.12

borisgin commented 5 years ago

Can you attach a complete log file for Jasper, please?

inchpunch commented 5 years ago

I am generating it... BTW, I changed the greedy decoder to a beam search decoder with beam size 1 (and without using an LM). I think that is equivalent, so it should not change the speed. That is the only change in the code.

The code I modified is in fc_decoders.py (starting at line 240):

```python
else:
    def decode_without_lm(logits, decoder_input, merge_repeated=True):
        if logits.dtype.base_dtype != tf.float32:
            logits = tf.cast(logits, tf.float32)
        # decoded, neg_sum_logits = tf.nn.ctc_greedy_decoder(
        #     logits, decoder_input['encoder_output']['src_length'],
        #     merge_repeated,
        # )
        decoded, neg_sum_logits = tf.nn.ctc_beam_search_decoder(
            logits, decoder_input['encoder_output']['src_length'],
            self.params['beam_width'], 1, merge_repeated=False,
        )
        return decoded
```

and in the configuration file, in base_params, I set:

"decoder_params": {

# params for decoding the sequence with language model
"beam_width": 1,
inchpunch commented 5 years ago

This is what has been printed so far. It then waits a while before producing the validation WER. I will add more output if needed. Thanks a lot.

jasper_logs.txt

borisgin commented 5 years ago

Can you re-run with a few changes, please: 'num_gpus': 1, 'save_checkpoint_steps': 10000, 'eval_steps': 10000
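
That is, in the config's base_params, roughly the following (a sketch; only the three keys above are the requested changes, everything else stays as in the shipped example config):

```python
# Sketch of the requested overrides in the example config's base_params;
# all other entries stay as shipped.
base_params = {
    # ... rest of the example config unchanged ...
    "num_gpus": 1,
    "save_checkpoint_steps": 10000,
    "eval_steps": 10000,
}
```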

inchpunch commented 5 years ago

Sure, here is what I got: jasper_logs_1gpu.txt

inchpunch commented 5 years ago

I just found that our GPUs and our programs/dataset are not in the same physical location, so data loading/access time is probably long due to the long-distance connection. I will check again after moving my programs and dataset to the same location as the GPUs.

inchpunch commented 5 years ago

I have moved my programs and data to the same location as the GPUs, but the speed does not change much: still around 1~2 seconds per step for Jasper 10x3, which translates to around 20 epochs per day.
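
As a rough sanity check of that number (assuming a per-GPU batch size of 32 and roughly 281k utterances in the full LibriSpeech training set; both figures are assumptions rather than values from my logs):

```python
# Back-of-envelope check of the "~20 epochs per day" estimate.
# Assumed: ~281k utterances (full 960 h LibriSpeech train set), batch size 32
# per GPU on 4 GPUs. These are assumptions, not values from the logs above.
utterances = 281_000
batch_per_gpu, num_gpus = 32, 4
steps_per_epoch = utterances / (batch_per_gpu * num_gpus)      # ~2,200 steps

for sec_per_step in (1.0, 1.5, 2.0):
    epochs_per_day = 86_400 / (steps_per_epoch * sec_per_step)
    print(f"{sec_per_step:.1f} s/step -> {epochs_per_day:.1f} epochs/day")
# 2.0 s/step -> ~19.7 epochs/day, consistent with the ~20 epochs/day above.
```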

For Jasper 5x3 (keeping one block for each of B1 to B5), with the same batch size per GPU and 4 GPUs in total, the time per step is about 0.8 seconds.

I recall that the small DS2 model is said to train in 1 day on a single GPU with 12 GB of memory. That is for 12 epochs, and it only uses librivox-train-clean-100 and librivox-train-clean-360.

So does this speed for Jasper 10x3 still look as expected?