ASR Training in Must-C Example Stuck in Pipeline

HildaNya commented 2 years ago

Hi. I'm working on the ASR training step from the Must-C example. After executing

python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr_st/asr_training_args.yml,/path_to_data/asr_st/asr_validation_args.yml \
    --hparams_set speech_transformer_s \
    --model_dir /path_to_data/asr_st/asr_benchmark

the training process became stuck for days.

Looking at the output, it seems to have stalled at "Training for 200000 steps...Saving model configurations to directory:..". One of the last lines of output is:

Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
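For reference, the change that warning describes could look roughly like the sketch below. It uses only the `tf.data` calls named in the warning text; the helper name `with_data_sharding` and the toy dataset are made up for illustration, and where such a call would belong inside NeurST's input pipeline is not something this thread establishes.

```python
import tensorflow as tf


def with_data_sharding(dataset: tf.data.Dataset) -> tf.data.Dataset:
    """Return the dataset with auto-sharding switched to the DATA policy.

    Hypothetical helper name; the calls inside are the ones quoted in the warning.
    """
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.DATA
    )
    return dataset.with_options(options)


if __name__ == "__main__":
    # Toy dataset standing in for the real training data (assumption).
    toy = with_data_sharding(tf.data.Dataset.range(8).batch(2))
    print(toy.options().experimental_distribute.auto_shard_policy)
```

Setting the policy on the dataset itself keeps the change local to the input pipeline, so the rest of the training setup does not need to be touched.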

I'm not sure whether it hit an error or is just slow (due to GPU incompatibility), so I'm looking for general ideas. Also, just to double-check: during training, are there supposed to be progress messages, or is it completely silent until training finishes?

Thanks!

zhaocq-nlp commented 2 years ago

Hi. There should be some output messages during training, e.g., global step=..., training loss=..., at the interval set by the argument summary_steps. Are there any messages about the CUDA environment? Does your GPU support mixed-precision training? By default, NeurST uses mixed-precision training, and you can switch to normal training via --dtype float32.

HildaNya commented 2 years ago

Thanks for the tip. I'm still trying to figure out the issue. Another dumb question though: what is the path where training weights are automatically stored? Is it the same path as the model configuration,

/path_to_data/asr_st/asr_benchmark

? Sorry for the trivial question. I'm just trying to double-check all possible factors that could go wrong. Thanks!

zhaocq-nlp commented 2 years ago

Hi. You need to change this path according to your file system.

HildaNya commented 2 years ago

Thanks for all the help. Turns out, it was running, just EXTREMELY slowly because I was using CPU instead of GPU.
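A quick way to catch this before starting a long run is to ask TensorFlow which GPUs it can see. The sketch below is a minimal check, assuming TensorFlow 2.x is the installed backend; the function name `visible_gpus` is made up for illustration.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see (empty list if none)."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed, so no GPU can be reported.
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print("GPUs visible to TensorFlow:", gpus)
    else:
        print("No GPU visible; training would fall back to CPU.")
```

An empty list usually points to a missing CUDA setup or a CPU-only TensorFlow build rather than a problem in the training script.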

zhaocq-nlp commented 2 years ago

Hi, just one more suggestion. Use the --dtype float32 option when running on CPU, because CPUs do not support mixed-precision computation.
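Put together with the original command from this thread, that suggestion might be sketched as follows. The paths and arguments are copied from the first post and `--dtype float32` comes from the comments above; the helper `build_train_command` and its `on_cpu` parameter are hypothetical, and the script only prints the command instead of running it.

```python
import shlex


def build_train_command(on_cpu: bool) -> list:
    """Build the run_exp command from this thread (hypothetical helper)."""
    cmd = [
        "python3", "-m", "neurst.cli.run_exp",
        "--config_paths",
        "/path_to_data/asr_st/asr_training_args.yml,"
        "/path_to_data/asr_st/asr_validation_args.yml",
        "--hparams_set", "speech_transformer_s",
        "--model_dir", "/path_to_data/asr_st/asr_benchmark",
    ]
    if on_cpu:
        # Switch off the default mixed-precision training on CPU.
        cmd += ["--dtype", "float32"]
    return cmd


if __name__ == "__main__":
    # Print the command as a copy-pasteable shell line.
    print(shlex.join(build_train_command(on_cpu=True)))
```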

HildaNya commented 2 years ago

You are a life-saver.

bytedance / neurst
