bytedance / neurst

Neural end-to-end Speech Translation Toolkit

How do I set up the command so that speech_transformer/agument_librispeech trains on a GPU? #55

Closed jannicaTan closed 2 years ago

jannicaTan commented 2 years ago

Hello, I am currently running the speech_transformer/agument_librispeech code with TensorFlow 2.4.1, and the earlier steps completed without problems. For "Training with validation" I used the sample command:

python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr_st/asr_training_args.yml,/path_to_data/asr_st/asr_validation_args.yml \
    --hparams_set speech_transformer_s \
    --model_dir /path_to_data/asr_st/asr_benchmark

Training turned out to be very slow, and on inspection it was running on the CPU rather than the GPU.

  1. How should I set up the command so that the GPU is used?
  2. When I set `--update_cycle n --batch_size 120000//n`, I get error: argument --batch_size: invalid int value: '120000//n'. Should the n here be replaced with a number?
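A quick way to check whether this TensorFlow build can see a GPU at all is the standard `tf.config` API (an empty list means TensorFlow is running CPU-only, typically because the CUDA libraries failed to load):

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"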

zhaocq-nlp commented 2 years ago

Hi,

  1. Are there any CUDA-related messages in the log? The CUDA version may not match.
  2. n needs to be replaced with the number of gradient-accumulation steps you want; see the command sketch below.
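
For example, with n = 2 the flags would look as follows (illustrative values; 120000 // 2 = 60000, so gradients from 2 sub-batches are accumulated before each parameter update, keeping the effective batch size at 120000):

python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr_st/asr_training_args.yml,/path_to_data/asr_st/asr_validation_args.yml \
    --hparams_set speech_transformer_s \
    --model_dir /path_to_data/asr_st/asr_benchmark \
    --update_cycle 2 \
    --batch_size 60000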
jannicaTan commented 2 years ago

> Hi,
>
>   1. Are there any CUDA-related messages in the log? The CUDA version may not match.
>   2. n needs to be replaced with the number of gradient-accumulation steps you want.

Hello, I noticed that my problem is similar to issue #56. I also checked my versions: CUDA is 11.2 and TensorFlow is 2.4.1. I tried switching to TensorFlow 2.5.0 or above, which is compatible with CUDA 11.2, but that triggered the problem from #54, so I switched back.
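One way to check which CUDA version an installed TensorFlow wheel was built against (`tf.sysconfig.get_build_info` is available from TF 2.3 onward) is:

python3 -c "import tensorflow as tf; print(tf.sysconfig.get_build_info()['cuda_version'])"

TF 2.4.x wheels were built against CUDA 11.0, while TF 2.5+ targets CUDA 11.2, which is consistent with the version conflict described above.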

The last lines of the log are:

. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
I0526 22:41:14.333542 140599115236032 configurable.py:296] Saving model configurations to directory: path_to_data/asr_st/asr_benchmark
2022-05-26 22:42:02.487136: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 2663055360 exceeds 10% of free system memory.
zhaocq-nlp commented 2 years ago

`W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 2663055360 exceeds 10% of free system memory` means you are running out of memory; try lowering the batch size.
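
A sketch of this suggestion combined with gradient accumulation, so the per-step memory footprint drops while the effective batch size stays at 120000 (values are illustrative):

python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr_st/asr_training_args.yml,/path_to_data/asr_st/asr_validation_args.yml \
    --hparams_set speech_transformer_s \
    --model_dir /path_to_data/asr_st/asr_benchmark \
    --update_cycle 4 \
    --batch_size 30000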

jannicaTan commented 2 years ago

> `W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 2663055360 exceeds 10% of free system memory` means you are running out of memory; try lowering the batch size.

I lowered the batch size and also added more GPUs, which solved the problem. Thanks!