bytedance / neurst

Neural end-to-end Speech Translation Toolkit

How do I set up the command so that speech_transformer/agument_librispeech trains on a GPU? #55

Closed jannicaTan closed 2 years ago

jannicaTan commented 2 years ago

Hello, I am currently running the speech_transformer/agument_librispeech code with TensorFlow 2.4.1, and the earlier steps completed without problems. For "Training with validation" I used the sample command:

python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr_st/asr_training_args.yml,/path_to_data/asr_st/asr_validation_args.yml \
    --hparams_set speech_transformer_s \
    --model_dir /path_to_data/asr_st/asr_benchmark

Training turned out to be very slow, and on inspection it was running on the CPU rather than the GPU.

  1. How should I set up the command so that the GPU is used?
  2. When I set `--update_cycle n --batch_size 120000//n`, I get error: argument --batch_size: invalid int value: '120000//n'. Should the n here be replaced with a number?
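A quick way to check whether this TensorFlow build can see a GPU at all is the standard `tf.config` API (an empty list means TensorFlow is running CPU-only, typically because the CUDA libraries failed to load):

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"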

zhaocq-nlp commented 2 years ago

Hi,

  1. Are there any CUDA-related messages in the log? The CUDA version may not match.
  2. n needs to be replaced with the number of gradient-accumulation steps you want; see the command sketch below.
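
For example, with n = 2 the flags would look as follows (illustrative values; 120000 // 2 = 60000, so gradients from 2 sub-batches are accumulated before each parameter update, keeping the effective batch size at 120000):

python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr_st/asr_training_args.yml,/path_to_data/asr_st/asr_validation_args.yml \
    --hparams_set speech_transformer_s \
    --model_dir /path_to_data/asr_st/asr_benchmark \
    --update_cycle 2 \
    --batch_size 60000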
jannicaTan commented 2 years ago

> Hi,
>
>   1. Are there any CUDA-related messages in the log? The CUDA version may not match.
>   2. n needs to be replaced with the number of gradient-accumulation steps you want.

Hello, I noticed that my problem is similar to issue #56. I also checked my versions: CUDA is 11.2 and TensorFlow is 2.4.1. I tried switching to TensorFlow 2.5.0 or above, which is compatible with CUDA 11.2, but that triggered the problem from #54, so I switched back.
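One way to check which CUDA version an installed TensorFlow wheel was built against (`tf.sysconfig.get_build_info` is available from TF 2.3 onward) is:

python3 -c "import tensorflow as tf; print(tf.sysconfig.get_build_info()['cuda_version'])"

TF 2.4.x wheels were built against CUDA 11.0, while TF 2.5+ targets CUDA 11.2, which is consistent with the version conflict described above.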

The last lines of the log are:

. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
I0526 22:41:14.333542 140599115236032 configurable.py:296] Saving model configurations to directory: path_to_data/asr_st/asr_benchmark
2022-05-26 22:42:02.487136: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 2663055360 exceeds 10% of free system memory.
zhaocq-nlp commented 2 years ago

`W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 2663055360 exceeds 10% of free system memory` means you are running out of memory; try lowering the batch size.
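
A sketch of this suggestion combined with gradient accumulation, so the per-step memory footprint drops while the effective batch size stays at 120000 (values are illustrative):

python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr_st/asr_training_args.yml,/path_to_data/asr_st/asr_validation_args.yml \
    --hparams_set speech_transformer_s \
    --model_dir /path_to_data/asr_st/asr_benchmark \
    --update_cycle 4 \
    --batch_size 30000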

jannicaTan commented 2 years ago

> `W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 2663055360 exceeds 10% of free system memory` means you are running out of memory; try lowering the batch size.

I lowered the batch size and also added more GPUs, which solved the problem. Thanks!