Closed sunxiaojie99 closed 2 years ago
With 2 GPUs, train.log shows the following output, and training still does not run:
INFO 2022-04-17 10:29:54,201 launch.py:311] Local processes completed.
/data/home/anaconda3/envs/Casrel/lib/python3.6/site-packages/paddle/fluid/clip.py:697: UserWarning: Caution! 'set_gradient_clip' is not recommended and may be deprecated in future! We recommend a new strategy: set 'grad_clip' when initializing the 'optimizer'. This method can reduce the mistakes, please refer to documention of 'optimizer'.
warnings.warn("Caution! 'set_gradient_clip' is not recommended "
/data/home/anaconda3/envs/Casrel/lib/python3.6/site-packages/paddle/fluid/contrib/mixed_precision/decorator.py:447: UserWarning: The decorated optimizer has its own minimize
method, but it will not be executed.
"The decorated optimizer has its own minimize
method, but it will not be executed."
/data/home/anaconda3/envs/Casrel/lib/python3.6/site-packages/paddle/fluid/incubate/fleet/collective/init.py:394: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
"set use_hierarchical_allreduce=False since you only have 1 node."
[WARNING] 2022-04-17 10:29:42,526 [ init.py: 394]: set use_hierarchical_allreduce=False since you only have 1 node.
API is deprecated since 2.0.0 Please use FleetAPI instead. WIKI: https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler
server not ready, wait 3 sec to retry... not ready endpoints:['127.0.0.1:50132']
server not ready, wait 3 sec to retry... not ready endpoints:['127.0.0.1:50132']
For single-GPU training of the dual-encoder, you need to edit script/run_dual_encoder_train.sh and set use_cross_batch to False (see the readme: "For single-gpu training, please turn off the option use_cross_batch in script/run_dual_encoder_train.sh.")
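A minimal sketch of making that edit programmatically. It assumes the option appears in the script as `--use_cross_batch <value>`, which this thread does not show, so check the script and match its own spelling of the value first. `set_cross_batch` is a hypothetical helper, not part of the repository.

```python
import re


def set_cross_batch(script_text, value="false"):
    """Return script_text with the use_cross_batch value replaced.

    Assumption: the option is written as `--use_cross_batch <value>`
    (or `--use_cross_batch=<value>`). Raises if it is not found.
    """
    pattern = re.compile(r"(--use_cross_batch[ =]+)\S+")
    new_text, count = pattern.subn(lambda m: m.group(1) + value, script_text)
    if count == 0:
        raise ValueError("use_cross_batch option not found in script")
    return new_text


if __name__ == "__main__":
    path = "script/run_dual_encoder_train.sh"  # path taken from the readme note
    with open(path, encoding="utf-8") as f:
        updated = set_cross_batch(f.read())
    with open(path, "w", encoding="utf-8") as f:
        f.write(updated)
```

Editing the line by hand works just as well; the helper only guards against silently changing nothing.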
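For multi-GPU training, the log you posted is normal and contains no error message. Normally, after the line "server not ready, wait 3 sec to retry... not ready endpoints:['127.0.0.1:50132']" appears, training starts on its own after a short wait. If further errors show up afterwards, please post them as well.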
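To illustrate what the "server not ready" message means, here is a rough sketch of a retry loop that polls an endpoint until it accepts TCP connections. This is not Paddle's actual implementation; `wait_for_endpoints` and its parameters are hypothetical names chosen for the example.

```python
import socket
import time


def wait_for_endpoints(endpoints, retry_sec=3.0, max_retries=20):
    """Poll each "host:port" endpoint until a TCP connection succeeds.

    Returns the endpoints that are still unreachable (empty list on success).
    """
    pending = list(endpoints)
    for _ in range(max_retries):
        still_pending = []
        for ep in pending:
            host, port = ep.rsplit(":", 1)
            try:
                # A successful connect means the peer process is listening.
                with socket.create_connection((host, int(port)), timeout=1.0):
                    pass
            except OSError:
                still_pending.append(ep)
        pending = still_pending
        if not pending:
            break
        print(f"not ready endpoints: {pending}, retrying in {retry_sec} sec")
        time.sleep(retry_sec)
    return pending


if __name__ == "__main__":
    # 127.0.0.1:50132 is the endpoint named in the log above.
    print(wait_for_endpoints(["127.0.0.1:50132"], max_retries=3))
```

A few repetitions of the message are therefore expected while the other worker starts up; it only signals a problem if it never stops.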
export CUDA_VISIBLE_DEVICES=0
TRAIN_SET="dureader-retrieval-baseline-dataset/train/dual.train.tsv"
MODEL_PATH="pretrained-models/ernie_base_1.0_twin_CN/params"
sh script/run_dual_encoder_train.sh $TRAIN_SET $MODEL_PATH 10 1
Running the command above in the first step fails with this error:
OSError: (External) CUBLAS error(7). [Hint: 'CUBLAS_STATUS_INVALID_VALUE'. An unsupported value or parameter was passed to the function (a negative vector size, for example). To correct: ensure that all the parameters being passed have valid values. ] (at /paddle/paddle/fluid/platform/cuda_helper.h:107)
Environment: cuDNN version 7.6, CUDA 10.0
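When reporting an environment like this, it may also help to include which CUDA and cuDNN versions the installed Paddle package was built against, since the thread does not establish the cause of the CUBLAS error. The sketch below assumes a Paddle 2.x install, where `paddle.version.cuda()` and `paddle.version.cudnn()` are available; on older releases those fields are simply left as `None`. `paddle_build_info` is a hypothetical helper name.

```python
import importlib


def paddle_build_info():
    """Collect the Paddle version and the CUDA/cuDNN versions it was built with.

    Values stay None when Paddle is not installed or the attribute is
    missing (assumption: paddle.version.cuda/cudnn exist on Paddle 2.x).
    """
    info = {"paddle": None, "cuda": None, "cudnn": None}
    try:
        paddle = importlib.import_module("paddle")
    except ImportError:
        return info
    info["paddle"] = getattr(paddle, "__version__", None)
    version = getattr(paddle, "version", None)
    for key in ("cuda", "cudnn"):
        fn = getattr(version, key, None)
        if callable(fn):
            info[key] = fn()
    return info


if __name__ == "__main__":
    for name, value in paddle_build_info().items():
        print(f"{name}: {value}")
```

Comparing this output with the local CUDA 10.0 / cuDNN 7.6 install shows whether the two match.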