Closed sunxiaojie99 closed 2 years ago
With 2 GPUs, train.log shows the following output, and training also fails to run:
INFO 2022-04-17 10:29:54,201] Local processes completed.
/data/home/anaconda3/envs/Casrel/lib/python3.6/site-packages/paddle/fluid/ UserWarning: Caution! 'set_gradient_clip' is not recommended and may be deprecated in future! We recommend a new strategy: set 'grad_clip' when initializing the 'optimizer'. This method can reduce the mistakes, please refer to documention of 'optimizer'.
warnings.warn("Caution! 'set_gradient_clip' is not recommended "
/data/home/anaconda3/envs/Casrel/lib/python3.6/site-packages/paddle/fluid/contrib/mixed_precision/ UserWarning: The decorated optimizer has its own minimize
method, but it will not be executed.
"The decorated optimizer has its own minimize
method, but it will not be executed."
/data/home/anaconda3/envs/Casrel/lib/python3.6/site-packages/paddle/fluid/incubate/fleet/collective/ DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
"set use_hierarchical_allreduce=False since you only have 1 node."
[WARNING] 2022-04-17 10:29:42,526 [ 394]: set use_hierarchical_allreduce=False since you only have 1 node.
API is deprecated since 2.0.0 Please use FleetAPI instead. WIKI:
server not ready, wait 3 sec to retry... not ready endpoints:[''] server not ready, wait 3 sec to retry... not ready endpoints:['']
For single-GPU training of the dual-encoder, you need to edit script/ and set use_cross_batch to False (see the README: "For single-gpu training, please turn off the option use_cross_batch in script/").
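For multi-GPU training, the part of the log you posted is normal and contains no error messages. Normally, after the line "server not ready, wait 3 sec to retry... not ready endpoints:['']" appears, training starts after a short wait. If further errors show up after that, please post them as well.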
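If the "server not ready" message keeps repeating, one thing worth checking is whether the worker endpoints are reachable at all. The snippet below is only a minimal sketch of that kind of retry loop using the Python standard library; it is not Paddle's own implementation, and the function names `endpoint_ready` and `wait_for_endpoints`, the 3-second interval, and the example endpoint in the usage note are assumptions for illustration.

```python
import socket
import time


def endpoint_ready(endpoint, timeout=1.0):
    """Return True if a TCP connection to 'host:port' succeeds."""
    host, _, port = endpoint.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError: refused or unreachable; ValueError: malformed endpoint.
        return False


def wait_for_endpoints(endpoints, retry_interval=3.0, max_wait=60.0):
    """Poll until every endpoint accepts connections or max_wait elapses.

    Returns the endpoints that are still not ready (empty list on success).
    """
    deadline = time.monotonic() + max_wait
    pending = list(endpoints)
    while True:
        pending = [ep for ep in pending if not endpoint_ready(ep)]
        if not pending or time.monotonic() >= deadline:
            return pending
        print(f"not ready, retrying in {retry_interval} sec: {pending}")
        time.sleep(retry_interval)
```

For example, `wait_for_endpoints(["127.0.0.1:6170"])` (a hypothetical address; use the endpoints printed in your own log) returns an empty list once the port accepts connections. If it never does, the problem is more likely a port or process issue than the training code itself.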
export CUDA_VISIBLE_DEVICES=0
TRAIN_SET="dureader-retrieval-baseline-dataset/train/dual.train.tsv"
MODEL_PATH="pretrained-models/ernie_base_1.0_twin_CN/params"
sh script/ $TRAIN_SET $MODEL_PATH 10 1
OSError: (External) CUBLAS error(7). [Hint: 'CUBLAS_STATUS_INVALID_VALUE'. An unsupported value or parameter was passed to the function (a negative vector size, for example). To correct: ensure that all the parameters being passed have valid values. ] (at /paddle/paddle/fluid/platform/cuda_helper.h:107)
Environment: cuDNN version 7.6, CUDA 10.0