DuReader-Retrieval-Baseline: error when running on a single GPU

sunxiaojie99 commented 2 years ago

export CUDA_VISIBLE_DEVICES=0
TRAIN_SET="dureader-retrieval-baseline-dataset/train/dual.train.tsv"
MODEL_PATH="pretrained-models/ernie_base_1.0_twin_CN/params"
sh script/run_dual_encoder_train.sh $TRAIN_SET $MODEL_PATH 10 1

Running the commands above in the first step fails with the following error:

OSError: (External) CUBLAS error(7). [Hint: 'CUBLAS_STATUS_INVALID_VALUE'. An unsupported value or parameter was passed to the function (a negative vector size, for example). To correct: ensure that all the parameters being passed have valid values. ] (at /paddle/paddle/fluid/platform/cuda_helper.h:107)

Environment: cuDNN 7.6, CUDA 10.0.

sunxiaojie99 commented 2 years ago

With 2 GPUs, train.log shows the following output, and training does not run either:

INFO 2022-04-17 10:29:54,201 launch.py:311] Local processes completed.
/data/home/anaconda3/envs/Casrel/lib/python3.6/site-packages/paddle/fluid/clip.py:697: UserWarning: Caution! 'set_gradient_clip' is not recommended and may be deprecated in future! We recommend a new strategy: set 'grad_clip' when initializing the 'optimizer'. This method can reduce the mistakes, please refer to documention of 'optimizer'.
  warnings.warn("Caution! 'set_gradient_clip' is not recommended "
/data/home/anaconda3/envs/Casrel/lib/python3.6/site-packages/paddle/fluid/contrib/mixed_precision/decorator.py:447: UserWarning: The decorated optimizer has its own minimize method, but it will not be executed.
  "The decorated optimizer has its own minimize method, but it will not be executed."
/data/home/anaconda3/envs/Casrel/lib/python3.6/site-packages/paddle/fluid/incubate/fleet/collective/init.py:394: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
  "set use_hierarchical_allreduce=False since you only have 1 node."
[WARNING] 2022-04-17 10:29:42,526 [ init.py: 394]: set use_hierarchical_allreduce=False since you only have 1 node.

API is deprecated since 2.0.0 Please use FleetAPI instead. WIKI: https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler

server not ready, wait 3 sec to retry... not ready endpoints:['127.0.0.1:50132']
server not ready, wait 3 sec to retry... not ready endpoints:['127.0.0.1:50132']

quyingqi commented 2 years ago

For single-GPU training of the dual-encoder, you need to edit script/run_dual_encoder_train.sh and set use_cross_batch to False. (See the readme: "For single-gpu training, please turn off the option use_cross_batch in script/run_dual_encoder_train.sh.")
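As a rough sketch of that edit, the hypothetical helper below locates the option in a launch script and rewrites its value. It assumes the option is written as `--use_cross_batch <value>` (or with `=`), which is not confirmed in this thread; inspect the grep output first and edit by hand if the script uses a different form.

```shell
# Hypothetical helper (not part of RocketQA): turn off use_cross_batch in a script.
# Assumption: the option appears as "--use_cross_batch <value>" or "--use_cross_batch=<value>".
disable_cross_batch() {
    script_path="$1"
    # Show where the option is currently set; stop if it is not found.
    grep -n "use_cross_batch" "$script_path" || {
        echo "use_cross_batch not found in $script_path" >&2
        return 1
    }
    # Replace the value with False, keeping a .bak backup of the original file.
    sed -i.bak -E 's/(--use_cross_batch[ =]+)[A-Za-z]+/\1False/' "$script_path"
}
```

It could be called as `disable_cross_batch script/run_dual_encoder_train.sh` before rerunning the single-GPU command; the original file is kept next to it with a `.bak` suffix.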

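For multi-GPU training, the part of the log you posted is normal and contains no error message. Normally, after the line "server not ready, wait 3 sec to retry... not ready endpoints:['127.0.0.1:50132']" appears, training starts after a short wait. If other errors show up later, please post them as well.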
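To check whether anything other than the retry message has been logged, one option is to filter those lines out. This is only a small sketch: the function name is made up for illustration, and the log path is whatever your run writes train.log to.

```shell
# Hypothetical helper: print every log line except the "server not ready" retries,
# so that any later error or training output is easier to spot.
show_non_retry_lines() {
    grep -v "server not ready" "$1"
}
```

For example, `show_non_retry_lines train.log` (adjust the path to your setup) prints only the remaining lines; if nothing is printed, the log so far contains only retry messages.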

PaddlePaddle / RocketQA

DuReader-Retrieval-Baseline: error when running on a single GPU #24