gaoyuan211 commented 3 years ago

When I run the EVA script, it hangs and never proceeds. Here is the log:

root@cfea9da46cdd:/mnt/EVA/src/scripts# bash infer_enc_dec_interactive.sh
/opt/conda/bin/deepspeed --num_nodes 1 --num_gpus 1 --master_port 4586 --hostfile /mnt/EVA/src/configs/host_files/hostfile /mnt/EVA/src/eva_interactive.py --model-config /mnt/EVA/src/configs/model/eva_model_config.json --model-parallel-size 1 --load /mnt/eva2/ --distributed-backend nccl --no-load-optim --weight-decay 1e-2 --clip-grad 1.0 --tokenizer-path /mnt/EVA/src/bpe_dialog_new --temperature 0.7 --top_k 0 --top_p 0.9 --fp16 --deepspeed --deepspeed_config /mnt/EVA/src/configs/deepspeed/eva_ds_config.json
Warning: Permanently added '175.24.121.148' (ECDSA) to the list of known hosts.
root@175.24.121.148's password:
/etc/profile.d/lang.sh: line 19: warning: setlocale: LC_CTYPE: cannot change locale (C.UTF-8)
[2021-09-14 16:01:54,434] [INFO] [runner.py:283:main] Using IP address of 172.17.0.8 for node 175.24.121.148
[2021-09-14 16:01:54,434] [INFO] [runner.py:355:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyIxNzUuMjQuMTIxLjE0OCI6IFswXX0= --master_addr=172.17.0.8 --master_port=4586 /mnt/EVA/src/eva_interactive.py --model-config /mnt/EVA/src/configs/model/eva_model_config.json --model-parallel-size 1 --load /mnt/eva2/ --distributed-backend nccl --no-load-optim --weight-decay 1e-2 --clip-grad 1.0 --tokenizer-path /mnt/EVA/src/bpe_dialog_new --temperature 0.7 --top_k 0 --top_p 0.9 --fp16 --deepspeed --deepspeed_config /mnt/EVA/src/configs/deepspeed/eva_ds_config.json
[2021-09-14 16:01:55,162] [INFO] [launch.py:71:main] 0 NCCL_INCLUDE_DIR /usr/include
[2021-09-14 16:01:55,162] [INFO] [launch.py:71:main] 0 NCCL_VERSION 2.7.8
[2021-09-14 16:01:55,162] [INFO] [launch.py:71:main] 0 NCCL_LIBRARY /usr/lib/x86_64-linux-gnu
[2021-09-14 16:01:55,162] [INFO] [launch.py:78:main] WORLD INFO DICT: {'175.24.121.148': [0]}
[2021-09-14 16:01:55,162] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-09-14 16:01:55,162] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'175.24.121.148': [0]})
[2021-09-14 16:01:55,162] [INFO] [launch.py:100:main] dist_world_size=1
[2021-09-14 16:01:55,162] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0
Loading Model ...
WARNING: No training data specified
using world size: 1 and model-parallel size: 1
using dynamic loss scaling
[2021-09-14 16:01:56,117] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl

The log stops here and nothing further runs; eventually the process exits abnormally.

Here is how I run the script. Host src path: /home/EVA. Host model path: /home/eva2.

sudo docker run --gpus all -ti -v /home:/mnt gyxthu17/eva:1.2 /bin/bash

/mnt/src# bash infer_enc_dec_interactive.sh

Relevant settings from infer_enc_dec_interactive.sh:

WORKING_DIR=/mnt/EVA

# Change for multinode config
MP_SIZE=1

NUM_WORKERS=1
NUM_GPUS_PER_WORKER=1

CONFIG_PATH="${WORKING_DIR}/src/configs/model/eva_model_config.json"
CKPT_PATH="/mnt/eva2/"
DS_CONFIG="${WORKING_DIR}/src/configs/deepspeed/eva_ds_config.json"
TOKENIZER_PATH="${WORKING_DIR}/src/bpe_dialog_new"
HOST_FILE="${WORKING_DIR}/src/configs/host_files/hostfile"

TEMP=0.7
# If TOPK/TOPP are 0 it defaults to greedy sampling, top-k will also override top-p
TOPK=0
TOPP=0.9

OPTS=""
OPTS+=" --model-config ${CONFIG_PATH}"
OPTS+=" --model-parallel-size ${MP_SIZE}"
OPTS+=" --load ${CKPT_PATH}"
OPTS+=" --distributed-backend nccl"
OPTS+=" --no-load-optim"
OPTS+=" --weight-decay 1e-2"
OPTS+=" --clip-grad 1.0"
OPTS+=" --tokenizer-path ${TOKENIZER_PATH}"
OPTS+=" --temperature ${TEMP}"
OPTS+=" --top_k ${TOPK}"
OPTS+=" --top_p ${TOPP}"

Please help me locate the problem, thank you very much. Also, what output or interface should I expect when it runs normally?

t1101675 commented 3 years ago

Thanks for your interest! Could you first confirm that torch works inside the docker container, that it can use the GPU, and that the container has an ssh service?
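The three checks above could be scripted roughly as follows. This is only a sketch: `check_environment` is a hypothetical helper name, and probing `127.0.0.1` on port 22 is an assumption about where the container's ssh service would listen. Run it inside the container.

```python
import socket


def check_environment(ssh_host="127.0.0.1", ssh_port=22, timeout=2.0):
    """Return a small report on torch, GPU and ssh availability."""
    report = {}

    # 1) Can torch be imported, and 2) can it see a GPU?
    try:
        import torch
    except ImportError:
        report["torch_importable"] = False
        report["cuda_available"] = False
        report["gpu_count"] = 0
    else:
        report["torch_importable"] = True
        report["cuda_available"] = bool(torch.cuda.is_available())
        report["gpu_count"] = int(torch.cuda.device_count())

    # 3) Is something accepting TCP connections on the ssh port?
    #    (Assumption: the ssh service listens on ssh_host:ssh_port.)
    try:
        with socket.create_connection((ssh_host, ssh_port), timeout=timeout):
            report["ssh_port_open"] = True
    except OSError:
        report["ssh_port_open"] = False

    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

If `ssh_port_open` is false, the container itself is probably not running an ssh service, even if an outgoing ssh connection from the container to the host works.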

t1101675 commented 3 years ago

When it runs normally, the final output looks like the following, ending with an input prompt:

t1101675 commented 3 years ago

Also, please file future issues in EVA's original repository, https://github.com/thu-coai/EVA. Issues get faster replies there, and new versions are published there first. Thanks!


gaoyuan211 commented 3 years ago

Confirmed. All of those work fine.


lianlipen commented 3 years ago

1. I ran torch.cuda.device_count() inside the container, and the result is 1.
2. I also connected from the container to the host over ssh, and that works too.

gaoyuan211 commented 3 years ago

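1. I ran torch.cuda.device_count() inside the container, and the result is 1.
2. I also connected from the container to the host over ssh, and that works too.

Please keep helping us locate the problem.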
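Since the log stops right after `Initializing torch distributed with backend: nccl`, one way to narrow things down might be to try that initialization on its own, outside the EVA script and the deepspeed launcher. The sketch below is not EVA's code: `try_init` and `find_free_port` are hypothetical helpers, and it rendezvouses on `127.0.0.1` rather than on the `--master_addr` shown in the log, which is an assumption made so the test stays inside the container. It only exercises the rendezvous and process-group creation, not an actual collective.

```python
import datetime
import socket


def find_free_port():
    """Ask the OS for an unused local TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def try_init(backend="nccl", timeout_s=30):
    """Create and tear down a single-process group; return a status string."""
    try:
        import torch.distributed as dist
    except ImportError:
        return "torch not installed"
    if not dist.is_available():
        return "torch.distributed not available in this build"

    # Assumption: rendezvous over loopback, not the launcher's master address.
    init_method = f"tcp://127.0.0.1:{find_free_port()}"
    try:
        dist.init_process_group(
            backend=backend,
            init_method=init_method,
            rank=0,
            world_size=1,
            timeout=datetime.timedelta(seconds=timeout_s),
        )
    except Exception as exc:
        return f"failed: {exc!r}"
    dist.destroy_process_group()
    return "ok"


if __name__ == "__main__":
    for backend in ("gloo", "nccl"):
        print(backend, "->", try_init(backend))
```

If both backends report `ok` here but the full script still hangs, the address and port chosen by the launcher (172.17.0.8:4586 in the log above) would be the next thing to look at.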

BAAI-WuDao / EVA

EVA script problem: the script fails to run, need help locating the problem, thanks #4
