BAAI-WuDao / EVA

Other
25 stars 4 forks

No training data specified #5

Open lianlipen opened 3 years ago

lianlipen commented 3 years ago

Command: root@19ddc6805219:/mnt/src/scripts# bash infer_enc_dec_interactive.sh

Output:

/opt/conda/bin/deepspeed --num_nodes 1 --num_gpus 1 --master_port 4586 --hostfile /mnt/src/configs/host_files/hostfile /mnt/src/eva_interactive.py --model-config /mnt/src/configs/model/eva_model_config.json --model-parallel-size 1 --load /mnt/src/eva2 --distributed-backend nccl --no-load-optim --weight-decay 1e-2 --clip-grad 1.0 --tokenizer-path /mnt/src/bpe_dialog_new --temperature 0.7 --top_k 0 --top_p 0.9 --fp16 --deepspeed --deepspeed_config /mnt/src/configs/deepspeed/eva_ds_config.json
Warning: Permanently added '175.24.121.148' (ECDSA) to the list of known hosts.
root@175.24.121.148's password:
/etc/profile.d/lang.sh: line 19: warning: setlocale: LC_CTYPE: cannot change locale (C.UTF-8)
[2021-09-15 01:39:18,377] [INFO] [runner.py:283:main] Using IP address of 172.17.0.8 for node 175.24.121.148
[2021-09-15 01:39:18,377] [INFO] [runner.py:355:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyIxNzUuMjQuMTIxLjE0OCI6IFswXX0= --master_addr=172.17.0.8 --master_port=4586 /mnt/src/eva_interactive.py --model-config /mnt/src/configs/model/eva_model_config.json --model-parallel-size 1 --load /mnt/src/eva2 --distributed-backend nccl --no-load-optim --weight-decay 1e-2 --clip-grad 1.0 --tokenizer-path /mnt/src/bpe_dialog_new --temperature 0.7 --top_k 0 --top_p 0.9 --fp16 --deepspeed --deepspeed_config /mnt/src/configs/deepspeed/eva_ds_config.json
[2021-09-15 01:39:19,074] [INFO] [launch.py:71:main] 0 NCCL_INCLUDE_DIR /usr/include
[2021-09-15 01:39:19,074] [INFO] [launch.py:71:main] 0 NCCL_VERSION 2.7.8
[2021-09-15 01:39:19,074] [INFO] [launch.py:71:main] 0 NCCL_LIBRARY /usr/lib/x86_64-linux-gnu
[2021-09-15 01:39:19,074] [INFO] [launch.py:78:main] WORLD INFO DICT: {'175.24.121.148': [0]}
[2021-09-15 01:39:19,074] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-09-15 01:39:19,074] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'175.24.121.148': [0]})
[2021-09-15 01:39:19,075] [INFO] [launch.py:100:main] dist_world_size=1
[2021-09-15 01:39:19,075] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0
Loading Model ...
WARNING: No training data specified
using world size: 1 and model-parallel size: 1
using dynamic loss scaling
[2021-09-15 01:39:19,975] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl

It looks like the model was not loaded at this point. What is going on? Thanks!

lianlipen commented 3 years ago

Execution reaches the line deepspeed.init_distributed() and then hangs there. What is going on?
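
For anyone reading the launcher log in the first comment: the `--world_info=eyIxNzUuMjQuMTIxLjE0OCI6IFswXX0=` argument is an encoded copy of the node-to-GPU mapping that the launcher then prints as `WORLD INFO DICT`. The sketch below decodes it with the Python standard library, assuming the value is base64-encoded JSON (the decoded result matches the log, which supports that assumption). The helper name `decode_world_info` is made up for this example and is not part of DeepSpeed or EVA. It does not by itself explain the hang; it only confirms which host and GPU index the launcher was told to use.

```python
import base64
import json


def decode_world_info(encoded: str) -> dict:
    """Decode a --world_info value, assuming it is base64-encoded JSON.

    Hypothetical helper for illustration only; not a DeepSpeed API.
    """
    raw = base64.urlsafe_b64decode(encoded)  # bytes of the JSON text
    return json.loads(raw)                   # mapping: hostname -> list of GPU indices


if __name__ == "__main__":
    # Value copied verbatim from the launcher command in the log above.
    world_info = decode_world_info("eyIxNzUuMjQuMTIxLjE0OCI6IFswXX0=")
    print(world_info)  # {'175.24.121.148': [0]}
```

The decoded mapping names 175.24.121.148 with GPU 0 as the only worker, while the same log line passes `--master_addr=172.17.0.8`. Whether that address is reachable from the process that calls `deepspeed.init_distributed()` may be worth checking, though nothing in this thread confirms it as the cause.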