baaivision / EVA

EVA Series: Visual Representation Fantasies from BAAI
MIT License

The .sh script from "Evaluate the fine-tuned EVA (336px, patch_size=14) on ImageNet-1K val with a single node (click to expand)" cannot be executed. #118

Open peter-ni-noob opened 1 year ago

peter-ni-noob commented 1 year ago

(eva) root@nexus-nyz:~/EVA/EVA-01/eva# bash eva.sh
/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

  warnings.warn(
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Traceback (most recent call last):
  File "/root/miniconda3/envs/eva/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/eva/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 212, in launch_agent
    master_addr, master_port = _get_addr_and_port(rdzv_parameters)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 167, in _get_addr_and_port
    master_addr, master_port = parse_rendezvous_endpoint(endpoint, default_port=-1)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 102, in parse_rendezvous_endpoint
    raise ValueError(
ValueError: The hostname of the rendezvous endpoint ':12355' must be a dot-separated list of labels, an IPv4 address, or an IPv6 address.
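The endpoint ':12355' in the ValueError has nothing before the colon, which suggests that MASTER_ADDR was empty or unset when this run was launched. Below is a minimal Python sketch of a pre-launch check, assuming the launcher joins the address and port as host:port. The helper name build_endpoint is hypothetical and is not part of PyTorch or EVA.

```python
def build_endpoint(master_addr: str, master_port: int) -> str:
    """Return 'host:port', refusing an empty host (hypothetical helper)."""
    host = master_addr.strip()
    if not host:
        # An empty host is what would yield an endpoint such as ':12355'.
        raise ValueError("MASTER_ADDR is empty; on a single node 127.0.0.1 is a common choice")
    if not 0 < master_port < 65536:
        raise ValueError(f"invalid port: {master_port}")
    return f"{host}:{master_port}"
```

Calling build_endpoint("", 12355) raises immediately, whereas a non-empty address such as 127.0.0.1 passes the check.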

peter-ni-noob commented 1 year ago

MODEL_NAME=eva_g_patch14

sz=336
batch_size=16
crop_pct=1.0

EVAL_CKPT=/path/to/eva_21k_1k_336px_psz14_ema_89p6.pt # https://huggingface.co/BAAI/EVA/blob/main/eva_21k_1k_336px_psz14_ema_89p6.pt

DATA_PATH=/data_gs/imagenet
NNODES=1
NODE_RANK=0
MASTER_ADDR=127.0.0.1

python -m torch.distributed.launch --nproc_per_node=7 --nnodes=$NNODES --node_rank=$NODE_RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env run_class_finetuning.py \
    --data_path ${DATA_PATH}/train \
    --eval_data_path ${DATA_PATH}/val \
    --nb_classes 1000 \
    --data_set image_folder \
    --model ${MODEL_NAME} \
    --finetune ${EVAL_CKPT} \
    --input_size ${sz} \
    --batch_size ${batch_size} \
    --crop_pct ${crop_pct} \
    --no_auto_resume \
    --dist_eval \
    --eval \
    --enable_deepspeed

The code above is eva.sh. I have one machine with 7 GPUs, and this is how I configured it. But it now fails with the following error:

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
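The NCCL message itself suggests checking the NCCL warnings for the failure reason. One way to surface them is to rerun with NCCL's NCCL_DEBUG environment variable set to INFO. Below is a minimal Python sketch of preparing such an environment. The helper name with_nccl_debug is hypothetical, and this only makes the failure more visible; it does not fix it.

```python
import os
from typing import Dict, Optional


def with_nccl_debug(env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of env with NCCL_DEBUG=INFO added (hypothetical helper)."""
    # Copy so that the caller's mapping (or os.environ) is left untouched.
    new_env = dict(os.environ if env is None else env)
    new_env["NCCL_DEBUG"] = "INFO"
    return new_env
```

Assuming eva.sh is in the current directory, passing the result as env to subprocess.run(["bash", "eva.sh"], env=with_nccl_debug()) would rerun the script with NCCL logging enabled, which should show which system call or device is failing.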