peter-ni-noob opened 1 year ago
```shell
MODEL_NAME=eva_g_patch14
sz=336 batch_size=16 crop_pct=1.0
EVAL_CKPT=/path/to/eva_21k_1k_336px_psz14_ema_89p6.pt  # https://huggingface.co/BAAI/EVA/blob/main/eva_21k_1k_336px_psz14_ema_89p6.pt

DATA_PATH=/data_gs/imagenet NNODES=1 NODE_RANK=0 MASTER_ADDR=127.0.0.1 python -m torch.distributed.launch --nproc_per_node=7 --nnodes=$NNODES --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR --master_port=12355 --use_env run_class_finetuning.py \
    --data_path ${DATA_PATH}/train \
    --eval_data_path ${DATA_PATH}/val \
    --nb_classes 1000 \
    --data_set image_folder \
    --model ${MODEL_NAME} \
    --finetune ${EVAL_CKPT} \
    --input_size ${sz} \
    --batch_size ${batch_size} \
    --crop_pct ${crop_pct} \
    --no_auto_resume \
    --dist_eval \
    --eval \
    --enable_deepspeed
```

The code above is `eva.sh`. I have one machine with 7 GPUs, and that is how I configured it, but now it fails with:

```
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
```
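As a side note for anyone reproducing the `ncclSystemError` above: NCCL can print its own diagnostics if you enable its debug environment variables before launching (these are standard NCCL variables, not something specific to EVA):

```shell
# Ask NCCL to log detailed init/transport information to stderr,
# which usually pinpoints which socket/library call failed.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
```

With these set, rerunning the same launch command should emit `NCCL INFO`/`NCCL WARN` lines indicating the failing subsystem.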
```
(eva) root@nexus-nyz:~/EVA/EVA-01/eva# bash eva.sh
/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Traceback (most recent call last):
  File "/root/miniconda3/envs/eva/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/eva/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 212, in launch_agent
    master_addr, master_port = _get_addr_and_port(rdzv_parameters)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 167, in _get_addr_and_port
    master_addr, master_port = parse_rendezvous_endpoint(endpoint, default_port=-1)
  File "/root/miniconda3/envs/eva/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 102, in parse_rendezvous_endpoint
    raise ValueError(
ValueError: The hostname of the rendezvous endpoint ':12355' must be a dot-separated list of labels, an IPv4 address, or an IPv6 address.
```
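One observation on the `ValueError`: the rendezvous endpoint came through as `':12355'`, i.e. `$MASTER_ADDR` expanded to an empty string. A plausible cause (a guess from the traceback, not a confirmed diagnosis): in sh/bash, a temporary assignment placed on the same line as a command (`MASTER_ADDR=127.0.0.1 python ... --master_addr=$MASTER_ADDR`) is exported only to the child process; the shell expands `$MASTER_ADDR` in the argument list *before* the assignment takes effect, so it sees the old (empty) value. A minimal demonstration of that shell behavior, using a placeholder variable `FOO`:

```shell
#!/bin/sh
# FOO is a placeholder name, assumed to be unset beforehand.
unset FOO

# The temporary assignment IS exported to the child's environment, so the
# child (which expands $FOO itself, thanks to the single quotes) sees it:
FOO=hello sh -c 'echo "in child: $FOO"'      # prints: in child: hello

# But the launching shell expands $FOO in the argument list *before* the
# temporary assignment takes effect, so here the value comes out empty:
FOO=hello echo "expanded by shell: $FOO"     # prints: expanded by shell:
```

If that is what happened here, moving `MASTER_ADDR=127.0.0.1` onto its own line (or `export`-ing it) before the `python` invocation should make `--master_addr=$MASTER_ADDR` expand to the intended address.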