Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

stuck on torch.distributed.barrier() #195

Open bibibabibo26 opened 2 months ago

bibibabibo26 commented 2 months ago

The message is:

```
cd /amax/yt26/VCM/LLaMA2-Accessory ; /amax/yt26/.conda/envs/accessory/bin/python /amax/yt26/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 58291 -- /amax/yt26/.conda/envs/accessory/bin/torchrun --master_port 1112 --nproc_per_node 2 /amax/yt26/VCM/LLaMA2-Accessory/accessory/main_finetune.py --output_dir output_dir/finetune/mm/alpacaLlava_llamaQformerv2_7B --epochs 3 --warmup_epochs 0.2 --batch_size 4 --accum_iter 2 --num_workers 16 --max_words 512 --lr 0.00003 --min_lr 0.000005 --clip_grad 2 --weight_decay 0.02 --data_parallel fsdp --model_parallel_size 2 --checkpointing --llama_type llama_qformerv2_peft --llama_config checkpoint/mm/alpacaLlava_llamaQformerv2/7B_params.json accessory/configs/model/finetune/sg/llamaPeft_normBiasLora.json --tokenizer_path checkpoint/mm/alpacaLlava_llamaQformerv2/tokenizer.model --pretrained_path checkpoint/mm/alpacaLlava_llamaQformerv2 --pretrained_type consolidated --data_config accessory/configs/data/finetune/mm/alpaca_llava_copy.yaml
```

Output:

```
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/amax/yt26/VCM/LLaMA2-Accessory/accessory/main_finetune.py:41: UserWarning: cannot import FusedAdam from apex, use torch AdamW instead
  warnings.warn("cannot import FusedAdam from apex, use torch AdamW instead")
/amax/yt26/VCM/LLaMA2-Accessory/accessory/main_finetune.py:41: UserWarning: cannot import FusedAdam from apex, use torch AdamW instead
  warnings.warn("cannot import FusedAdam from apex, use torch AdamW instead")
| distributed init (rank 0): env://, gpu 0
| distributed init (rank 1): env://, gpu 1
```

The program gets stuck at this point. When I run it under the debugger, it hangs in misc.py at line 145, on torch.distributed.barrier(). How can I deal with that?
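To isolate the problem, it may help to first check whether basic NCCL communication works between the two GPUs outside the training script. Below is a minimal sketch, assuming a single-node, two-GPU setup launched with torchrun; the filename `dist_check.py` is just an example and is not part of LLaMA2-Accessory:

```python
# dist_check.py -- minimal NCCL sanity check (hypothetical example, not from the repo).
# Launch with the same torchrun settings as the training run, e.g.:
#   NCCL_DEBUG=INFO torchrun --master_port 1112 --nproc_per_node 2 dist_check.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun exports RANK / LOCAL_RANK / WORLD_SIZE for each process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # The same collective the training script hangs on.
    dist.barrier()

    # A tiny all_reduce to confirm the GPUs can actually exchange data.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()} ok, all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this small test also hangs at the barrier, the issue is likely in the environment rather than in LLaMA2-Accessory itself; setting `NCCL_DEBUG=INFO` for more logging, or trying `NCCL_P2P_DISABLE=1` on machines where peer-to-peer transfers between the GPUs misbehave, are common next steps.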