Coobiw / MPP-LLaVA

Personal project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B MLLM with LLaVA-like training on an RTX 3090/4090 with 24 GB.

Distributed setup error #19

Status: Open · WeiminLee opened this issue 6 months ago

WeiminLee commented 6 months ago

The code keeps hanging at torch.distributed.init_process_group. How can I fix this?

Environment: single machine, multiple GPUs.

Environment variable settings:

    os.environ['RANK'] = '0'
    os.environ['WORLD_SIZE'] = '4'   # because I only want to use 4 of the GPUs
    os.environ['LOCAL_RANK'] = '0'
    os.environ['MASTER_ADDR'] = '127.0.0.1'  # address of rank 0
    os.environ['MASTER_PORT'] = '29500'      # any free port
    os.environ['NCCL_IB_DISABLE'] = "1"
    os.environ['NCCL_IBEXT_DISABLE'] = "1"

The code below keeps timing out. Which setting is wrong?

    args.dist_url = "env://"
    args.dist_backend = "nccl"

    torch.distributed.init_process_group(
        backend=args.dist_backend,
        init_method=args.dist_url,
        world_size=args.world_size,
        rank=args.rank,
        timeout=datetime.timedelta(seconds=10),  # allow auto-downloading and de-compressing
    )
    torch.distributed.barrier()

Coobiw commented 6 months ago

> os.environ['RANK'] = '0'
> os.environ['WORLD_SIZE'] = '4'   # because I only want to use 4 of the GPUs
> os.environ['LOCAL_RANK'] = '0'

Don't set these three variables.

If you only want to use four GPUs, just prefix the launch command with CUDA_VISIBLE_DEVICES=x,x,x,x.

If it still times out, also remove the two NCCL-related environment variables.
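
For reference, a minimal sketch of what this advice amounts to; the script name `train.py`, the GPU indices, and the 30-minute timeout are illustrative, not taken from the repo. Hard-coding RANK=0 and WORLD_SIZE=4 in a single process makes rank 0 wait for three peers that never connect, which is why init_process_group hangs until it times out; letting the launcher (e.g. torchrun) spawn the workers and fill in those variables avoids that.

```python
import datetime
import os

import torch
import torch.distributed as dist

# Launch example (script name is hypothetical):
#   CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 train.py
# torchrun exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT
# for every worker process, so the script should not overwrite them.

def init_distributed() -> int:
    # With init_method="env://", rank and world size are read from the
    # environment variables set by the launcher, so they need not be passed in.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        timeout=datetime.timedelta(minutes=30),  # illustrative; far longer than 10 s
    )
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # bind this process to its own GPU before the barrier
    dist.barrier()
    return local_rank


if __name__ == "__main__":
    local_rank = init_distributed()
    print(f"initialized rank {dist.get_rank()} / {dist.get_world_size()} on GPU {local_rank}")
```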