OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0
12.76k stars 894 forks source link

๐Ÿ’ก [REQUEST]ๅพฎ่ฐƒ่ฟ›ๅบฆ100%ๅŽๆŠฅ้”™ Watchdog caught collective operation timeout: WorkNCCL #533

Closed xiaobaixue closed 2 months ago

xiaobaixue commented 3 months ago

่ฎญ็ปƒ่„šๆœฌ

!/bin/bash

GPUS_PER_NODE=2 NNODES=1 NODE_RANK=0 MASTER_ADDR=localhost MASTER_PORT=6001

MODEL="/root/MiniCPM-V/pretrained_weights/MiniCPM-V-2_6" # or openbmb/MiniCPM-V-2, openbmb/MiniCPM-Llama3-V-2_5

DATA="/root/MiniCPM-V/exam/mllm_demo.json" EVAL_DATA="/root/MiniCPM-V/exam/mllm_demo.json" LLM_TYPE="qwen2" export NCCL_P2P_DISABLE=1 export NCCL_IB_DISABLE=1

DISTRIBUTED_ARGS=" --nproc_per_node $GPUS_PER_NODE \ --nnodes $NNODES \ --node_rank $NODE_RANK \ --master_addr $MASTER_ADDR \ --master_port $MASTER_PORT " torchrun $DISTRIBUTED_ARGS finetune.py \ --model_name_or_path $MODEL \ --llm_type $LLM_TYPE \ --data_path $DATA \ --eval_data_path $EVAL_DATA \ --remove_unused_columns false \ --label_names "labels" \ --prediction_loss_only false \ --bf16 false \ --bf16_full_eval false \ --fp16 true \ --fp16_full_eval true \ --do_train \ --do_eval \ --tune_vision true \ --tune_llm false \ --use_lora true \ --lora_target_modules "llm..*layers.\d+.self_attn.(q_proj|k_proj|v_proj|o_proj)" \ --model_max_length 2048 \ --max_slice_nums 9 \ --max_steps 10 \ --eval_steps 1000 \ --output_dir output/output__lora \ --logging_dir output/output_lora \ --logging_strategy "steps" \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 200 \ --evaluation_strategy "steps" \ --save_strategy "steps" \ --save_steps 1000 \ --save_total_limit 10 \ --learning_rate 1e-6 \ --weight_decay 0.1 \ --adam_beta2 0.95 \ --warmup_ratio 0.01 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --gradient_checkpointing true \ --deepspeed ds_config_zero3_offload.json \ --report_to "tensorboard" # wandb

ๆŠฅ้”™ไฟกๆฏ

100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 20/20 [11:04<00:00, 33.20s/it] [rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=52989, OpType=_ALLGATHER_BASE, NumelIn=181190315, NumelOut=543570945, Timeout(ms)=1800000) ran for 1800056 milliseconds before timing out. [rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 52989, last enqueued NCCL work: 52989, last completed NCCL work: 52988. [rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=52989, OpType=_ALLGATHER_BASE, NumelIn=181190315, NumelOut=543570945, Timeout(ms)=1800000) ran for 1800056 milliseconds before timing out.

zhaoyangwei123 commented 3 months ago

@xiaobaixue ๆ‚จๅฅฝ๏ผŒ่ฏท้—ฎๆ‚จ่ฟ™้‡Œ็š„ๆ•ฐๆฎ้›†DATA="/root/MiniCPM-V/exam/mllm_demo.json" EVAL_DATA="/root/MiniCPM-V/exam/mllm_demo.json"ๆ˜ฏๆ‚จ่‡ชๅทฑๅˆถไฝœ็š„่ฟ˜ๆ˜ฏๅฎ˜ๆ–น็š„ๅ‘ข๏ผŸ๏ผŒๆˆ‘็œ‹ๅฎ˜ๆ–นๅชๆ˜ฏๅœจswiftๅพฎ่ฐƒ็š„ๆ—ถๅ€™็ป™ไบ†coco-en-mini ๆ•ฐๆฎ้›†๏ผŒไฝ†ๆ˜ฏ่ฏฅๆ•ฐๆฎ้›†ไธ‹่ฝฝไธ‹ๆฅไนŸไธๆ˜ฏjsonๆ ผๅผ๏ผŒไธŽfinetune่ฆๆฑ‚็š„jsonๆ ผๅผไผผไนŽไธไธ€ๆ ทๅ‘ข

xiaobaixue commented 3 months ago

่‡ชๅทฑ็š„ๆ•ฐๆฎ้›†๏ผŒๆŒ‰็…งๅฎ˜ๆ–น็š„่ฎญ็ปƒๆ•™็จ‹่‡ชๅทฑๅˆ›ๅปบ็š„ https://modelbest.feishu.cn/wiki/HvfLwYzlIihqzXkmeCdczs6onmd

xiaobaixue commented 3 months ago

ๅทฒ็ป่งฃๅ†ณไบ†๏ผŒไธคไธช4090็š„ๅกไธ่กŒ๏ผŒๆขไบ†ไธคไธช3090็š„ๅกๅฐฑๅฅฝไฝฟไบ†๏ผŒ่Žซๅๅ…ถๅฆ™ใ€‚