PKU-YuanGroup / MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models
https://arxiv.org/abs/2401.15947
Apache License 2.0

[Usage] The training always gets stuck after formatting inputs #41

Closed detectRecog closed 6 months ago

detectRecog commented 6 months ago

Describe the issue

Issue: During pretraining or finetuning, training always gets stuck after the log line "Formatting inputs...Skip in lazy mode". Every time this happens I have to force a shutdown of my GPU server because it becomes unresponsive: neither the GUI nor SSH responds at all.

Command (scripts/v1/phi2/pretrain.sh):

deepspeed --num_gpus=2 moellava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path microsoft/phi-2 \
    --version plain \
    --data_path ${JSON_FOLDER}/llava_image_.json \
    --image_folder ${IMAGE_FOLDER} \
    --image_tower google/siglip-so400m-patch14-384 \
    --image_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ./checkpoints/llavaphi-2.7b-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 0 \
    --lazy_preprocess True \
    --cache_dir ${CACHE_FOLDER}

When using zero2.json, the log stops here:

    (mm_projector): build_projector(
      (image_spatial_proj): Sequential(
        (0): Linear(in_features=1024, out_features=2560, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=2560, out_features=2560, bias=True)
      )
      (video_patch_proj): Identity()
      (video_spatial_proj): Identity()
      (video_temproal_proj): Identity()
      (video_global_proj): Identity()
    )
  )
  (lm_head): Linear(in_features=2560, out_features=51200, bias=False)
)
Formatting inputs...Skip in lazy mode

When using zero2_offload.json, the log stops here:

    (mm_projector): build_projector(
      (image_spatial_proj): Sequential(
        (0): Linear(in_features=1024, out_features=2560, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=2560, out_features=2560, bias=True)
      )
      (video_patch_proj): Identity()
      (video_spatial_proj): Identity()
      (video_temproal_proj): Identity()
      (video_global_proj): Identity()
    )
  )
  (lm_head): Linear(in_features=2560, out_features=51200, bias=False)
)
Formatting inputs...Skip in lazy mode
Using /home/xxx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Creating extension directory /home/xxx/.cache/torch_extensions/py310_cu117/cpu_adam...
Using /home/xxx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/xxx/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -D ...
[2/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam ...
[3/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam ...
[4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o - ...
Loading extension module cpu_adam...
Time to load cpu_adam op: 39.41885256767273 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 39.50555872917175 seconds

I tried removing the folders under ~/.cache/torch_extensions, but the hang still happens.

My machine has 10x RTX 3090 GPUs and 512 GB to 1 TB of RAM. In the command above I only try to use 2 cards. Since the machine freezes every time I start training, I have no idea what is wrong. Please help~

The environment exactly follows the installation instructions, except for two additional commands: pip install deepspeed -U and pip install accelerate -U. I ran these two upgrades while trying to solve this same problem.

Screenshots:

[screenshot attached: 2024-02-20, 8:23 AM]
LinB203 commented 6 months ago

Does inference work correctly? https://github.com/PKU-YuanGroup/MoE-LLaVA?tab=readme-ov-file#cli-inference

detectRecog commented 6 months ago

@LinB203 The inference code works fine. I tried "LanguageBind/MoE-LLaVA-Phi2-2.7B-4e".

LinB203 commented 6 months ago

Could you try deepspeed --include localhost:0,1 moellava/train/train_mem.py to specify 2 GPUs? Or just use 1 GPU: deepspeed --include localhost:0 moellava/train/train_mem.py? What is your deepspeed version? Is it 0.9.5?
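
For reference, a quick way to confirm which deepspeed (and accelerate) version is actually installed in the training environment is shown below; these are standard pip/Python commands, nothing specific to MoE-LLaVA:

# Show the installed versions of the relevant packages.
pip show deepspeed accelerate | grep -E "^(Name|Version)"

# Or query deepspeed directly.
python -c "import deepspeed; print(deepspeed.__version__)"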

detectRecog commented 6 months ago

When I used one card, as in deepspeed --include localhost:0 moellava/train/train_mem.py, training started like magic.

However, when I switched to multiple cards (2, 4, or 8), e.g. deepspeed --include localhost:0,1 moellava/train/train_mem.py, it gets stuck after Formatting inputs...Skip in lazy mode without any further logs.

I tested two deepspeed versions, 0.9.5 and 0.13.2; the result is the same. @LinB203

detectRecog commented 6 months ago

After an exhaustive search of similar problems, I found that adding export NCCL_P2P_DISABLE=1 solves the problem!
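
For anyone hitting the same hang, a minimal sketch of the workaround: export the variable in the shell (or add it near the top of scripts/v1/phi2/pretrain.sh) before the deepspeed launch; the two-GPU launch line below is the one from this thread, with the remaining arguments unchanged from the pretrain command above.

# Disable NCCL peer-to-peer (P2P) GPU-to-GPU transport before launching.
export NCCL_P2P_DISABLE=1

# Same launch as above; only the GPU selection and the new env variable differ.
deepspeed --include localhost:0,1 moellava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    ...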

However, I still wonder: 1. Why does this solve the problem? 2. Does this environment variable affect training efficiency or performance?

What do you think? @LinB203

LinB203 commented 6 months ago

I have had this problem on other machines and solved it with this method. The environment variable simply switches the mode NCCL uses for multi-GPU communication, which is consistent with your runs working on a single GPU but failing on multiple GPUs. It does not affect performance.
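
If you want to double-check whether P2P is really the culprit on a given machine, two standard diagnostics (general NCCL debugging practice, not something from this thread) are:

# Print the GPU interconnect topology; the matrix shows how each GPU pair is linked (NV#, PIX, PHB, SYS, ...).
nvidia-smi topo -m

# Relaunch with verbose NCCL logging to see which transport (P2P, shared memory, ...) is actually selected.
NCCL_DEBUG=INFO deepspeed --include localhost:0,1 moellava/train/train_mem.py ...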

detectRecog commented 6 months ago

OK, thanks a lot! I hope my experience can help others!