Cuda failure 'invalid argument'

FSet89 commented 2 months ago

I'm running finetune_onevision.sh to finetune on my dataset and I get this error:

Traceback (most recent call last):
  File "/home/ubuntu/LLaVA-NeXT/llava/train/train_mem.py", line 4, in <module>
    train()
  File "/home/ubuntu/LLaVA-NeXT/llava/train/train.py", line 1672, in train
    trainer.train()
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1806, in train
    return inner_training_loop(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2150, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 3077, in training_step
    self.accelerator.backward(loss)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/accelerate/accelerator.py", line 2151, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
    self.engine.backward(loss, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1132, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1483, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1224, in reduce_independent_p_g_buckets_and_remove_grads
    self.reduce_and_partition_ipg_grads()
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1274, in reduce_and_partition_ipg_grads
    grad_partitions = self.__avg_scatter_grads(self.params_in_ipg_bucket)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1343, in __avg_scatter_grads
    grad_partitions_for_rank = reduce_scatter_coalesced(full_grads_for_rank, self.dp_process_group)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/comm/coalesced_collectives.py", line 128, in reduce_scatter_coalesced
    _torch_reduce_scatter_fn(tensor_partition_flat_buffer,
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/comm/coalesced_collectives.py", line 23, in _torch_reduce_scatter_fn
    return instrument_w_nvtx(dist.reduce_scatter_fn)(output_tensor, input_tensor, group=group, async_op=False)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 257, in reduce_scatter_fn
    return reduce_scatter_tensor(output_tensor,
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 289, in reduce_scatter_tensor
    return cdb.reduce_scatter_tensor(output_tensor=output_tensor,
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 263, in reduce_scatter_tensor
    return self.reduce_scatter_function(output_tensor,
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3375, in reduce_scatter_tensor
    work = group._reduce_scatter_base(output, input, opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'

This is the modified script:

export OMP_NUM_THREADS=8
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=ens5
export NCCL_DEBUG=INFO
# TEST
export RANK=0
export PORT=29401
export NNODES=1
export NUM_GPUS=8
export ADDR=0.0.0.0

LLM_VERSION="Qwen/Qwen2-7B-Instruct" 

# for 7b model we recommend bs=1, accum=2, 16 nodes, 128 gpus, lr=1e-5, warmup=0.03
# for 72b model we recommend bs=1, accum=1, 32 nodes, 256 gpus, lr=1e-5, warmup=0.03
LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
VISION_MODEL_VERSION="google/siglip-so400m-patch14-384"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"

############### Pretrain ################

PROMPT_VERSION="qwen_1_5"

BASE_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain"
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"

# TEST
MID_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-mlp2x_gelu-pretrain_blip558k_plain_MID"

CKPT_PATH=$LLM_VERSION # this could also be the previous stage checkpoint

ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${NUM_GPUS}" --nnodes="${NNODES}" --node_rank="${RANK}" --master_addr="${ADDR}" --master_port="${PORT}" \
    llava/train/train_mem.py \
    --deepspeed scripts/zero3.json \
    --model_name_or_path ${CKPT_PATH} \
    --version ${PROMPT_VERSION} \
    --data_path /home/ubuntu/llava_finetuning/finetuning.yaml \
    --image_folder /home/ubuntu/llava_finetuning \
    --video_folder /home/ubuntu/llava_finetuning \
    --pretrain_mm_mlp_adapter="/home/ubuntu/LLaVA-NeXT/llava/checkpoints/projectors/${BASE_RUN_NAME}/mm_projector.bin" \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --mm_vision_tower_lr=2e-6 \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --group_by_modality_length True \
    --image_aspect_ratio anyres_max_9 \
    --image_grid_pinpoints  "(1x1),...,(6x6)" \
    --mm_patch_merge_type spatial_unpad \
    --bf16 True \
    --run_name $MID_RUN_NAME \
    --output_dir "checkpoints/${MID_RUN_NAME}" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 32768 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True \
    --frames_upbound 32

choiszt commented 2 months ago

Hi, have you solved this problem?

FSet89 commented 2 months ago

No, I switched to LLaVA, where I didn't encounter it. However, I hope they fix it.

OBJECT907 commented 2 months ago

> No, I switched to LLaVA, where I didn't encounter it. However, I hope they fix it.

What do you mean by "switched to"? Did you install the LLaVA training package instead?

FSet89 commented 2 months ago

Yes, I'm using that repo until this problem is identified and fixed.

choiszt commented 2 months ago

I have set the following:

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
export NCCL_SHM_DISABLE=1

It temporarily works for me @FSet89
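
The interface name passed to NCCL_SOCKET_IFNAME is machine-specific (eth0 here, ens5 in the original script), so it may be worth confirming that the name exists on your host before exporting it. A minimal sketch, assuming a Linux host that lists interfaces under /sys/class/net; `iface_exists` is a hypothetical helper name, not part of the repository:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named interface is listed in the sysfs directory.
# The optional second argument only exists so the check can be pointed at another directory.
iface_exists() {
    [ -e "${2:-/sys/class/net}/$1" ]
}

# Fall back to eth0 (the value used in this comment) if nothing is exported yet.
IFNAME="${NCCL_SOCKET_IFNAME:-eth0}"

if iface_exists "${IFNAME}"; then
    echo "interface ${IFNAME} found"
else
    echo "interface ${IFNAME} not found; available interfaces:"
    ls /sys/class/net 2>/dev/null || true
fi
```

If the name is not found, pick one of the listed interfaces instead. This only checks the name; it does not show whether the interface is the cause of the error.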

OBJECT907 commented 2 months ago

> I have set the following:
>
> export NCCL_P2P_DISABLE=1
> export NCCL_IB_DISABLE=1
> export NCCL_SOCKET_IFNAME=eth0
> export NCCL_SHM_DISABLE=1
>
> It temporarily works for me @FSet89

This does not work for me.

lxr-1204 commented 2 months ago

I have solved this issue. For me, the error was OOM (out of memory) in disguise, and it goes away once you address the OOM itself, for example by adding more GPUs or enabling LoRA. You can also reduce max_length (--model_max_length in the script above), but this may truncate tokens, so adjust it based on your dataset. Good luck!
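
The memory-related options mentioned above could be passed as overrides to the training command. A minimal sketch, under these assumptions: `memory_flags` is a hypothetical helper, 8192 is an illustrative value rather than a recommendation, and `--lora_enable` is assumed to be accepted by train.py as it is in the upstream LLaVA trainer (check the argument definitions in your checkout):

```shell
#!/bin/sh
# Hypothetical helper: print memory-reducing overrides for the training command.
#   $1 = maximum sequence length (the original script uses 32768)
#   $2 = 1 to enable LoRA, anything else to leave it off
memory_flags() {
    max_len="${1:-8192}"
    use_lora="${2:-0}"
    flags="--model_max_length ${max_len}"
    if [ "${use_lora}" = "1" ]; then
        # Assumption: the trainer exposes a --lora_enable flag.
        flags="${flags} --lora_enable True"
    fi
    printf '%s\n' "${flags}"
}

# Print the overrides; append them unquoted to the torchrun command line,
# e.g.  ... llava/train/train_mem.py ... $(memory_flags 8192 1)
memory_flags 8192 1
```

Because later occurrences of a flag typically override earlier ones in argparse-style parsers, the original `--model_max_length 32768` line is best edited or removed rather than left alongside the override.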

MinusOne2 commented 1 month ago

Just comment out these lines:


export OMP_NUM_THREADS=8
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO
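
Commenting the lines out only helps if the variables are not already set in the environment, for example because an earlier version of the script was sourced in the same shell session. A minimal sketch that clears them explicitly before launching; `clear_nccl_overrides` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: remove the five variables listed above from the current shell,
# so that NCCL and OpenMP fall back to their own defaults.
clear_nccl_overrides() {
    unset OMP_NUM_THREADS NCCL_IB_DISABLE NCCL_IB_GID_INDEX NCCL_SOCKET_IFNAME NCCL_DEBUG
}

clear_nccl_overrides

# Show any NCCL_* variables that are still set (for example NCCL_P2P_DISABLE).
env | grep '^NCCL_' || echo "no NCCL_* variables set"
```

Keeping NCCL_DEBUG=INFO set may still be useful while debugging, since the error message itself asks for it.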

LLaVA-VL / LLaVA-NeXT

Cuda failure 'invalid argument' #209