FSet89 opened this issue 2 months ago
Hi, have you solved this problem?
No, I switched to LLaVA, where I didn't encounter it. However, I hope they fix it.
What do you mean by "switched to"? Did you install the LLaVA training package?
Yes, I'm using that repo until this problem is identified/fixed
I have set

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
export NCCL_SHM_DISABLE=1

It temporarily works for me @FSet89
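If you'd rather not edit the script, the same variables can also be set inline when invoking it. A minimal sketch, assuming the script lives at scripts/train/finetune_onevision.sh (adjust the path to your checkout):

NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_SHM_DISABLE=1 NCCL_SOCKET_IFNAME=eth0 bash scripts/train/finetune_onevision.sh

Note that NCCL_SOCKET_IFNAME should point at the network interface your machine actually uses (eth0 here is just an example).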
That doesn't work for me.
I have solved this issue. For me, this error was another form of OOM (out of memory), and you can fix it by addressing the OOM itself, for example by adding more GPUs or enabling LoRA. You can also reduce the max_length, but that may truncate tokens, so adjust it based on your dataset. Good luck!
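To make that concrete, here is a rough sketch of the extra arguments one might add to the training command inside finetune_onevision.sh. The flag names follow the usual LLaVA training arguments (note the length flag is model_max_length rather than max_length), and the values are illustrative; verify both against llava/train/train.py before relying on them:

# Hypothetical sketch: only the added arguments are shown; the rest of the
# original training command in finetune_onevision.sh stays as it was.
deepspeed llava/train/train_mem.py \
    --lora_enable True \
    --lora_r 64 \
    --lora_alpha 128 \
    --model_max_length 4096

Enabling LoRA keeps the base weights frozen and trains small adapter matrices, which cuts optimizer and gradient memory; lowering model_max_length reduces activation memory at the cost of possible truncation of long samples.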
Just comment these lines out...
export OMP_NUM_THREADS=8
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO
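In other words, those lines in finetune_onevision.sh would become:

# export OMP_NUM_THREADS=8
# export NCCL_IB_DISABLE=0
# export NCCL_IB_GID_INDEX=3
# export NCCL_SOCKET_IFNAME=eth0
# export NCCL_DEBUG=INFO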
I'm running finetune_onevision.sh to fine-tune on my dataset, and I get this error:
Traceback (most recent call last):
  File "/home/ubuntu/LLaVA-NeXT/llava/train/train_mem.py", line 4, in <module>
    train()
  File "/home/ubuntu/LLaVA-NeXT/llava/train/train.py", line 1672, in train
    trainer.train()
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1806, in train
    return inner_training_loop(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2150, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 3077, in training_step
    self.accelerator.backward(loss)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/accelerate/accelerator.py", line 2151, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
    self.engine.backward(loss, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1132, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1483, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1224, in reduce_independent_p_g_buckets_and_remove_grads
    self.reduce_and_partition_ipg_grads()
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1274, in reduce_and_partition_ipg_grads
    grad_partitions = self.__avg_scatter_grads(self.params_in_ipg_bucket)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1343, in __avg_scatter_grads
    grad_partitions_for_rank = reduce_scatter_coalesced(full_grads_for_rank, self.dp_process_group)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/comm/coalesced_collectives.py", line 128, in reduce_scatter_coalesced
    _torch_reduce_scatter_fn(tensor_partition_flat_buffer,
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/comm/coalesced_collectives.py", line 23, in _torch_reduce_scatter_fn
    return instrument_w_nvtx(dist.reduce_scatter_fn)(output_tensor, input_tensor, group=group, async_op=False)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 257, in reduce_scatter_fn
    return reduce_scatter_tensor(output_tensor,
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 289, in reduce_scatter_tensor
    return cdb.reduce_scatter_tensor(output_tensor=output_tensor,
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 263, in reduce_scatter_tensor
    return self.reduce_scatter_function(output_tensor,
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3375, in reduce_scatter_tensor
    work = group._reduce_scatter_base(output, input, opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'
This is the modified script: