Closed. udrs closed this issue 5 months ago.
Hi, we will update the code to solve the problem. You can change the code yourself like this: https://github.com/qyc-98/MiniCPM-V/blob/main/finetune/finetune.py#L273
I ran into the same problem; it is still not resolved after applying the change above.
Thank you for your quick reply.
I modified the code following your guidance and ran into a new issue (we set "scale_resolution" to 256, and our image size is 256):
/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [556,0,0], thread: [64,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
[the same assertion repeats for threads [65,0,0] through [95,0,0] of block [556,0,0]]
Traceback (most recent call last):
  File "/home/ubuntu/MiniCPM-V/finetune/finetune.py", line 334, in <module>
    [...] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
0%| | 0/10000 [00:00<?, ?it/s]
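(For anyone debugging this stage: the device-side assert above means an index handed to a CUDA scatter/gather kernel falls outside the target dimension, and it surfaces far from the offending call because CUDA launches are asynchronous; running with CUDA_LAUNCH_BLOCKING=1 helps localize it. Below is a minimal sketch of a host-side bounds check one could temporarily place in front of the failing scatter. The names cur_vllm_emb and image_indices follow the traceback at the bottom of this issue, and the hinted cause in the error message is only a guess, not something confirmed in this thread.)

import torch

def check_scatter_indices(cur_vllm_emb: torch.Tensor, image_indices: torch.Tensor) -> None:
    """Fail early on the host if the scatter indices fall outside dim 0 of the target tensor."""
    # Same index construction as in the traceback further down this issue.
    index = image_indices.view(-1, 1).repeat(1, cur_vllm_emb.shape[-1])
    if index.numel() == 0:
        return
    lo, hi = int(index.min()), int(index.max())
    if lo < 0 or hi >= cur_vllm_emb.shape[0]:
        raise ValueError(
            f"scatter index range [{lo}, {hi}] is outside the target length "
            f"{cur_vllm_emb.shape[0]}; the image placeholder positions may have been "
            "truncated away (for example by a small model_max_length)"
        )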
Are there any problems in the JSON file below?
[
    {
        "id": "0",
        "image": "/home/ubuntu/MiniCPM-V/finetune/haha/xidian.jpg",
        "conversations": [
            {
                "role": "user",
                "content": "
Issue solved, thank you for your help.
We switched to an A100.
I'm using the latest code and the problem is still there:
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [3991,0,0], thread: [88,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Hi @udrs, do you mean you fixed it by switching to an A100 GPU?
I'm using the latest code and the problem is still there: ../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [3991,0,0], thread: [88,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.

I got the same problem, even after updating to the new code.
I also have this issue with the latest code. Is there a way to solve it?
Same problem. Did you guys figure it out?
是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
当前行为 | Current Behavior
Thank you for your great work.
My configuration is:

MODEL="openbmb/MiniCPM-Llama3-V-2_5" # or openbmb/MiniCPM-V-2
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="/home/ubuntu/MiniCPM-V/finetune/haha/xidian.json"
EVAL_DATA="/home/ubuntu/MiniCPM-V/finetune/haha/xidian.json"
LLM_TYPE="llama3" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --remove_unused_columns false \
    --label_names "labels" \
    --prediction_loss_only false \
    --bf16 false \
    --bf16_full_eval false \
    --fp16 true \
    --fp16_full_eval true \
    --do_train \
    --do_eval \
    --tune_vision false \
    --tune_llm false \
    --use_lora true \
    --lora_target_modules "llm..*layers.\d+.self_attn.(q_proj|k_proj)" \
    --model_max_length 128 \
    --max_slice_nums 2 \
    --scale_resolution 128 \
    --max_steps 10000 \
    --eval_steps 1000 \
    --output_dir output/output_minicpmv2_lora \
    --logging_dir output/output_minicpmv2_lora \
    --logging_strategy "steps" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-6 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --gradient_checkpointing true \
    --deepspeed ds_config_zero2.json \
    --report_to "tensorboard" # wandb
Below is the bug log:

(MiniCPM-V) ubuntu@10-60-22-207:~/MiniCPM-V/finetune$ vim finetune_lora.sh
(MiniCPM-V) ubuntu@10-60-22-207:~/MiniCPM-V/finetune$ bash finetune_lora.sh
[2024-06-06 03:47:52,725] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[2024-06-06 03:47:53,467] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-06 03:47:53,467] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 7/7 [00:03<00:00, 2.24it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Currently using LoRA for fine-tuning the MiniCPM-V model.
{'Total': 8564355312, 'Trainable': 116301824}
llm_type=llama3
Loading data...
max_steps is given, it will override any value given in num_train_epochs
Using /home/ubuntu/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py310_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.05588507652282715 seconds
0%| | 0/10000 [00:00<?, ?it/s]
/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
Traceback (most recent call last):
  File "/home/ubuntu/MiniCPM-V/finetune/finetune.py", line 333, in <module>
    train()
  File "/home/ubuntu/MiniCPM-V/finetune/finetune.py", line 323, in train
    trainer.train()
  File "/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/MiniCPM-V/finetune/trainer.py", line 203, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/MiniCPM-V/finetune/trainer.py", line 28, in compute_loss
    outputs = self.model.base_model(data = inputs, use_cache=False)
  File "/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 179, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/openbmb/MiniCPM-Llama3-V-2_5/b9f5fa87759ba195bb866de9ab50510a5fe91bad/modeling_minicpmv.py", line 164, in forward
    vllm_embedding, vision_hidden_states = self.get_vllm_embedding(data)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/openbmb/MiniCPM-Llama3-V-2_5/b9f5fa87759ba195bb866de9ab50510a5fe91bad/modeling_minicpmv.py", line 156, in get_vllm_embedding
    cur_vllm_emb.scatter_(0, image_indices.view(-1, 1).repeat(1, cur_vllm_emb.shape[-1]),
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
0%| | 0/10000 [00:00<?, ?it/s]
[2024-06-06 03:48:09,728] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3339) of binary: /home/ubuntu/miniconda3/envs/MiniCPM-V/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/MiniCPM-V/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune.py FAILED
Failures:
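(Side note for readers who land on this RuntimeError: it is raised whenever an in-place operation, here scatter_, is applied to a tensor that is a view of a leaf parameter with requires_grad=True. The sketch below reproduces the failure in isolation and shows the out-of-place alternative; it is only an illustration of the mechanism under that assumption, not the actual patch linked in the maintainer's reply above, and all names are illustrative.)

import torch

# A leaf parameter and a view of it, loosely mirroring the embedding slice
# that the traceback points at.
emb_table = torch.nn.Parameter(torch.randn(10, 4))  # leaf tensor, requires_grad=True
cur_vllm_emb = emb_table[:6]                         # a view of that leaf
image_indices = torch.tensor([1, 2, 3])
vision_hidden = torch.randn(3, 4)

index = image_indices.view(-1, 1).repeat(1, cur_vllm_emb.shape[-1])

# In-place scatter on the view raises:
#   RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
# cur_vllm_emb.scatter_(0, index, vision_hidden)

# The out-of-place variant builds a new tensor and keeps autograd happy.
patched = cur_vllm_emb.scatter(0, index, vision_hidden)
print(patched.shape)  # torch.Size([6, 4])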