RuntimeError: CUDA error: invalid device ordinal

Is there an existing issue / discussion for this?

[X] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

[X] I have searched FAQ

Current Behavior

I am trying to fine-tune a model using the sample code (finetune_lora.sh, finetune.py, dataset.py, trainer.py) provided in your GitHub repository.

Before launching, I set `export CUDA_VISIBLE_DEVICES=0,1`.

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@584a1774aaad:/# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
root@584a1774aaad:/#

root@584a1774aaad:/workspace/minicpm/finetune# ./finetune_lora.sh
W0720 21:08:27.097000 139678581792768 torch/distributed/run.py:757]
W0720 21:08:27.097000 139678581792768 torch/distributed/run.py:757]
W0720 21:08:27.097000 139678581792768 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0720 21:08:27.097000 139678581792768 torch/distributed/run.py:757]
[2024-07-20 21:08:35,215] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,216] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,233] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,233] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,234] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,239] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,270] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,270] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
[2024-07-20 21:08:49,356] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,356] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,357] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,357] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,358] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,358] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,358] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,358] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-07-20 21:08:49,358] [INFO] [comm.py:637:init_distributed] cdb=None
rank5: Traceback (most recent call last):
rank5: File "/workspace/minicpm/finetune/finetune.py", line 281, in

rank5: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank5: ) = parser.parse_args_into_dataclasses()
rank5: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank5: obj = dtype(**inputs)
rank5: File "", line 136, in __init__
rank5: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank5: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank5: return self._setup_devices
rank5: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank5: cached = self.fget(obj)
rank5: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank5: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank5: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank5: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank5: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank5: RuntimeError: CUDA error: invalid device ordinal
rank5: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank5: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank5: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

rank3: Traceback (most recent call last):
rank3: File "/workspace/minicpm/finetune/finetune.py", line 281, in
rank3: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank3: ) = parser.parse_args_into_dataclasses()
rank3: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank3: obj = dtype(**inputs)
rank3: File "", line 136, in __init__
rank3: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank3: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank3: return self._setup_devices
rank3: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank3: cached = self.fget(obj)
rank3: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank3: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank3: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank3: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank3: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank3: RuntimeError: CUDA error: invalid device ordinal
rank3: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank3: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank3: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

rank6: Traceback (most recent call last):
rank6: File "/workspace/minicpm/finetune/finetune.py", line 281, in
rank6: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank6: ) = parser.parse_args_into_dataclasses()
rank6: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank6: obj = dtype(**inputs)
rank6: File "", line 136, in __init__
rank6: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank6: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank6: return self._setup_devices
rank6: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank6: cached = self.fget(obj)
rank6: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank6: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank6: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank6: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank6: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank6: RuntimeError: CUDA error: invalid device ordinal
rank6: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank6: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank6: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

rank2: Traceback (most recent call last):
rank2: File "/workspace/minicpm/finetune/finetune.py", line 281, in
rank2: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank2: ) = parser.parse_args_into_dataclasses()
rank2: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank2: obj = dtype(**inputs)
rank2: File "", line 136, in __init__
rank2: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank2: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank2: return self._setup_devices
rank2: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank2: cached = self.fget(obj)
rank2: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank2: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank2: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank2: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank2: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank2: RuntimeError: CUDA error: invalid device ordinal
rank2: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank2: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank2: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

rank4: Traceback (most recent call last):
rank4: File "/workspace/minicpm/finetune/finetune.py", line 281, in
rank4: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank4: ) = parser.parse_args_into_dataclasses()
rank4: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank4: obj = dtype(**inputs)
rank4: File "", line 136, in __init__
rank4: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank4: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank4: return self._setup_devices
rank4: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank4: cached = self.fget(obj)
rank4: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank4: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank4: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank4: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank4: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank4: RuntimeError: CUDA error: invalid device ordinal
rank4: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank4: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank4: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

rank7: Traceback (most recent call last):
rank7: File "/workspace/minicpm/finetune/finetune.py", line 281, in
rank7: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank7: ) = parser.parse_args_into_dataclasses()
rank7: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank7: obj = dtype(**inputs)
rank7: File "", line 136, in __init__
rank7: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank7: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank7: return self._setup_devices
rank7: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank7: cached = self.fget(obj)
rank7: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank7: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank7: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank7: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank7: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank7: RuntimeError: CUDA error: invalid device ordinal
rank7: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank7: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank7: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

config.json: 100%|████████████████████████████████████████████████████████| 1.37k/1.37k [00:00<00:00, 6.74MB/s]
configuration_minicpm.py: 100%|███████████████████████████████████████████| 4.06k/4.06k [00:00<00:00, 20.8MB/s]
A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5:
configuration_minicpm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5:
configuration_minicpm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
modeling_minicpmv.py: 100%|███████████████████████████████████████████████| 25.0k/25.0k [00:00<00:00, 58.2MB/s]
resampler.py: 100%|███████████████████████████████████████████████████████| 35.8k/35.8k [00:00<00:00, 80.6MB/s]
A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5:
resampler.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5:
modeling_minicpmv.py
resampler.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5:
resampler.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5:
modeling_minicpmv.py
resampler.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
W0720 21:08:52.136000 139678581792768 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1751 closing signal SIGTERM
W0720 21:08:52.137000 139678581792768 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1752 closing signal SIGTERM
E0720 21:08:52.466000 139678581792768 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 2 (pid: 1753) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in
    sys.exit(main())
  File "/workspace/pypacks/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/workspace/pypacks/torch/distributed/run.py", line 879, in main
    run(args)
  File "/workspace/pypacks/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/workspace/pypacks/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/pypacks/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune.py FAILED

Failures:
[1]:
  time      : 2024-07-20_21:08:52
  host      : 584a1774aaad
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1754)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-20_21:08:52
  host      : 584a1774aaad
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 1755)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-20_21:08:52
  host      : 584a1774aaad
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 1756)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-07-20_21:08:52
  host      : 584a1774aaad
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 1757)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-07-20_21:08:52
  host      : 584a1774aaad
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 1758)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2024-07-20_21:08:52
  host      : 584a1774aaad
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1753)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

root@584a1774aaad:/workspace/minicpm/finetune#
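
One possible reading of the log above, offered as an assumption rather than a confirmed diagnosis: `CUDA_VISIBLE_DEVICES=0,1` exposes two devices, while the launcher started eight worker processes (local ranks up to 7 appear in the log), so local ranks 2 through 7 would have no matching device ordinal. The sketch below only reproduces that arithmetic. The helper names `visible_device_count` and `out_of_range_local_ranks` are hypothetical and are not part of the repository, PyTorch, or torchrun.

```python
import os


def visible_device_count(env=None):
    """Count the entries listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, because then the count cannot
    be derived from the environment alone. This is a simplification: it
    counts comma-separated entries and does not validate them the way the
    CUDA runtime does.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([entry for entry in value.split(",") if entry.strip()])


def out_of_range_local_ranks(nproc_per_node, visible):
    """Return the local ranks that have no matching device ordinal."""
    return [rank for rank in range(nproc_per_node) if rank >= visible]


if __name__ == "__main__":
    # Values taken from this report: two visible devices, and eight worker
    # processes inferred from the local ranks in the log (assumption).
    count = visible_device_count({"CUDA_VISIBLE_DEVICES": "0,1"})
    print(count)
    print(out_of_range_local_ranks(8, count))
```

If this reading is right, the ranks it reports as out of range are the same ones listed under "Failures" and "Root Cause" in the log, and the number of processes per node requested in finetune_lora.sh would need to match the number of visible devices.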

Expected Behavior

Fine-tuning should run without CUDA errors.

Steps To Reproduce

Prepare a few rows and images in the format of vl_finetune_data.json, then run ./finetune_lora.sh on Ubuntu.

Environment

- OS: Ubuntu 22.04
- Python: 3.10.12
- Transformers: 4.42.4
- PyTorch: 2.3.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1
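
For reference, the version numbers above can be gathered in one step with a small standard-library script. This is a minimal sketch: the function name `collect_versions` is hypothetical, and the default package list is an assumption based on the libraries that appear in the log (accelerate and deepspeed are not listed in the environment section).

```python
import platform
from importlib import metadata


def collect_versions(packages=("torch", "transformers", "accelerate", "deepspeed")):
    """Return a dict mapping "python" and each package name to its version.

    A package that is not installed maps to None instead of raising.
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

The CUDA version that PyTorch was built against still has to be read from `torch.version.cuda`, as in the command quoted in the list above.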

Anything else?

No response

OpenBMB / MiniCPM-V