+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@584a1774aaad:/# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
root@584a1774aaad:/#
root@584a1774aaad:/workspace/minicpm/finetune# ./finetune_lora.sh
W0720 21:08:27.097000 139678581792768 torch/distributed/run.py:757]
W0720 21:08:27.097000 139678581792768 torch/distributed/run.py:757]
W0720 21:08:27.097000 139678581792768 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0720 21:08:27.097000 139678581792768 torch/distributed/run.py:757]
[2024-07-20 21:08:35,215] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,216] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,233] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,233] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,234] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,239] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,270] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-20 21:08:35,270] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
[2024-07-20 21:08:49,356] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,356] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,357] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,357] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,358] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,358] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,358] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,358] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-07-20 21:08:49,358] [INFO] [comm.py:637:init_distributed] cdb=None
rank5: Traceback (most recent call last):
rank5: File "/workspace/minicpm/finetune/finetune.py", line 281, in <module>
rank5: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank5: ) = parser.parse_args_into_dataclasses()
rank5: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank5: obj = dtype(**inputs)
rank5: File "<string>", line 136, in __init__
rank5: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank5: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank5: return self._setup_devices
rank5: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank5: cached = self.fget(obj)
rank5: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank5: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank5: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank5: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank5: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank5: RuntimeError: CUDA error: invalid device ordinal
rank5: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank5: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank5: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
rank3: Traceback (most recent call last):
rank3: File "/workspace/minicpm/finetune/finetune.py", line 281, in <module>
rank3: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank3: ) = parser.parse_args_into_dataclasses()
rank3: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank3: obj = dtype(**inputs)
rank3: File "<string>", line 136, in __init__
rank3: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank3: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank3: return self._setup_devices
rank3: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank3: cached = self.fget(obj)
rank3: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank3: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank3: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank3: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank3: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank3: RuntimeError: CUDA error: invalid device ordinal
rank3: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank3: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank3: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
rank6: Traceback (most recent call last):
rank6: File "/workspace/minicpm/finetune/finetune.py", line 281, in <module>
rank6: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank6: ) = parser.parse_args_into_dataclasses()
rank6: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank6: obj = dtype(**inputs)
rank6: File "<string>", line 136, in __init__
rank6: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank6: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank6: return self._setup_devices
rank6: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank6: cached = self.fget(obj)
rank6: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank6: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank6: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank6: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank6: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank6: RuntimeError: CUDA error: invalid device ordinal
rank6: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank6: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank6: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
rank2: Traceback (most recent call last):
rank2: File "/workspace/minicpm/finetune/finetune.py", line 281, in <module>
rank2: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank2: ) = parser.parse_args_into_dataclasses()
rank2: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank2: obj = dtype(**inputs)
rank2: File "<string>", line 136, in __init__
rank2: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank2: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank2: return self._setup_devices
rank2: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank2: cached = self.fget(obj)
rank2: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank2: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank2: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank2: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank2: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank2: RuntimeError: CUDA error: invalid device ordinal
rank2: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank2: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank2: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
rank4: Traceback (most recent call last):
rank4: File "/workspace/minicpm/finetune/finetune.py", line 281, in <module>
rank4: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank4: ) = parser.parse_args_into_dataclasses()
rank4: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank4: obj = dtype(**inputs)
rank4: File "<string>", line 136, in __init__
rank4: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank4: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank4: return self._setup_devices
rank4: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank4: cached = self.fget(obj)
rank4: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank4: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank4: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank4: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank4: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank4: RuntimeError: CUDA error: invalid device ordinal
rank4: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank4: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank4: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
rank7: Traceback (most recent call last):
rank7: File "/workspace/minicpm/finetune/finetune.py", line 281, in <module>
rank7: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank7: ) = parser.parse_args_into_dataclasses()
rank7: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank7: obj = dtype(**inputs)
rank7: File "<string>", line 136, in __init__
rank7: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank7: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank7: return self._setup_devices
rank7: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank7: cached = self.fget(obj)
rank7: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank7: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank7: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank7: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank7: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank7: RuntimeError: CUDA error: invalid device ordinal
rank7: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank7: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank7: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
config.json: 100%|████████████████████████████████████████████████████████| 1.37k/1.37k [00:00<00:00, 6.74MB/s]
configuration_minicpm.py: 100%|███████████████████████████████████████████| 4.06k/4.06k [00:00<00:00, 20.8MB/s]
A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5:
configuration_minicpm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5:
configuration_minicpm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
modeling_minicpmv.py: 100%|███████████████████████████████████████████████| 25.0k/25.0k [00:00<00:00, 58.2MB/s]
resampler.py: 100%|███████████████████████████████████████████████████████| 35.8k/35.8k [00:00<00:00, 80.6MB/s]
A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5:
resampler.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5:
modeling_minicpmv.py
resampler.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5:
resampler.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5:
modeling_minicpmv.py
resampler.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
W0720 21:08:52.136000 139678581792768 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1751 closing signal SIGTERM
W0720 21:08:52.137000 139678581792768 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1752 closing signal SIGTERM
E0720 21:08:52.466000 139678581792768 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 2 (pid: 1753) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/workspace/pypacks/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/workspace/pypacks/torch/distributed/run.py", line 879, in main
run(args)
File "/workspace/pypacks/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/workspace/pypacks/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/pypacks/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Is there an existing issue / discussion for this?
Is there an existing answer for this in FAQ?
Current Behavior
I am trying to fine-tune a model using the sample code (finetune_lora.sh, finetune.py, dataset.py, trainer.py) provided in your GitHub repository.
I have set export CUDA_VISIBLE_DEVICES=0,1, so only two GPUs are visible.
root@584a1774aaad:/# nvidia-smi
Sat Jul 20 21:07:53 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:0A:00.0 Off |                    0 |
| N/A   32C    P0              61W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:44:00.0 Off |                    0 |
| N/A   33C    P0              61W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
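The log for this run shows eight worker processes starting (eight `get_accelerator` lines, with ranks 2 through 7 failing) while only two GPUs are visible. As a rough illustration of that mismatch, here is a minimal pre-launch check in plain Python. The function names and the `nproc_per_node` argument are hypothetical and not taken from `finetune_lora.sh`, and the parsing of `CUDA_VISIBLE_DEVICES` is deliberately simplistic (it only counts comma-separated entries).

```python
import os


def visible_gpu_count(env=None):
    """Count entries in CUDA_VISIBLE_DEVICES; return None if it is unset.

    Simplified assumption: every non-empty comma-separated entry is one device.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([entry for entry in value.split(",") if entry.strip()])


def check_world_size(nproc_per_node, env=None):
    """Raise ValueError if more local processes are requested than visible GPUs.

    `nproc_per_node` is a hypothetical stand-in for whatever process count
    the launch script passes to the launcher.
    """
    visible = visible_gpu_count(env)
    if visible is not None and nproc_per_node > visible:
        raise ValueError(
            f"{nproc_per_node} processes requested but only {visible} GPU(s) "
            "are listed in CUDA_VISIBLE_DEVICES; local ranks >= "
            f"{visible} would have no device to bind to"
        )
    return visible


if __name__ == "__main__":
    # Mirrors the situation in this report: 2 visible GPUs, 8 processes.
    try:
        check_world_size(8, {"CUDA_VISIBLE_DEVICES": "0,1"})
    except ValueError as err:
        print(err)
```

With two visible devices, a request for eight processes is rejected up front instead of surfacing later as `invalid device ordinal` on the extra ranks.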
+---------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=======================================================================================| | No running processes found | +---------------------------------------------------------------------------------------+ root@584a1774aaad:/# nvcc --version nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2023 NVIDIA Corporation Built on Mon_Apr__3_17:16:06_PDT_2023 Cuda compilation tools, release 12.1, V12.1.105 Build cuda_12.1.r12.1/compiler.32688072_0 root@584a1774aaad:/#
root@584a1774aaad:/workspace/minicpm/finetune# ./finetune_lora.sh W0720 21:08:27.097000 139678581792768 torch/distributed/run.py:757] W0720 21:08:27.097000 139678581792768 torch/distributed/run.py:757] W0720 21:08:27.097000 139678581792768 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0720 21:08:27.097000 139678581792768 torch/distributed/run.py:757] [2024-07-20 21:08:35,215] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-07-20 21:08:35,216] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-07-20 21:08:35,233] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-07-20 21:08:35,233] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-07-20 21:08:35,234] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-07-20 21:08:35,239] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-07-20 21:08:35,270] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-07-20 21:08:35,270] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) df: /root/.triton/autotune: No such file or directory [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. 
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io requires the dev libaio .so object and headers but these were not found. 
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/workspace/pypacks/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-07-20 21:08:49,356] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,356] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,357] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,357] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,358] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,358] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,358] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-20 21:08:49,358] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-07-20 21:08:49,358] [INFO] [comm.py:637:init_distributed] cdb=None
rank5: Traceback (most recent call last):
rank5: File "/workspace/minicpm/finetune/finetune.py", line 281, in <module>
rank5: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank5: ) = parser.parse_args_into_dataclasses()
rank5: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank5: obj = dtype(**inputs)
rank5: File "<string>", line 136, in __init__
rank5: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank5: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank5: return self._setup_devices
rank5: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank5: cached = self.fget(obj)
rank5: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank5: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank5: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank5: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank5: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank5: RuntimeError: CUDA error: invalid device ordinal
rank5: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank5: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank5: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
rank3: Traceback (most recent call last):
rank3: File "/workspace/minicpm/finetune/finetune.py", line 281, in <module>
rank3: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank3: ) = parser.parse_args_into_dataclasses()
rank3: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank3: obj = dtype(**inputs)
rank3: File "<string>", line 136, in __init__
rank3: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank3: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank3: return self._setup_devices
rank3: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank3: cached = self.fget(obj)
rank3: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank3: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank3: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank3: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank3: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank3: RuntimeError: CUDA error: invalid device ordinal
rank3: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank3: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank3: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
rank6: Traceback (most recent call last):
rank6: File "/workspace/minicpm/finetune/finetune.py", line 281, in <module>
rank6: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank6: ) = parser.parse_args_into_dataclasses()
rank6: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank6: obj = dtype(**inputs)
rank6: File "<string>", line 136, in __init__
rank6: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank6: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank6: return self._setup_devices
rank6: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank6: cached = self.fget(obj)
rank6: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank6: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank6: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank6: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank6: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank6: RuntimeError: CUDA error: invalid device ordinal
rank6: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank6: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank6: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
rank2: Traceback (most recent call last):
rank2: File "/workspace/minicpm/finetune/finetune.py", line 281, in <module>
rank2: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank2: ) = parser.parse_args_into_dataclasses()
rank2: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank2: obj = dtype(**inputs)
rank2: File "<string>", line 136, in __init__
rank2: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank2: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank2: return self._setup_devices
rank2: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank2: cached = self.fget(obj)
rank2: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank2: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank2: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank2: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank2: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank2: RuntimeError: CUDA error: invalid device ordinal
rank2: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank2: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank2: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
rank4: Traceback (most recent call last):
rank4: File "/workspace/minicpm/finetune/finetune.py", line 281, in <module>
rank4: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank4: ) = parser.parse_args_into_dataclasses()
rank4: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank4: obj = dtype(**inputs)
rank4: File "<string>", line 136, in __init__
rank4: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank4: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank4: return self._setup_devices
rank4: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank4: cached = self.fget(obj)
rank4: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank4: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank4: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank4: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank4: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank4: RuntimeError: CUDA error: invalid device ordinal
rank4: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank4: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank4: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
rank7: Traceback (most recent call last):
rank7: File "/workspace/minicpm/finetune/finetune.py", line 281, in <module>
rank7: File "/workspace/minicpm/finetune/finetune.py", line 162, in train
rank7: ) = parser.parse_args_into_dataclasses()
rank7: File "/workspace/pypacks/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank7: obj = dtype(**inputs)
rank7: File "<string>", line 136, in __init__
rank7: File "/workspace/pypacks/transformers/training_args.py", line 1693, in __post_init__
rank7: File "/workspace/pypacks/transformers/training_args.py", line 2171, in device
rank7: return self._setup_devices
rank7: File "/workspace/pypacks/transformers/utils/generic.py", line 60, in __get__
rank7: cached = self.fget(obj)
rank7: File "/workspace/pypacks/transformers/training_args.py", line 2108, in _setup_devices
rank7: self.distributed_state = PartialState(**accelerator_state_kwargs)
rank7: File "/workspace/pypacks/accelerate/state.py", line 280, in __init__
rank7: File "/workspace/pypacks/accelerate/state.py", line 790, in set_device
rank7: File "/workspace/pypacks/torch/cuda/__init__.py", line 399, in set_device
rank7: RuntimeError: CUDA error: invalid device ordinal
rank7: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank7: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank7: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
config.json: 100%|████████████████████████████████████████████████████████| 1.37k/1.37k [00:00<00:00, 6.74MB/s]
configuration_minicpm.py: 100%|███████████████████████████████████████████| 4.06k/4.06k [00:00<00:00, 20.8MB/s]
A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5:
- resampler.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
W0720 21:08:52.136000 139678581792768 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1751 closing signal SIGTERM
W0720 21:08:52.137000 139678581792768 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1752 closing signal SIGTERM
E0720 21:08:52.466000 139678581792768 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 2 (pid: 1753) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/workspace/pypacks/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/workspace/pypacks/torch/distributed/run.py", line 879, in main
run(args)
File "/workspace/pypacks/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/workspace/pypacks/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/pypacks/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune.py FAILED
Failures:
[1]:
  time      : 2024-07-20_21:08:52
  host      : 584a1774aaad
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1754)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-20_21:08:52
  host      : 584a1774aaad
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 1755)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-20_21:08:52
  host      : 584a1774aaad
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 1756)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-07-20_21:08:52
  host      : 584a1774aaad
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 1757)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-07-20_21:08:52
  host      : 584a1774aaad
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 1758)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2024-07-20_21:08:52
  host      : 584a1774aaad
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1753)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
root@584a1774aaad:/workspace/minicpm/finetune#
Expected Behavior
Fine-tuning should run to completion without CUDA errors.
Steps To Reproduce
Prepare a few rows and images in the format of vl_finetune_data.json, then run ./finetune_lora.sh on Ubuntu.
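In the log above, local ranks 2 through 7 fail in `torch.cuda.set_device` with `invalid device ordinal`, while the processes for ranks 0 and 1 only receive SIGTERM. One possible reading, which this report does not confirm, is that the launcher starts more processes than there are GPUs visible in the container. The Python sketch below is one way to check that before launching. The helper names `out_of_range_ranks` and `visible_gpu_count` are hypothetical, and the process count of 8 is an assumption inferred from the highest rank in the log.

```python
from __future__ import annotations


def out_of_range_ranks(nproc_per_node: int, visible_gpus: int) -> list[int]:
    """Return the local ranks whose device index (equal to the local rank)
    would not exist when only `visible_gpus` devices are visible."""
    return [rank for rank in range(nproc_per_node) if rank >= visible_gpus]


def visible_gpu_count() -> int:
    """Number of CUDA devices PyTorch can see; 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


if __name__ == "__main__":
    nproc = 8  # assumption: inferred from rank7 being the highest rank in the log
    gpus = visible_gpu_count()
    bad = out_of_range_ranks(nproc, gpus)
    print(f"visible GPUs: {gpus}, processes per node: {nproc}")
    if bad:
        print(f"local ranks without a matching device: {bad}")
```

With 8 processes and 2 visible devices, `out_of_range_ranks(8, 2)` returns `[2, 3, 4, 5, 6, 7]`, which matches the set of ranks that failed here. If that is the cause, the number of processes per node in the launch script would need to match the number of visible GPUs.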
Environment
Anything else?
No response