File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 6 has a total capacty of 23.69 GiB of which 84.94 MiB is free. Process 40141 has 23.61 GiB memory in use. Of the allocated memory 22.62 GiB is allocated by PyTorch, and 8.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 616 closing signal SIGTERM
[2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 618 closing signal SIGTERM
[2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 619 closing signal SIGTERM
[2024-03-29 16:22:19,104] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 621 closing signal SIGTERM
[2024-03-29 16:22:19,104] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 622 closing signal SIGTERM
[2024-03-29 16:22:20,873] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 615) of binary: /usr/local/miniconda3/bin/python
Traceback (most recent call last):
File "/usr/local/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 810, in
main()
File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
What does the GPU usage look like when fine-tuning vicuna-1.5-7b-16k? Is it model parallelism or data parallelism?
We are trying to reproduce the fine-tuning on 8x RTX 3090 GPUs (24 GB each), but the run fails with the following error:

File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1158, in convert return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 6 has a total capacty of 23.69 GiB of which 84.94 MiB is free. Process 40141 has 23.61 GiB memory in use. Of the allocated memory 22.62 GiB is allocated by PyTorch, and 8.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF [2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 616 closing signal SIGTERM [2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 618 closing signal SIGTERM [2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 619 closing signal SIGTERM [2024-03-29 16:22:19,104] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 621 closing signal SIGTERM [2024-03-29 16:22:19,104] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 622 closing signal SIGTERM [2024-03-29 16:22:20,873] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 615) of binary: /usr/local/miniconda3/bin/python Traceback (most recent call last): File "/usr/local/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/local/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code exec(code, run_globals) File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 810, in
    main()
  File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
urbangpt/train/train_mem.py FAILED
Failures:
[1]:
  time       : 2024-03-29_16:22:19
  host       : train-urbangpt-llm-kg-0
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 617)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time       : 2024-03-29_16:22:19
  host       : train-urbangpt-llm-kg-0
  rank       : 5 (local_rank: 5)
  exitcode   : 1 (pid: 620)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time       : 2024-03-29_16:22:19
  host       : train-urbangpt-llm-kg-0
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 615)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
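Here is our own back-of-the-envelope estimate (our assumptions, not numbers taken from the UrbanGPT code) of why plain data parallelism would overflow 24 GiB: under DDP every rank holds a full replica of the model, and full fine-tuning of a ~7B-parameter model with AdamW in fp16 mixed precision needs roughly 16 bytes per parameter before activations are even counted:

```python
# Rough per-GPU memory for full fine-tuning under plain DDP, where each
# rank holds a complete model replica. All numbers are our assumptions.
PARAMS = 7e9  # approximate parameter count of vicuna-1.5-7b-16k

fp16_weights = 2 * PARAMS     # model weights in fp16
fp16_grads = 2 * PARAMS       # gradients in fp16
fp32_optimizer = 12 * PARAMS  # fp32 master weights + two AdamW moments

total_gib = (fp16_weights + fp16_grads + fp32_optimizer) / 2**30
print(f"~{total_gib:.0f} GiB per rank before activations")  # ~104 GiB
```

If that arithmetic is roughly right, the model can only fit on 24 GiB cards with some form of sharding (e.g. FSDP or DeepSpeed ZeRO), model parallelism, or parameter-efficient tuning, hence our question about which setup the released training script assumes.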
Could you please take a look? Thanks!
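P.S. In the meantime we will try the allocator hint from the error message itself. A minimal sketch, assuming the variable is set before CUDA is initialized (the 128 MB split size is an arbitrary trial value on our part); we expect this only mitigates fragmentation, not a model that simply does not fit:

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is
# initialized, so set it before any CUDA work happens (or export it in
# the shell that launches torchrun). 128 is an arbitrary trial value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the env var on purpose
```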