HKUDS / UrbanGPT

[KDD'2024] "UrbanGPT: Spatio-Temporal Large Language Models"
https://urban-gpt.github.io
Apache License 2.0

Model parallelism or data parallelism? What GPU resources were used for training? #3

Closed HuizhaoWang closed 4 months ago

HuizhaoWang commented 5 months ago

What was the GPU setup when fine-tuning vicuna-1.5-7b-16k? Was it model parallelism or data parallelism?

We are trying to reproduce the fine-tuning process with 8 RTX 3090s (24 GB each), but the following error occurred:

File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1158, in convert return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 6 has a total capacty of 23.69 GiB of which 84.94 MiB is free. Process 40141 has 23.61 GiB memory in use. Of the allocated memory 22.62 GiB is allocated by PyTorch, and 8.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF [2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 616 closing signal SIGTERM [2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 618 closing signal SIGTERM [2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 619 closing signal SIGTERM [2024-03-29 16:22:19,104] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 621 closing signal SIGTERM [2024-03-29 16:22:19,104] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 622 closing signal SIGTERM [2024-03-29 16:22:20,873] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 615) of binary: /usr/local/miniconda3/bin/python Traceback (most recent call last): File "/usr/local/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/local/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code exec(code, run_globals) File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 810, in main() File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper return f(*args, **kwargs) File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main run(args) File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run elastic_launch( File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call return launch_agent(self._config, self._entrypoint, list(args)) File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

urbangpt/train/train_mem.py FAILED

Failures:
[1]:
  time       : 2024-03-29_16:22:19
  host       : train-urbangpt-llm-kg-0
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 617)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time       : 2024-03-29_16:22:19
  host       : train-urbangpt-llm-kg-0
  rank       : 5 (local_rank: 5)
  exitcode   : 1 (pid: 620)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2024-03-29_16:22:19
  host       : train-urbangpt-llm-kg-0
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 615)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you please take a look?
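For reference, the OOM message above already hints at one mitigation: setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF to reduce allocator fragmentation. A minimal sketch of that workaround, assuming it is placed at the very top of urbangpt/train/train_mem.py before torch is first imported (the value 128 is an illustrative guess, not a tested setting):

import os

# Configure the CUDA caching allocator before torch creates any CUDA context.
# max_split_size_mb limits block splitting and can reduce fragmentation;
# 128 is an illustrative value, not a recommendation from the maintainers.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # noqa: E402  (torch must be imported after the env var is set)

That said, if the model simply does not fit in 24 GB, allocator tuning alone will not help, which is why the maintainers' reply below points to a lightning-based setup.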

weigan33 commented 5 months ago

same problem

LZH-YS1998 commented 5 months ago

Hi, we trained in parallel on 8 A100-PCIE GPUs (40 GB each). The 3090 may run out of memory; you can refer to GraphGPT and modify the code to use lightning.
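As a rough illustration of what the lightning-based modification could look like (this is a sketch, not the actual GraphGPT/UrbanGPT code; the strategy, precision, and batch settings below are assumptions aimed at reducing per-GPU memory on 24 GB cards):

import lightning.pytorch as pl

# Hypothetical Trainer configuration for 8 x 24 GB GPUs; every value here is
# an assumption, not a setting taken from the UrbanGPT repository.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="deepspeed_stage_2_offload",  # shard optimizer state and offload it to CPU
    precision="16-mixed",                  # mixed precision to cut weight/activation memory
    accumulate_grad_batches=8,             # keep the effective batch size with a small per-GPU batch
    max_epochs=3,
)
# trainer.fit(lit_model, train_dataloader)  # lit_model / train_dataloader are placeholders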

weigan33 commented 5 months ago

Hi, where exactly should the modification be made? Thanks.

LZH-YS1998 commented 5 months ago

Hi, you can modify UrbanGPT's train_st.py file following GraphGPT's instructions. Reference file: train_light
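GraphGPT's train_light reference shows how its training script was ported to lightning; a comparable, purely illustrative wrapper for UrbanGPT might look like the following, where UrbanGPTLitModule, the wrapped model, and the batch format are all assumptions rather than code from either repository:

import lightning.pytorch as pl
import torch

class UrbanGPTLitModule(pl.LightningModule):
    """Illustrative LightningModule wrapper; names and batch format are assumptions."""

    def __init__(self, model, lr=2e-5):
        super().__init__()
        self.model = model  # e.g. the vicuna-1.5-7b-16k based spatio-temporal LLM
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # Assumes an HF-style forward that returns an output object with a .loss field.
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss, prog_bar=True)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.model.parameters(), lr=self.lr)

With a wrapper like this, the Trainer sketch above takes over device placement, sharding/offloading, and mixed precision instead of the raw torchrun launch used by train_mem.py.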