File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 6 has a total capacty of 23.69 GiB of which 84.94 MiB is free. Process 40141 has 23.61 GiB memory in use. Of the allocated memory 22.62 GiB is allocated by PyTorch, and 8.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 616 closing signal SIGTERM
[2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 618 closing signal SIGTERM
[2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 619 closing signal SIGTERM
[2024-03-29 16:22:19,104] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 621 closing signal SIGTERM
[2024-03-29 16:22:19,104] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 622 closing signal SIGTERM
[2024-03-29 16:22:20,873] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 615) of binary: /usr/local/miniconda3/bin/python
Traceback (most recent call last):
File "/usr/local/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 810, in
main()
File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
What does the GPU usage look like when fine-tuning vicuna-1.5-7b-16k? Is it model parallelism or data parallelism?
We are trying to reproduce the fine-tuning on 8x RTX 3090 GPUs (24 GB each), but the run fails with the following error:

File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1158, in convert return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 6 has a total capacty of 23.69 GiB of which 84.94 MiB is free. Process 40141 has 23.61 GiB memory in use. Of the allocated memory 22.62 GiB is allocated by PyTorch, and 8.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF [2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 616 closing signal SIGTERM [2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 618 closing signal SIGTERM [2024-03-29 16:22:19,103] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 619 closing signal SIGTERM [2024-03-29 16:22:19,104] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 621 closing signal SIGTERM [2024-03-29 16:22:19,104] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 622 closing signal SIGTERM [2024-03-29 16:22:20,873] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 615) of binary: /usr/local/miniconda3/bin/python Traceback (most recent call last): File "/usr/local/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/local/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code exec(code, run_globals) File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 810, in
    main()
  File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
urbangpt/train/train_mem.py FAILED
Failures:
[1]:
  time       : 2024-03-29_16:22:19
  host       : train-urbangpt-llm-kg-0
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 617)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time       : 2024-03-29_16:22:19
  host       : train-urbangpt-llm-kg-0
  rank       : 5 (local_rank: 5)
  exitcode   : 1 (pid: 620)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time       : 2024-03-29_16:22:19
  host       : train-urbangpt-llm-kg-0
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 615)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
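Here is our own back-of-the-envelope estimate (our assumptions, not numbers taken from the UrbanGPT code) of why plain data parallelism would overflow 24 GiB: under DDP every rank holds a full replica of the model, and full fine-tuning of a ~7B-parameter model with AdamW in fp16 mixed precision needs roughly 16 bytes per parameter before activations are even counted:

```python
# Rough per-GPU memory for full fine-tuning under plain DDP, where each
# rank holds a complete model replica. All numbers are our assumptions.
PARAMS = 7e9  # approximate parameter count of vicuna-1.5-7b-16k

fp16_weights = 2 * PARAMS     # model weights in fp16
fp16_grads = 2 * PARAMS       # gradients in fp16
fp32_optimizer = 12 * PARAMS  # fp32 master weights + two AdamW moments

total_gib = (fp16_weights + fp16_grads + fp32_optimizer) / 2**30
print(f"~{total_gib:.0f} GiB per rank before activations")  # ~104 GiB
```

If that arithmetic is roughly right, the model can only fit on 24 GiB cards with some form of sharding (e.g. FSDP or DeepSpeed ZeRO), model parallelism, or parameter-efficient tuning, hence our question about which setup the released training script assumes.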
Could you please take a look? Thanks!
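P.S. In the meantime we will try the allocator hint from the error message itself. A minimal sketch, assuming the variable is set before CUDA is initialized (the 128 MB split size is an arbitrary trial value on our part); we expect this only mitigates fragmentation, not a model that simply does not fit:

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is
# initialized, so set it before any CUDA work happens (or export it in
# the shell that launches torchrun). 128 is an arbitrary trial value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the env var on purpose
```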