Closed KisAaki closed 1 week ago
Hi,
How many GPUs do you have, and of what type? It looks like you do not have 8 GPUs in your machine. Also, can you set the environment variable CUDA_LAUNCH_BLOCKING=1 and run again to get accurate error information?
Best, Runsen
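If the launcher starts more processes than there are visible GPUs, the extra local ranks have no device to bind to, which is one common way to get "invalid device ordinal". A minimal sketch of that check, assuming one process per GPU; `invalid_local_ranks` is a hypothetical helper for illustration, not part of PointLLM or PyTorch:

```python
def invalid_local_ranks(nproc_per_node: int, visible_gpus: int) -> list[int]:
    """Return the local ranks that would have no GPU to bind to.

    Valid device indices run from 0 to visible_gpus - 1, so any local
    rank at or above visible_gpus cannot call set_device successfully.
    """
    return [rank for rank in range(nproc_per_node) if rank >= visible_gpus]


# Illustrative numbers: 8 processes launched on a machine with 7 GPUs.
print(invalid_local_ranks(8, 7))
# Illustrative numbers: 6 processes on the same machine.
print(invalid_local_ranks(6, 7))
```

On a real machine, `torch.cuda.device_count()` reports the number of GPUs visible to the process and can be passed as `visible_gpus`.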
Thanks for your comment. We have 7 GPUs, all NVIDIA GeForce RTX 4090 with 24564 MB of memory each. I can't find where to set the devices, such as the number of GPUs or anything else.
Here is the output when using CUDA_LAUNCH_BLOCKING=1; I hope it helps:
scripts/PointLLM_train_stage1.sh
W0711 01:39:57.623000 140019323921024 torch/distributed/run.py:757]
W0711 01:39:57.623000 140019323921024 torch/distributed/run.py:757]
W0711 01:39:57.623000 140019323921024 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0711 01:39:57.623000 140019323921024 torch/distributed/run.py:757]
[rank7]: Traceback (most recent call last):
[rank7]: File "/data2/2023/yzy/PointLLM/pointllm/train/train_mem.py", line 15, in
[rank7]: train()
[rank7]: File "/data2/2023/yzy/PointLLM/pointllm/train/train.py", line 97, in train
[rank7]: model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank7]: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
[rank7]: obj = dtype(**inputs)
[rank7]: File "", line 120, in __init__
[rank7]: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/training_args.py", line 1227, in __post_init__
[rank7]: and (self.device.type != "cuda")
[rank7]: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/training_args.py", line 1662, in device
[rank7]: return self._setup_devices
[rank7]: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in __get__
[rank7]: cached = self.fget(obj)
[rank7]: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/training_args.py", line 1652, in _setup_devices
[rank7]: torch.cuda.set_device(device)
[rank7]: File "/home/lyc/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 399, in set_device
[rank7]: torch._C._cuda_setDevice(device)
[rank7]: RuntimeError: CUDA error: invalid device ordinal
[rank7]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-07-11 01:40:02 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-11 01:40:02 - ERROR - stderr - warnings.warn(
2024-07-11 01:40:02 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-11 01:40:02 - ERROR - stderr - warnings.warn(
2024-07-11 01:40:02 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-11 01:40:02 - ERROR - stderr - warnings.warn(
2024-07-11 01:40:02 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-11 01:40:02 - ERROR - stderr - warnings.warn(
2024-07-11 01:40:02 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-11 01:40:02 - ERROR - stderr - warnings.warn(
2024-07-11 01:40:02 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-11 01:40:02 - ERROR - stderr - warnings.warn(
2024-07-11 01:40:02 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-11 01:40:02 - ERROR - stderr - warnings.warn(
W0711 01:40:07.642000 140019323921024 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2003760 closing signal SIGTERM
W0711 01:40:07.642000 140019323921024 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2003761 closing signal SIGTERM
W0711 01:40:07.642000 140019323921024 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2003762 closing signal SIGTERM
W0711 01:40:07.643000 140019323921024 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2003763 closing signal SIGTERM
W0711 01:40:07.643000 140019323921024 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2003764 closing signal SIGTERM
W0711 01:40:07.643000 140019323921024 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2003765 closing signal SIGTERM
W0711 01:40:07.643000 140019323921024 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2003766 closing signal SIGTERM
E0711 01:40:08.000000 140019323921024 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 7 (pid: 2003767) of binary: /data1/anaconda3/envs/yzy_pointllm/bin/python
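Traceback (most recent call last):
File "/home/lyc/.local/bin/torchrun", line 8, in
sys.exit(main())
File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
pointllm/train/train_mem.py FAILED
Failures:
Root Cause (first observed failure):
[0]:
time : 2024-07-11_01:40:07
host : nuosen
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 2003767)
error_file:
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html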
Here.
Thanks!!! But some errors still occur. I guess it is because the first GPU I use already has 15594/24564 MiB in use. How can I choose which GPU is used first? I tried changing the nnodes parameter, but it doesn't work for me.
I may not have made it clear enough. When I set nproc_per_node to 6, training starts from GPU 0 of my server, but GPU 0 is currently occupied. How can I start my training from the next card instead (GPU No. 1 to GPU No. 7)?
Use CUDA_VISIBLE_DEVICES=1,2,3,4,5,6 to avoid using GPU0
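A minimal sketch of how this could be applied when launching from Python. The helper names `visible_devices_env` and `remapped_indices` are hypothetical, introduced only for illustration; the one fact relied on is that CUDA renumbers the visible devices from 0 inside the process, so nproc_per_node should match the number of listed GPUs:

```python
import os


def visible_devices_env(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def remapped_indices(gpu_ids):
    """Map the in-process (logical) device index to the physical GPU index."""
    return {logical: physical for logical, physical in enumerate(gpu_ids)}


gpu_ids = [1, 2, 3, 4, 5, 6]  # skip the occupied GPU 0
env = visible_devices_env(gpu_ids, base_env={})
print(env["CUDA_VISIBLE_DEVICES"])
print(remapped_indices(gpu_ids))

# Assumed usage (not executed here): pass the environment to the launcher,
# with nproc_per_node set to len(gpu_ids) in the training script.
# import subprocess
# subprocess.run(["bash", "scripts/PointLLM_train_stage1.sh"],
#                env=visible_devices_env(gpu_ids))
```

With this setting, local rank 0 runs on physical GPU 1, so the busy GPU 0 is never touched.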
Root Cause (first observed failure):
[0]:
time : 2024-07-11_10:38:41
host : nuosen
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 699542)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
The error messages show this; maybe you changed something.
Also, please format your messages so that they are easy to read before posting.
Sorry, I'll pay attention to it.
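Hello! Sorry to bother you again. I checked the files under PointLLM/pointllm/model/pointbert and did not find a file named CHANGE_ME!.yaml. To rule out errors caused by my own changes, I cloned the repository again from git and reran the experiment, but I still get the same error. In the code I found that the PointBERT config should be read from the file PointTransformer_base_8192point.yaml under PointLLM\pointllm\model\pointbert, so why does this error occur?
The only file in the project that I have changed substantially is probably scripts/PointLLM_train_stage1.sh. With the default setting dir_path=PointLLM, it reports that the path cannot be found, so I changed the paths in this sh file to absolute paths.
Best regards, KisAaki.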
I have checked all the previous issues and carefully read the docs, but I have the same problem here.
I set CUDA_LAUNCH_BLOCKING=1 and still got the error. The error occurs when the model is being saved. I monitored GPU memory usage and found that GPU0 runs out of memory in this phase. When saving weights, the main GPU collects all the weights, which leads to the memory problem (I guess...). I only have A100 40G GPUs for training. Any suggestions for solving this problem?
I am sorry. My README.md is missing an important step. After you download the PointLLM_7B_v1.1 checkpoint, you should modify its config.json and specify the YAML to be used here.
Usually, you should set:
"point_backbone_config_name": "PointTransformer_8192point_2layer"
I have updated the config file. Sorry for the confusion.
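For a checkpoint that was downloaded before the config file was updated, the change can also be made with a few lines of Python. This is a minimal sketch: `set_point_backbone_config` is a hypothetical helper, and the path in the usage comment is a placeholder for wherever the checkpoint was saved.

```python
import json
from pathlib import Path


def set_point_backbone_config(config_path, name="PointTransformer_8192point_2layer"):
    """Set "point_backbone_config_name" in a checkpoint's config.json.

    All other keys in the file are left unchanged.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["point_backbone_config_name"] = name
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Assumed usage; replace the placeholder with the real checkpoint directory:
# set_point_backbone_config("/path/to/PointLLM_7B_v1.1/config.json")
```

Editing the file by hand in a text editor works just as well; the script only avoids typos in the key name.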
As your problem seems different, could you please open another issue and show your training script and complete error messages?
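Hello! Thank you for your patient answers and for putting up with my formatting problems :( However, my experiment still has some issues. The GPUs in our lab have 24564 MiB of memory each, which is not enough for the stage 1 training task. Is there any way for me to continue the experiment? For example, changing certain parameters to reduce memory usage, or spreading a single process across multiple GPUs. I noticed in the paper that you ran the experiments on 8 A100 80G GPUs. If I use the checkpoint, how will my GPUs' performance and memory size affect the experiment? And what benefits would the checkpoint give me? Sorry for asking so many questions; I am new to the area that combines 3D and LLMs, and there is a lot I don't understand yet :(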
ERROR - stderr - [rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU has a total capacity of 23.65 GiB of which 103.06 MiB is free. Including non-PyTorch memory, this process has 23.53 GiB memory in use. Of the allocated memory 23.15 GiB is allocated by PyTorch, and 4.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
It's a bit challenging to run training with 24G of memory. You need to distribute the transformer layers across different GPUs and use float16 for training. A single 24G GPU may not be enough.
Another promising method is to replace the LLM we used, LLaMA-7B, with a smaller LLM such as Phi-2: https://huggingface.co/microsoft/phi-2
Also, you can replace the attention implementation with xformers, which is more memory-efficient. You can refer to https://github.com/haotian-liu/LLaVA/blob/main/scripts/pretrain_xformers.sh
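To see why 24G is tight, here is a rough back-of-the-envelope estimate for the model weights alone. The parameter counts are assumptions (about 7 billion for LLaMA-7B and about 2.7 billion for Phi-2), and the estimate ignores gradients, optimizer state, and activations, all of which add to the total during training:

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights only, in GiB."""
    return num_params * bytes_per_param / GIB


# Assumed, rounded parameter counts for illustration.
models = [("7B model", 7e9), ("2.7B model", 2.7e9)]
dtypes = [("float32", 4), ("float16", 2)]

for model_name, num_params in models:
    for dtype_name, nbytes in dtypes:
        size = weight_memory_gib(num_params, nbytes)
        print(f"{model_name} in {dtype_name}: {size:.1f} GiB")
```

Under these assumptions a 7B model needs about 26 GiB in float32, more than the 23.65 GiB reported in the error above, and about 13 GiB in float16, which is why half precision, a smaller LLM, or splitting layers across GPUs are the options suggested.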
Hi, when I was doing the stage-1 training, I ran into some problems. They seem to be caused by CUDA_DEVICES, but I can't find the device configuration in train.py. Can you help me out? Here are the details:
scripts/PointLLM_train_stage1.sh
W0710 18:28:47.395000 140217301160576 torch/distributed/run.py:757]
W0710 18:28:47.395000 140217301160576 torch/distributed/run.py:757]
W0710 18:28:47.395000 140217301160576 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0710 18:28:47.395000 140217301160576 torch/distributed/run.py:757]
2024-07-10 18:28:52 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-10 18:28:52 - ERROR - stderr - warnings.warn(
2024-07-10 18:28:52 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-10 18:28:52 - ERROR - stderr - warnings.warn(
[rank7]: Traceback (most recent call last):
[rank7]: File "/data2/2023/yzy/PointLLM/pointllm/train/train_mem.py", line 13, in
[rank7]: File "/data2/2023/yzy/PointLLM/pointllm/train/train.py", line 97, in train
[rank7]: model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank7]: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
[rank7]: obj = dtype(**inputs)
[rank7]: File "", line 120, in __init__
[rank7]: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/training_args.py", line 1227, in __post_init__
[rank7]: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/training_args.py", line 1662, in device
[rank7]: return self._setup_devices
[rank7]: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in __get__
[rank7]: cached = self.fget(obj)
[rank7]: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/training_args.py", line 1652, in _setup_devices
[rank7]: File "/home/lyc/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 399, in set_device
[rank7]: RuntimeError: CUDA error: invalid device ordinal
[rank7]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank7]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank7]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-07-10 18:28:52 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-10 18:28:52 - ERROR - stderr - warnings.warn(
2024-07-10 18:28:52 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-10 18:28:52 - ERROR - stderr - warnings.warn(
2024-07-10 18:28:52 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-10 18:28:52 - ERROR - stderr - warnings.warn(
2024-07-10 18:28:52 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-10 18:28:52 - ERROR - stderr - warnings.warn(
2024-07-10 18:28:52 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
2024-07-10 18:28:52 - ERROR - stderr - warnings.warn(
W0710 18:28:57.410000 140217301160576 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2161474 closing signal SIGTERM
W0710 18:28:57.410000 140217301160576 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2161476 closing signal SIGTERM
W0710 18:28:57.410000 140217301160576 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2161477 closing signal SIGTERM
W0710 18:28:57.412000 140217301160576 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2161479 closing signal SIGTERM
W0710 18:28:57.412000 140217301160576 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2161480 closing signal SIGTERM
W0710 18:28:57.412000 140217301160576 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2161482 closing signal SIGTERM
W0710 18:28:57.412000 140217301160576 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2161483 closing signal SIGTERM
E0710 18:28:57.736000 140217301160576 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 7 (pid: 2161485) of binary: /data1/anaconda3/envs/yzy_pointllm/bin/python
Traceback (most recent call last):
File "/home/lyc/.local/bin/torchrun", line 8, in
sys.exit(main())
File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
pointllm/train/train_mem.py FAILED
Failures: