OpenRobotLab / PointLLM

[ECCV 2024] PointLLM: Empowering Large Language Models to Understand Point Clouds
https://runsenxu.com/projects/PointLLM

Training error #31

Closed KisAaki closed 1 week ago

KisAaki commented 2 weeks ago

Hi, when I was doing the stage-1 training, I ran into some problems. It seems the issue is caused by the CUDA devices, but I can't find the device configuration in train.py. Can you help me out? Here are the details:

scripts/PointLLM_train_stage1.sh
W0710 18:28:47.395000 140217301160576 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2024-07-10 18:28:52 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
2024-07-10 18:28:52 - ERROR - stderr - warnings.warn(
[the same FutureWarning is printed once per rank]

rank7: Traceback (most recent call last):
rank7: File "/data2/2023/yzy/PointLLM/pointllm/train/train_mem.py", line 13, in <module>
rank7: File "/data2/2023/yzy/PointLLM/pointllm/train/train.py", line 97, in train
rank7: model_args, data_args, training_args = parser.parse_args_into_dataclasses()
rank7: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
rank7: obj = dtype(**inputs)
rank7: File "<string>", line 120, in __init__
rank7: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/training_args.py", line 1227, in __post_init__
rank7: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/training_args.py", line 1662, in device
rank7: return self._setup_devices
rank7: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in __get__
rank7: cached = self.fget(obj)
rank7: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/training_args.py", line 1652, in _setup_devices
rank7: File "/home/lyc/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 399, in set_device
rank7: RuntimeError: CUDA error: invalid device ordinal
rank7: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank7: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank7: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

W0710 18:28:57.410000 140217301160576 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2161474 closing signal SIGTERM
[the same SIGTERM message is printed for processes 2161476, 2161477, 2161479, 2161480, 2161482, and 2161483]
E0710 18:28:57.736000 140217301160576 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 7 (pid: 2161485) of binary: /data1/anaconda3/envs/yzy_pointllm/bin/python
Traceback (most recent call last):
  File "/home/lyc/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lyc/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

pointllm/train/train_mem.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-07-10_18:28:57
  host       : nuosen
  rank       : 7 (local_rank: 7)
  exitcode   : 1 (pid: 2161485)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
RunsenXu commented 2 weeks ago

Hi,

How many GPUs do you have, and what type are they? It looks like you don't have 8 GPUs in your machine. Also, can you set CUDA_LAUNCH_BLOCKING=1 as an environment variable and run again to get more accurate error info?
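
For reference, a minimal way to check how many GPUs PyTorch can see and to rerun the stage-1 script with synchronous CUDA error reporting (the script path is the one from your log; adjust it to your checkout):

```bash
# How many GPUs does PyTorch see on this machine?
python -c "import torch; print(torch.cuda.device_count())"

# Rerun stage-1 training with synchronous CUDA error reporting
CUDA_LAUNCH_BLOCKING=1 bash scripts/PointLLM_train_stage1.sh
```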

Best, Runsen

KisAaki commented 2 weeks ago

Thanks for your comment. We have 7 GPUs, all NVIDIA GeForce RTX 4090, each with 24564 MiB of memory. I can't find where to set the devices, such as the device numbers or anything similar.

Here is the info when using CUDA_LAUNCH_BLOCKING=1; I hope it helps:

scripts/PointLLM_train_stage1.sh
W0711 01:39:57.623000 140019323921024 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

rank7: Traceback (most recent call last):
rank7: File "/data2/2023/yzy/PointLLM/pointllm/train/train_mem.py", line 15, in <module>
rank7: File "/data2/2023/yzy/PointLLM/pointllm/train/train.py", line 97, in train
rank7: model_args, data_args, training_args = parser.parse_args_into_dataclasses()
rank7: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
rank7: obj = dtype(**inputs)
rank7: File "<string>", line 120, in __init__
rank7: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/training_args.py", line 1227, in __post_init__
rank7: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/training_args.py", line 1662, in device
rank7: return self._setup_devices
rank7: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in __get__
rank7: cached = self.fget(obj)
rank7: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/training_args.py", line 1652, in _setup_devices
rank7: File "/home/lyc/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 399, in set_device
rank7: RuntimeError: CUDA error: invalid device ordinal
rank7: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

2024-07-11 01:40:02 - ERROR - stderr - /home/lyc/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
2024-07-11 01:40:02 - ERROR - stderr - warnings.warn(
[the same FutureWarning is printed once per rank]
W0711 01:40:07.642000 140019323921024 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2003760 closing signal SIGTERM
[the same SIGTERM message is printed for processes 2003761 through 2003766]
E0711 01:40:08.000000 140019323921024 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 7 (pid: 2003767) of binary: /data1/anaconda3/envs/yzy_pointllm/bin/python
[torchrun then raises torch.distributed.elastic.multiprocessing.errors.ChildFailedError through the same launcher traceback as above]

pointllm/train/train_mem.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-07-11_01:40:07
  host       : nuosen
  rank       : 7 (local_rank: 7)
  exitcode   : 1 (pid: 2003767)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
RunsenXu commented 2 weeks ago
[screenshot]

Here.

KisAaki commented 2 weeks ago

Thanks!!! But some errors still happen. I guess it's because the first GPU I use already has 15594/24564 MiB in use. How can I choose which GPU is used first? I tried changing the nnodes parameter, but it doesn't work for me.

KisAaki commented 2 weeks ago

I may not have made it clear enough. When I set nproc_per_node to 6, it still starts from GPU 0 of my server, but GPU 0 is currently occupied. How can I start my training from the other cards (GPU 1 through GPU 7)?

RunsenXu commented 2 weeks ago

Use CUDA_VISIBLE_DEVICES=1,2,3,4,5,6 to avoid using GPU0
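
For example, a sketch of how this can be combined with the launch (the torchrun flags themselves live inside the script, so adjust them there if needed):

```bash
# Hide GPU 0 from the launcher; ranks 0-5 then map to physical GPUs 1-6
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6 bash scripts/PointLLM_train_stage1.sh

# If you edit the script instead, keep torchrun's --nproc_per_node equal to
# the number of GPUs listed in CUDA_VISIBLE_DEVICES (6 in this example).
```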

KisAaki commented 2 weeks ago

Hi, thanks for your reply even this late in the evening, and sorry to disturb you again. I added CUDA_VISIBLE_DEVICES=1,2,3,4 before 'scripts/PointLLM_train_stage1.sh' in the terminal; I'm not sure whether that is the right way to do it. Also, while running the code it now refers to a file that doesn't even exist; the name looks like some kind of placeholder or garbled characters, and searching the Internet turned up nothing. Sorry, I'm new to LLMs, so there is a lot I still need to learn. It says I don't have a file called 'CHANGE_ME!.yaml'. Here is the error info:

scripts/PointLLM_train_stage1.sh
W0711 10:37:46.582000 140480182084224 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2024-07-11 10:38:36 - INFO - pointllm.model.pointllm - Using PointBERT.
2024-07-11 10:38:36 - INFO - stdout - Loading PointBERT config from /data2/2023/yzy/PointLLM/pointllm/model/pointbert/CHANGE_ME!.yaml.
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: Traceback (most recent call last):
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: File "/data2/2023/yzy/PointLLM/pointllm/train/train_mem.py", line 13, in <module>
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: train()
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: File "/data2/2023/yzy/PointLLM/pointllm/train/train.py", line 112, in train
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: model = PointLLMLlamaForCausalLM.from_pretrained(
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: File "/data1/anaconda3/envs/yzy_pointllm/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2493, in from_pretrained
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: model = cls(config, *model_args, **model_kwargs)
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: File "/data2/2023/yzy/PointLLM/pointllm/model/pointllm.py", line 186, in __init__
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: self.model = PointLLMLlamaModel(config)
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: File "/data2/2023/yzy/PointLLM/pointllm/model/pointllm.py", line 41, in __init__
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: point_bert_config = cfg_from_yaml_file(point_bert_config_addr)
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: File "/data2/2023/yzy/PointLLM/pointllm/utils.py", line 38, in cfg_from_yaml_file
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: with open(cfg_file, 'r') as f:
2024-07-11 10:38:36 - ERROR - stderr - [rank3]: FileNotFoundError: [Errno 2] No such file or directory: '/data2/2023/yzy/PointLLM/pointllm/model/pointbert/CHANGE_ME!.yaml'
[rank0, rank2, and rank1 fail with the identical FileNotFoundError traceback]
W0711 10:38:41.638000 140480182084224 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 699543 closing signal SIGTERM
W0711 10:38:41.638000 140480182084224 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 699545 closing signal SIGTERM
E0711 10:38:41.868000 140480182084224 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 699542) of binary: /data1/anaconda3/envs/yzy_pointllm/bin/python
[torchrun then raises torch.distributed.elastic.multiprocessing.errors.ChildFailedError through the same launcher traceback as above]

pointllm/train/train_mem.py FAILED

Failures:
[1]:
  time       : 2024-07-11_10:38:41
  host       : nuosen
  rank       : 3 (local_rank: 3)
  exitcode   : 1 (pid: 699546)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2024-07-11_10:38:41
  host       : nuosen
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 699542)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

RunsenXu commented 2 weeks ago
[screenshot]

The error messages show this; maybe you changed something.

Also, please format your messages so they are comfortable to read before posting...

KisAaki commented 2 weeks ago

Sorry, I'll pay attention to it.

KisAaki commented 2 weeks ago

Hello! Sorry to bother you again. I checked the files under PointLLM/pointllm/model/pointbert and did not find a file named CHANGE_ME!.yaml. To rule out errors caused by my own modifications, I re-cloned the repo from git and ran the experiment again, but I get the same error. From the code, the PointBERT config should be read from PointTransformer_base_8192point.yaml under PointLLM/pointllm/model/pointbert, so why does this error occur? The only file I have modified substantially is scripts/PointLLM_train_stage1.sh: with the default dir_path=PointLLM it reported that the path could not be found, so I changed the path in that sh file to an absolute path. [screenshot] Best, KisAaki.

MathewCrespo commented 2 weeks ago

> Thanks for your comment. We have 7 GPUs, all NVIDIA GeForce RTX 4090, each with 24564 MiB of memory. I can't find where to set the devices. [...] Here is the info when using CUDA_LAUNCH_BLOCKING=1; I hope it helps:
> [quoted error log omitted: the same "RuntimeError: CUDA error: invalid device ordinal" traceback and torchrun ChildFailedError summary as above]

I have checked all the previous issues and carefully read the docs, but got the same problem here.

I set CUDA_LAUNCH_BLOCKING=1 and still got the error. The error occurs when it comes time to save the model. I monitored GPU memory usage and found that GPU 0 runs out of memory in this phase; when saving weights, the main GPU gathers all the weights, which leads to the memory problem (I guess...). I only have A100 40G for training. Any suggestions for solving this problem?

RunsenXu commented 2 weeks ago

> (Quoting the comment above) I checked the files under PointLLM/pointllm/model/pointbert and did not find CHANGE_ME!.yaml. [...] The PointBERT config should be read from PointTransformer_base_8192point.yaml, so why does this error occur?

I am sorry, my README.md misses an important step. After you download the PointLLM_7B_v1.1 checkpoint, you should modify the config.json and specify the yaml to be used here.

https://huggingface.co/RunsenXu/PointLLM_7B_v1.1_init/blob/369d67dd9d6f5df9e9226750463783d1ced18232/config.json#L24

Usually, you should set:

"point_backbone_config_name": "PointTransformer_8192point_2layer"

I have updated the config file. Sorry for the confusion.
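
As a quick sanity check (the checkpoint path below is a placeholder; point it at wherever you downloaded PointLLM_7B_v1.1_init):

```bash
# The config should name an existing PointBERT yaml (without the .yaml suffix)
grep point_backbone_config_name /path/to/PointLLM_7B_v1.1_init/config.json
# expected: "point_backbone_config_name": "PointTransformer_8192point_2layer"

# The matching yaml file should be present in the repo
ls pointllm/model/pointbert/
```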

RunsenXu commented 2 weeks ago

> I have checked all the previous issues and carefully read the docs, but got the same problem here. I set CUDA_LAUNCH_BLOCKING=1 and still got the error. The error occurs when it comes time to save the model. [...] I only have A100 40G for training. Any suggestions for solving this problem?
> [quoted from the comment above; error log omitted]

As your problem seems different, could you please open another issue and show your training script and complete error messages?

KisAaki commented 2 weeks ago

Hello! Thank you for your patient answers, and for tolerating my font issue :( However, my experiment still has some problems. Our lab's GPUs have 24564 MiB of memory, which is not enough for the stage-1 training task. Is there any way for me to continue the experiment, for example by changing certain parameters to reduce memory usage, or by spreading a single process across multiple cards? I noticed in the paper that you ran the experiments on 8 A100 80G GPUs. Given that I use the checkpoint, how do my GPU's performance and memory size affect the experiment, and what benefits does the checkpoint actually give me? Sorry for asking so many questions; I have only just started working on combining 3D and LLMs, and there is a lot I don't yet understand :(

ERROR - stderr - [rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU has a total capacity of 23.65 GiB of which 103.06 MiB is free. Including non-PyTorch memory, this process has 23.53 GiB memory in use. Of the allocated memory 23.15 GiB is allocated by PyTorch, and 4.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

RunsenXu commented 2 weeks ago

It's a bit challenging to run training with 24G of memory. You would need to distribute the transformer layers across multiple GPUs and use float16 for training; a single 24G GPU is likely not enough.

Another promising option is to swap the LLM we used (LLaMA-7B) for a smaller one such as Phi-2: https://huggingface.co/microsoft/phi-2

Also, you can replace the attention implementation with xFormers, which is more memory-efficient. You can refer to https://github.com/haotian-liu/LLaVA/blob/main/scripts/pretrain_xformers.sh
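
For instance, a rough sketch of memory-saving knobs to try first, assuming the stage-1 script forwards standard HuggingFace TrainingArguments flags (the flag names below are the stock transformers ones; the exact arguments in scripts/PointLLM_train_stage1.sh may differ):

```bash
# Half precision, gradient checkpointing, and a small per-GPU batch with
# gradient accumulation all reduce peak memory at the cost of speed.
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6 torchrun --nproc_per_node=6 pointllm/train/train_mem.py \
    --fp16 True \
    --gradient_checkpointing True \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16
    # ...plus the model, data, and output arguments from the original script
```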