NVlabs / neuralangelo

Official implementation of "Neuralangelo: High-Fidelity Neural Surface Reconstruction" (CVPR 2023)
https://research.nvidia.com/labs/dir/neuralangelo/

CUDA error: out of memory when running train.py #90

Closed lie12huo closed 1 year ago

lie12huo commented 1 year ago

I get a CUDA error while training on my data. How should I solve this problem? I've tried setting `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32`, but the problem persists. The stack trace is as follows:

```
Traceback (most recent call last):
  File "train.py", line 104, in <module>
    main()
  File "train.py", line 93, in main
    trainer.train(cfg,
  File "/mnt/OpenSource/neuralangelo-main/neuralangelo/projects/neuralangelo/trainer.py", line 107, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/mnt/OpenSource/neuralangelo-main/neuralangelo/projects/nerf/trainers/base.py", line 115, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/trainers/base.py", line 520, in train
    self.end_of_epoch(data, current_epoch + 1, current_iteration)
  File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/trainers/base.py", line 365, in end_of_epoch
    self.checkpointer.save(current_epoch, current_iteration)
  File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/trainers/base.py", line 578, in save
    save_dict = to_cpu(self._collect_state_dicts())
  File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/utils/misc.py", line 120, in to_cpu
    return to_device(data, 'cpu')
  File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/utils/misc.py", line 98, in to_device
    return type(data)({key: to_device(data[key], device) for key in data})
  File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/utils/misc.py", line 98, in <dictcomp>
    return type(data)({key: to_device(data[key], device) for key in data})
  File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/utils/misc.py", line 98, in to_device
    return type(data)({key: to_device(data[key], device) for key in data})
  File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/utils/misc.py", line 98, in <dictcomp>
    return type(data)({key: to_device(data[key], device) for key in data})
  File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/utils/misc.py", line 95, in to_device
    data = data.to(device, non_blocking=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 734) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
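For context on where the failure occurs: the trace shows the OOM is raised not during a training step but while checkpointing, as a recursive `to_device` helper walks the nested state dicts to move every tensor to the CPU. A simplified, pure-Python sketch of that recursion pattern (using a stand-in `FakeTensor` class instead of `torch.Tensor`, so this is illustrative only and not the repository's actual code):

```python
class FakeTensor:
    """Stand-in for torch.Tensor; only tracks which device it is 'on'."""
    def __init__(self, device="cuda"):
        self.device = device

    def to(self, device, non_blocking=False):
        # Real torch.Tensor.to allocates on the target device; this is
        # where the user's RuntimeError is raised.
        return FakeTensor(device)


def to_device(data, device):
    """Recursively move tensors inside nested dicts/lists to `device`
    (simplified version of the pattern visible in imaginaire/utils/misc.py)."""
    if hasattr(data, "to"):                      # tensor-like leaf
        return data.to(device, non_blocking=True)
    if isinstance(data, dict):                   # recurse into dicts
        return type(data)({k: to_device(v, device) for k, v in data.items()})
    if isinstance(data, (list, tuple)):          # recurse into sequences
        return type(data)(to_device(v, device) for v in data)
    return data                                  # plain values pass through


def to_cpu(data):
    return to_device(data, "cpu")


state = {"model": {"w": FakeTensor()}, "step": 7}
moved = to_cpu(state)
print(moved["model"]["w"].device)  # cpu
print(moved["step"])               # 7
```

Because the whole state dict is traversed at once, checkpointing briefly needs extra memory on top of the training footprint, which is why an already near-full GPU can fail exactly at this step.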

chenhsuanlin commented 1 year ago

@lie12huo please see the FAQ section in README, thanks!
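For readers landing here: the FAQ's remedy for GPU out-of-memory errors is to lower memory-heavy hyperparameters rather than tune the allocator. A sketch of such a launch with overrides, assuming the hash-grid `dict_size` and `rand_rays` config keys used in the repository's example configs (treat the exact key names and values as assumptions and check the README FAQ for the current recommendations):

```shell
# Hypothetical override example -- config keys/values are assumptions,
# not taken verbatim from this issue thread.
EXPERIMENT=toy_example
torchrun --nproc_per_node=1 train.py \
    --config=projects/neuralangelo/configs/custom/${EXPERIMENT}.yaml \
    --model.object.sdf.encoding.hashgrid.dict_size=20 \
    --model.render.rand_rays=512
```

Reducing the hash-grid dictionary size shrinks the encoding's parameter tables, and sampling fewer rays per iteration shrinks the per-step activation memory, at some cost in reconstruction quality and convergence speed.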

lie12huo commented 1 year ago

Thank you! I have solved the problem with this method.