I get a CUDA out-of-memory error during training. How should I solve this problem? I've already tried setting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32, but the problem persists. The stack trace is as follows:
Traceback (most recent call last):
File "train.py", line 104, in
main()
File "train.py", line 93, in main
trainer.train(cfg,
File "/mnt/OpenSource/neuralangelo-main/neuralangelo/projects/neuralangelo/trainer.py", line 107, in train
super().train(cfg, data_loader, single_gpu, profile, show_pbar)
File "/mnt/OpenSource/neuralangelo-main/neuralangelo/projects/nerf/trainers/base.py", line 115, in train
super().train(cfg, data_loader, single_gpu, profile, show_pbar)
File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/trainers/base.py", line 520, in train
self.end_of_epoch(data, current_epoch + 1, current_iteration)
File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/trainers/base.py", line 365, in end_of_epoch
self.checkpointer.save(current_epoch, current_iteration)
File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/trainers/base.py", line 578, in save
save_dict = to_cpu(self._collect_state_dicts())
File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/utils/misc.py", line 120, in to_cpu
return to_device(data, 'cpu')
File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/utils/misc.py", line 98, in to_device
return type(data)({key: to_device(data[key], device) for key in data})
File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/utils/misc.py", line 98, in
return type(data)({key: to_device(data[key], device) for key in data})
File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/utils/misc.py", line 98, in to_device
return type(data)({key: to_device(data[key], device) for key in data})
File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/utils/misc.py", line 98, in
return type(data)({key: to_device(data[key], device) for key in data})
File "/mnt/OpenSource/neuralangelo-main/neuralangelo/imaginaire/utils/misc.py", line 95, in to_device
data = data.to(device, non_blocking=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 734) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
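For reference, this is roughly how I set the allocator config and launch training. Only the environment variable is the part in question; the torchrun arguments below are just a sketch of my command, with placeholders for my actual config and log directory.

```
# Set the allocator config before launching so every worker process inherits it.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32

# Single-GPU launch; the train.py arguments are placeholders for my setup.
torchrun --nproc_per_node=1 train.py --config=<my_config>.yaml --logdir=<my_logdir>
```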