Closed: ihorizons2022 closed this issue 1 year ago
The default config consumes ~24 GB of GPU memory, so it will OOM on a smaller GPU like the ~15 GiB card shown in the log below. Please also see #4 for discussion on lowering the memory consumption.
Please see the FAQ section for how to adjust the hyperparameters to reduce the memory footprint.
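For reference, a minimal sketch of a lower-memory launch along the lines of the FAQ. The override syntax, key names, values, and config path below are assumptions drawn from the repository's documented launch command and FAQ hyperparameters; verify the exact names and defaults against the FAQ and your generated config before using them.

# Sketch only: lower the ray batch and hash-grid size to reduce memory use.
torchrun --nproc_per_node=1 train.py \
    --config=projects/neuralangelo/configs/custom/toy_example.yaml \
    --logdir=logs/toy_example \
    --show_pbar \
    --model.render.rand_rays=256 \
    --model.object.sdf.encoding.hashgrid.dict_size=21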
2023-08-13 07:32:27.550330: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-13 07:32:28.516815: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Training with 1 GPUs.
Using random seed 0
Make folder logs/2023_0813_0732_30_toy_example
wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
model parameter count: 366,706,268
Initialize model weights using type: none, gain: None
Using random seed 0
Allow TensorFloat32 operations on supported devices
Train dataset length: 29
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Val dataset length: 4
Training from scratch.
Initialize wandb
Evaluating: 0% 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Evaluating with 4 samples.
Traceback (most recent call last):
File "/content/neuralangelo/train.py", line 104, in <module>
main()
File "/content/neuralangelo/train.py", line 93, in main
trainer.train(cfg,
File "/content/neuralangelo/projects/neuralangelo/trainer.py", line 106, in train
super().train(cfg, data_loader, single_gpu, profile, show_pbar)
File "/content/neuralangelo/projects/nerf/trainers/base.py", line 115, in train
super().train(cfg, data_loader, single_gpu, profile, show_pbar)
File "/content/neuralangelo/imaginaire/trainers/base.py", line 503, in train
self.train_step(data, last_iter_in_epoch=(it == len(data_loader) - 1))
File "/content/neuralangelo/imaginaire/trainers/base.py", line 446, in train_step
self.scaler.scale(total_loss).backward()
File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/usr/local/lib/python3.10/dist-packages/tinycudann/modules.py", line 116, in backward
input_grad, params_grad = _module_function_backward.apply(ctx, doutput, input, params, output)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/tinycudann/modules.py", line 129, in forward
params_grad = null_tensor_like(params) if params_grad is None else (params_grad / ctx_fwd.loss_scale)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 698.00 MiB (GPU 0; 14.75 GiB total capacity; 13.21 GiB already allocated; 244.81 MiB free; 14.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 33279) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
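The allocator hint at the end of the OOM message can also be tried. This is a standard PyTorch caching-allocator setting, not anything Neuralangelo-specific, and it only mitigates fragmentation rather than lowering total usage, so on its own it will not fit the ~24 GB default config onto a ~15 GiB GPU; the 512 MiB split size is an illustrative choice.

# Cap the size of cached allocator blocks to reduce fragmentation (value is tunable).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Then relaunch training as before, ideally together with the FAQ hyperparameter changes.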