Closed: ihorizons2022 closed this issue 1 year ago
The default config consumes ~24 GB of GPU memory, so it will OOM on a smaller GPU like the ~15 GiB card shown in the log below. Please also see #4 for discussion on lowering the memory consumption.
Please see the FAQ section for how to adjust the hyperparameters to reduce the memory footprint.
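For reference, a minimal sketch of a lower-memory launch along the lines of the FAQ. The override syntax, key names, values, and config path below are assumptions drawn from the repository's documented launch command and FAQ hyperparameters; verify the exact names and defaults against the FAQ and your generated config before using them.

# Sketch only: lower the ray batch and hash-grid size to reduce memory use.
torchrun --nproc_per_node=1 train.py \
    --config=projects/neuralangelo/configs/custom/toy_example.yaml \
    --logdir=logs/toy_example \
    --show_pbar \
    --model.render.rand_rays=256 \
    --model.object.sdf.encoding.hashgrid.dict_size=21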
2023-08-13 07:32:27.550330: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-13 07:32:28.516815: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Training with 1 GPUs.
Using random seed 0
Make folder logs/2023_0813_0732_30_toy_example
wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
model parameter count: 366,706,268
Initialize model weights using type: none, gain: None
Using random seed 0
Allow TensorFloat32 operations on supported devices
Train dataset length: 29
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Val dataset length: 4
Training from scratch.
Initialize wandb
Evaluating: 0% 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Evaluating with 4 samples.
Traceback (most recent call last):
File "/content/neuralangelo/train.py", line 104, in <module>
main()
File "/content/neuralangelo/train.py", line 93, in main
trainer.train(cfg,
File "/content/neuralangelo/projects/neuralangelo/trainer.py", line 106, in train
super().train(cfg, data_loader, single_gpu, profile, show_pbar)
File "/content/neuralangelo/projects/nerf/trainers/base.py", line 115, in train
super().train(cfg, data_loader, single_gpu, profile, show_pbar)
File "/content/neuralangelo/imaginaire/trainers/base.py", line 503, in train
self.train_step(data, last_iter_in_epoch=(it == len(data_loader) - 1))
File "/content/neuralangelo/imaginaire/trainers/base.py", line 446, in train_step
self.scaler.scale(total_loss).backward()
File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/usr/local/lib/python3.10/dist-packages/tinycudann/modules.py", line 116, in backward
input_grad, params_grad = _module_function_backward.apply(ctx, doutput, input, params, output)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/tinycudann/modules.py", line 129, in forward
params_grad = null_tensor_like(params) if params_grad is None else (params_grad / ctx_fwd.loss_scale)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 698.00 MiB (GPU 0; 14.75 GiB total capacity; 13.21 GiB already allocated; 244.81 MiB free; 14.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 33279) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
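The allocator hint at the end of the OOM message can also be tried. This is a standard PyTorch caching-allocator setting, not anything Neuralangelo-specific, and it only mitigates fragmentation rather than lowering total usage, so on its own it will not fit the ~24 GB default config onto a ~15 GiB GPU; the 512 MiB split size is an illustrative choice.

# Cap the size of cached allocator blocks to reduce fragmentation (value is tunable).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Then relaunch training as before, ideally together with the FAQ hyperparameter changes.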