barikata1984 opened this issue 1 year ago
Hi @barikata1984! Sorry for the delayed reply here - I suspect this is due to a configuration change (we set "high quality" as the new default): https://github.com/NVIDIAGameWorks/kaolin-wisp/commit/99639ae60de4d1c6f4f721e3b6d1004e258afa5b#diff-0e84d1aed551f592a75f92bacc6eed1545bdaeb03042d1fb2f6aa17343e5db8bR46
Can you try with a reduced sample-per-ray count?
python app/nerf/main_nerf.py --dataset-path /path/to/lego/ --config app/nerf/configs/nerf_hash.yaml --tracer.num_steps 512
I've also tracked all config updates here: https://kaolin-wisp.readthedocs.io/en/latest/pages/config_system.html#converting-older-configs-up-to-wisp-v1-0-2
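For a sense of what that flag controls: --tracer.num_steps is the per-ray sample budget, so the number of samples the tracer shades each iteration scales roughly with rays-per-batch times steps-per-ray. A minimal sketch of that arithmetic, assuming the 4096 rays-per-batch value that SampleRays reports in the config dump further down this thread:
rays_per_batch = 4096  # SampleRays num_samples from the printed config below
for num_steps in (512, 256, 128, 64, 32, 16):
    samples = rays_per_batch * num_steps
    print(f"num_steps={num_steps:3d} -> ~{samples:,} ray samples per iteration")
Under these assumptions, dropping from 512 to 16 steps cuts the per-iteration sample count from roughly 2.1M to roughly 65k, which is why it is the first knob to try when memory is tight.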
Hi @orperel,
Thanks for your response.
I reduced the sample-per-ray count from 512 down to 16, halving it each time, but the process still got killed.
It looks like something happens when running train_dataset = instantiate(cfg.dataset, transform=dataset_transform) in main_nerf.py.
To see where it happens, I added print statements. In app/nerf/main_nerf.py:
+ print("Instantiating dataset_transform")
dataset_transform = instantiate(cfg.dataset_transform) # SampleRays creates batches of rays from the dataset
+ print("Instantiating train_dataset")
train_dataset = instantiate(cfg.dataset, transform=dataset_transform) # A Multiview dataset
and in wisp/config/utils.py:
+ print("================= Flag 0 =================")
instance = instantiate(config, **overriden_args)
+ print("================= Flag 1 =================")
The output is:
$ python app/nerf/main_nerf.py --dataset-path /path/to/lego/ --config app/nerf/configs/nerf_hash.yaml --tracer.num_steps 16
blas
constructor: OctreeAS.make_dense
level: 7
grid
constructor: HashGrid.from_geometric
feature_dim: 2
num_lods: 16
multiscale_type: cat
feature_std: 0.01
feature_bias: 0.0
codebook_bitwidth: 19
min_grid_res: 16
max_grid_res: 2048
nef
constructor: NeuralRadianceField
pos_embedder: none
view_embedder: positional
pos_multires: 10
view_multires: 4
position_input: False
activation_type: relu
layer_type: linear
hidden_dim: 64
num_layers: 1
prune_density_decay: 0.6
prune_min_density: 2.956033378250884
tracer
constructor: PackedRFTracer
raymarch_type: ray
num_steps: 16
step_size: 1.0
bg_color: black
dataset
constructor: NeRFSyntheticDataset
dataset_path: ../nerf_data/lego/
split: train
bg_color: white
mip: 0
dataset_num_workers: -1
transform: None
dataset_transform
constructor: SampleRays
num_samples: 4096
trainer
optimizer
constructor: RMSprop
lr: 0.001
alpha: 0.99
eps: 1e-08
weight_decay: 0.0
momentum: 0.0
dataloader
batch_size: 1
num_workers: 0
exp_name: nerf-hash
mode: train
max_epochs: 100
save_every: -1
save_as_new: False
model_format: full
render_every: -1
valid_every: -1
enable_amp: True
profile_nvtx: True
grid_lr_weight: 100.0
prune_every: 100
random_lod: False
rgb_lambda: 1.0
tracker
tensorboard
constructor: _Tensorboard
log_dir: _results/logs/runs
wandb
constructor: _WandB
project: wisp-nerf
entity: None
run_name: None
job_type: train
sync_tensorboard: True
visualizer
constructor: OfflineRenderer
render_res: (1024, 1024)
render_batch: 10000
shading_mode: rb
matcap_path: ./data/matcap/Pearl.png
shadow: False
ao: False
perf: False
vis_camera
camera_origin: (-3.0, 0.65, -3.0)
camera_lookat: (0.0, 0.0, 0.0)
camera_fov: 30.0
camera_clamp: (0.0, 10.0)
viz360_num_angles: 20
viz360_radius: 3.0
viz360_render_all_lods: False
enable_tensorboard: True
enable_wandb: False
log_dir: _results/logs/runs
log_level: 20
pretrained: None
device: cuda
interactive: True
Instantiating dataset_transform
================= Flag 1 =================
================= Flag 2 =================
Instantiating train_dataset
================= Flag 1 =================
loading data: 100%|████████████████████████████████████████████████████████| 100/100 [00:03<00:00, 30.43it/s]
/home/atsushi/miniconda3/envs/wisp/lib/python3.9/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Killed
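For reference, a minimal sketch of how resident memory could be logged around those two calls in main_nerf.py; psutil is assumed to be installed, and cfg and instantiate are the objects already in scope in the script:
import psutil  # third-party; assumed installed via `pip install psutil`

def log_rss(tag):
    # Print this process's resident set size (RSS) in GB
    rss_gb = psutil.Process().memory_info().rss / 1024 ** 3
    print(f"[{tag}] RSS: {rss_gb:.2f} GB")

log_rss("before dataset_transform")
dataset_transform = instantiate(cfg.dataset_transform)
log_rss("before train_dataset")
train_dataset = instantiate(cfg.dataset, transform=dataset_transform)
log_rss("after train_dataset")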
Do you have any other ideas on how to resolve this issue?
Hi @barikata1984 thanks for this bug report.
I ran some memory profiling and indeed the main branch uses upwards of 14GB of resident memory at peak, which really shouldn't be the case.
I dug into the issue a bit and fixed some benign issues in https://github.com/NVIDIAGameWorks/kaolin-wisp/pull/164
Now the resident memory, at least according to my profiling, is 8GB (so a 6GB reduction). If you want further savings, I would pass in --valid-every -1 to disable validation, since the validation dataset takes around 3GB of memory.
Let me know if this works for you!
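As a sanity check on that ~3GB figure, a very rough back-of-the-envelope estimate lands in the same ballpark, assuming the 100 validation images of the synthetic lego scene at 800x800 are held as float32 RGBA plus per-pixel ray origins and directions (the actual storage layout in wisp may well differ):
# Rough estimate only; the exact layout in wisp may differ.
num_images = 100               # validation split of the synthetic lego scene
h = w = 800                    # image resolution at mip 0
bytes_per_float = 4            # float32

image_bytes = num_images * h * w * 4 * bytes_per_float        # RGBA values
ray_bytes = num_images * h * w * (3 + 3) * bytes_per_float    # ray origins + directions

print(f"~{(image_bytes + ray_bytes) / 1024 ** 3:.1f} GB")     # ~2.4 GB under these assumptions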
Hi @tovacinni, thanks a lot for the solution! As you suggested, --valid-every -1 worked, while the run with validation enabled still got killed due to RAM shortage. I will try again on a different PC with sufficient RAM.
Description
Hi,
I tried to run main_nerf.py from the main branch, but it suddenly stopped with a single-word line: Killed. This is presumably due to a RAM shortage (according to some googling); I checked the memory usage and it hit its limit immediately before the app stopped. Do you have any idea how to deal with this issue?
I followed all the installation procedures, including requirements_app.txt. main_nerf.py in the stable branch works without any problems, so if the config system is the only major change between the main and stable branches, the issue should be caused by the new config system. I suppose you can reproduce the higher RAM usage in your environment.
I installed pyopengl_accelerate separately because a message saying the module was missing appeared the first time I ran the stable main_nerf.py, but otherwise the conda env should be clean for running wisp apps.
I know the easiest solution is to add more RAM, but the stable config system works fine even with limited RAM, so it would be great if I could also use the new one on the same machine since it looks much cleaner.
Thanks in advance!
Machine spec
Reproduction steps
pip install pyopengl_accelerate
python app/nerf/main_nerf.py --dataset-path /path/to/lego/ --config app/nerf/configs/nerf_hash.yaml
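If reproducing this, one way to confirm that the kill really coincides with RAM exhaustion (rather than a crash) is to watch available memory from a second terminal while the command above runs. A minimal watcher sketch, assuming psutil is installed:
import time
import psutil  # assumed installed: pip install psutil

# Poll system memory once a second until interrupted (Ctrl+C).
while True:
    vm = psutil.virtual_memory()
    print(f"available: {vm.available / 1024 ** 3:.2f} GB ({vm.percent:.0f}% used)", flush=True)
    time.sleep(1)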