NVlabs / eg3d


CUDA memory error #52

Open diamond0910 opened 2 years ago

diamond0910 commented 2 years ago

I am training on 8 × 24 GB TITAN RTX cards. Strangely, when I set the batch size to 16 on 4 cards, training works fine, but when I set the batch size to 32 on 8 cards, the following error is reported. Why?

```
Traceback (most recent call last):
  File "python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "eg3d/train.py", line 52, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "eg3d/training/training_loop.py", line 285, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg)
  File "eg3d/training/loss.py", line 121, in accumulate_gradients
    gen_img, _gen_ws = self.run_G(gen_z, gen_c, swapping_prob=swapping_prob, neural_rendering_resolution=neural_rendering_resolution)
  File "eg3d/training/loss.py", line 70, in run_G
    gen_output = self.G.synthesis(ws, c, neural_rendering_resolution=neural_rendering_resolution, update_emas=update_emas)
  File "eg3d/training/triplane.py", line 89, in synthesis
    sr_image = self.superresolution(rgb_image, feature_image, ws, noise_mode=self.rendering_kwargs['superresolution_noise_mode'],
        **{k: synthesis_kwargs[k] for k in synthesis_kwargs.keys() if k != 'noise_mode'})
  File "python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "eg3d/training/superresolution.py", line 289, in forward
    x, rgb = self.block1(x, rgb, ws, **block_kwargs)
  File "python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "eg3d/training/networks_stylegan2.py", line 448, in forward
    x = self.conv1(x, next(w_iter), fused_modconv=fused_modconv, **layer_kwargs)
  File "python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "eg3d/training/networks_stylegan2.py", line 329, in forward
    x = bias_act.bias_act(x, self.bias.to(x.dtype), act=self.activation, gain=act_gain, clamp=act_clamp)
  File "eg3d/torch_utils/ops/bias_act.py", line 87, in bias_act
    return _bias_act_cuda(dim=dim, act=act, alpha=alpha, gain=gain, clamp=clamp).apply(x, b)
  File "eg3d/torch_utils/ops/bias_act.py", line 152, in forward
    y = _plugin.bias_act(x, b, _null_tensor, _null_tensor, _null_tensor, 0, dim, spec.cuda_idx, alpha, gain, clamp)
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 23.70 GiB total capacity; 7.11 GiB already allocated; 21.81 MiB free; 7.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
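For this kind of OOM, two things are commonly worth trying first: lowering the per-GPU microbatch so the total batch of 32 is accumulated in smaller chunks (assuming eg3d's train.py exposes the StyleGAN3-style --batch / --batch-gpu options), and acting on the allocator hint at the end of the error message. A minimal sketch of the latter, with max_split_size_mb:128 as an illustrative value rather than a recommendation from the repo:

```python
# Sketch: set the allocator option suggested by the error message.
# It must be set before PyTorch initializes CUDA (e.g. at the very top of train.py
# or in the shell before launching), otherwise it has no effect.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the env var so the allocator picks it up

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU 0: {props.name}, {props.total_memory / 2**30:.1f} GiB total")
```

This only mitigates fragmentation; if the model genuinely needs more than ~24 GB per card at that per-GPU batch, reducing the microbatch (or the neural rendering resolution) is the more reliable fix.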

LoickCh commented 2 years ago

Hi, can you take a look at #58?

Michaelsqj commented 2 years ago

I was also using 8 × 24 GB GPUs, and I could set the total batch size to 64 without running into memory issues. The dataset I used was ShapeNet cars.

Michaelsqj commented 2 years ago

I just reproduced your error on another machine. It turned out that nvcc wasn't correctly installed there. I'm not sure whether that's also the cause of your problem.
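Since the failing call at the bottom of the trace is eg3d's JIT-compiled bias_act plugin, a missing or broken nvcc can surface as confusing runtime errors rather than a clean build failure. A quick sanity check of the toolchain PyTorch sees, using only standard PyTorch/stdlib calls (the expected values are machine-dependent):

```python
# Sketch: verify that the CUDA toolchain needed to JIT-compile eg3d's custom ops
# (bias_act, upfirdn2d) is visible to PyTorch on this machine.
import shutil
import torch
from torch.utils.cpp_extension import CUDA_HOME

print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("CUDA_HOME:", CUDA_HOME)            # None means PyTorch cannot locate the toolkit
print("nvcc on PATH:", shutil.which("nvcc"))
```

If CUDA_HOME is None or nvcc is not on PATH, fixing the CUDA toolkit installation (or pointing CUDA_HOME at it) is worth doing before tuning batch sizes.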