aharley / pips

Particle Video Revisited
MIT License

About training PIPs #14

Closed · HarryHsing closed this issue 1 year ago

HarryHsing commented 1 year ago

Hi Dr. Harley,

I have run into some problems when training PIPs with the default settings:

  1. I always see this warning: warning: sampling failed.
  2. I wonder how much GPU memory is needed. I tried to train PIPs with the default settings on (1) a single RTX 3090 GPU (24 GB) with python train.py, and (2) four RTX 2080 GPUs (11 GB each) with python train.py --horz_flip=True --vert_flip=True --device_ids=[0,1,2,3], but both runs failed with CUDA out of memory. I also tried reducing the batch size to 1, but that did not seem to help.

Thank you very much for your support!

HarryHsing commented 1 year ago

Here is the log when I train PIPs on a single RTX 3090 GPU (24 GB) with batch size 4:

nohup: ignoring input
model_name 16hv_8_768_I4_5e-4_A_debug_15:28:35
loading FlyingThingsDataset...
..................................................found 13085 samples in ../datasets/flyingthings (dset=TRAIN, subset=all, version=ad)
loading occluders...
..................................................found 7856 occluders in ../datasets/flyingthings (dset=TRAIN, subset=all, version=al)
not using augs in val
loading FlyingThingsDataset...
..................................................found 2542 samples in ../datasets/flyingthings (dset=TEST, subset=all, version=ad)
loading occluders...
..................................................found 1631 occluders in ../datasets/flyingthings (dset=TEST, subset=all, version=al)
warning: sampling failed
warning: sampling failed
Traceback (most recent call last):
  File "/home/xingzhenghao/PycharmProjects/pips/train.py", line 421, in <module>
    Fire(main)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/xingzhenghao/PycharmProjects/pips/train.py", line 324, in main
    total_loss, metrics = run_model(model, sample, device, I, horz_flip, vert_flip, sw_t, is_train=True)
  File "/home/xingzhenghao/PycharmProjects/pips/train.py", line 82, in run_model
    preds, preds_anim, vis_e, stats = model(trajs_g[:,0], rgbs, coords_init=None, iters=I, trajs_g=trajs_g, vis_g=vis_g, valids=valids, sw=sw, is_train=is_train)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xingzhenghao/PycharmProjects/pips/nets/pips.py", line 443, in forward
    fmaps_ = self.fnet(rgbs_)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xingzhenghao/PycharmProjects/pips/nets/pips.py", line 265, in forward
    b = self.layer2(a)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xingzhenghao/PycharmProjects/pips/nets/pips.py", line 175, in forward
    y = self.relu(self.norm2(self.conv2(y)))
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/instancenorm.py", line 72, in forward
    return self._apply_instance_norm(input)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/instancenorm.py", line 32, in _apply_instance_norm
    return F.instance_norm(
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/functional.py", line 2483, in instance_norm
    return torch.instance_norm(
RuntimeError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 0; 23.68 GiB total capacity; 20.64 GiB already allocated; 204.12 MiB free; 20.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Here is the log when I train PIPs on a single RTX 3090 GPU (24 GB) with batch size 1:

nohup: ignoring input
model_name 4hv_8_768_I4_5e-4_A_debug_22:00:29
loading FlyingThingsDataset...
..................................................found 13085 samples in ../datasets/flyingthings (dset=TRAIN, subset=all, version=ad)
loading occluders...
..................................................found 7856 occluders in ../datasets/flyingthings (dset=TRAIN, subset=all, version=al)
not using augs in val
loading FlyingThingsDataset...
..................................................found 2542 samples in ../datasets/flyingthings (dset=TEST, subset=all, version=ad)
loading occluders...
..................................................found 1631 occluders in ../datasets/flyingthings (dset=TEST, subset=all, version=al)
Traceback (most recent call last):
  File "/home/xingzhenghao/PycharmProjects/pips/train.py", line 421, in <module>
    Fire(main)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/xingzhenghao/PycharmProjects/pips/train.py", line 324, in main
    total_loss, metrics = run_model(model, sample, device, I, horz_flip, vert_flip, sw_t, is_train=True)
  File "/home/xingzhenghao/PycharmProjects/pips/train.py", line 82, in run_model
    preds, preds_anim, vis_e, stats = model(trajs_g[:,0], rgbs, coords_init=None, iters=I, trajs_g=trajs_g, vis_g=vis_g, valids=valids, sw=sw, is_train=is_train)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xingzhenghao/PycharmProjects/pips/nets/pips.py", line 503, in forward
    fcp = torch.zeros((B,S,N,H8,W8), dtype=torch.float32, device=device) # B,S,N,H8,W8
RuntimeError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 23.68 GiB total capacity; 20.31 GiB already allocated; 250.75 MiB free; 20.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
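As an aside, the max_split_size_mb suggestion in this error refers to PyTorch's PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch of setting it when launching training (128 is just an illustrative split size in MiB, not a recommended value):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train.py

This only mitigates allocator fragmentation, so it may not be enough on its own when allocated memory is already close to the 24 GiB capacity.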

aharley commented 1 year ago

OK great, these issues look solvable.

I always see this warning: warning: sampling failed.

This is OK! As mentioned in the readme, you can probably just ignore this.

The current reference model was trained on very big GPUs, with 80G of memory. 80G is not necessary to train a good model, but it did help me avoid issues like the ones you're facing here.

There are a few things that you can do to reduce memory:

  • horz_flip=False
  • vert_flip=False
  • N=128 (or any number really, but smaller than 768)

I think if you choose N=256, you will be able to keep the flips True and train on your four 2080s with B=1. Due to the flips, you will get an effective batch size of 4, and each GPU will process 256 particles.
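For concreteness, a launch command along those lines might look like the following. This is only a sketch: it assumes train.py exposes B and N as Fire arguments (the 768 in the model-name string 16hv_8_768_... above appears to be the default N), so check the actual argument names in train.py before using it.

python train.py --B=1 --N=256 --horz_flip=True --vert_flip=True --device_ids=[0,1,2,3]

The 16hv and 4hv prefixes in the model names above appear to reflect this effective batch size (B multiplied by 4 when both flips are enabled).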

HarryHsing commented 1 year ago

Thank you very much for your kind assistance, Dr. Harley! It works now.

HarryHsing commented 1 year ago

The current reference model was trained on very big GPUs, with 80G of memory.

Hi, Dr. Harley. May I know how many A100 GPUs you used here? Thanks!

aharley commented 1 year ago

8 GPUs for the best model.