Here is the log when I train PIPs on a single RTX 3090 GPU (24GB) with Batch Size 4:
nohup: ignoring input
model_name 16hv_8_768_I4_5e-4_A_debug_15:28:35
loading FlyingThingsDataset...
..................................................found 13085 samples in ../datasets/flyingthings (dset=TRAIN, subset=all, version=ad)
loading occluders...
..................................................found 7856 occluders in ../datasets/flyingthings (dset=TRAIN, subset=all, version=al)
not using augs in val
loading FlyingThingsDataset...
..................................................found 2542 samples in ../datasets/flyingthings (dset=TEST, subset=all, version=ad)
loading occluders...
..................................................found 1631 occluders in ../datasets/flyingthings (dset=TEST, subset=all, version=al)
warning: sampling failed
warning: sampling failed
Traceback (most recent call last):
File "/home/xingzhenghao/PycharmProjects/pips/train.py", line 421, in <module>
Fire(main)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/xingzhenghao/PycharmProjects/pips/train.py", line 324, in main
total_loss, metrics = run_model(model, sample, device, I, horz_flip, vert_flip, sw_t, is_train=True)
File "/home/xingzhenghao/PycharmProjects/pips/train.py", line 82, in run_model
preds, preds_anim, vis_e, stats = model(trajs_g[:,0], rgbs, coords_init=None, iters=I, trajs_g=trajs_g, vis_g=vis_g, valids=valids, sw=sw, is_train=is_train)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xingzhenghao/PycharmProjects/pips/nets/pips.py", line 443, in forward
fmaps_ = self.fnet(rgbs_)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xingzhenghao/PycharmProjects/pips/nets/pips.py", line 265, in forward
b = self.layer2(a)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xingzhenghao/PycharmProjects/pips/nets/pips.py", line 175, in forward
y = self.relu(self.norm2(self.conv2(y)))
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/instancenorm.py", line 72, in forward
return self._apply_instance_norm(input)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/instancenorm.py", line 32, in _apply_instance_norm
return F.instance_norm(
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/functional.py", line 2483, in instance_norm
return torch.instance_norm(
RuntimeError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 0; 23.68 GiB total capacity; 20.64 GiB already allocated; 204.12 MiB free; 20.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
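A note on the allocator hint at the end of that error: max_split_size_mb only helps when reserved memory greatly exceeds allocated memory (fragmentation); it does not shrink the model's actual footprint, so it is unlikely to rescue this run on its own. A minimal sketch of setting it, assuming the variable is in place before the first CUDA allocation (the value 128 is illustrative, not a recommendation from the PIPs authors):

```python
import os

# The allocator option suggested by the OOM message must be in the environment
# before the first CUDA allocation; 128 MiB is only an example split size.
# (Equivalent shell form: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train.py)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

x = torch.zeros(1, device="cuda")  # first CUDA allocation; the setting is now in effect
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```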
Here is the log when I train PIPs on a single RTX 3090 GPU (24GB) with Batch Size 1:
nohup: ignoring input
model_name 4hv_8_768_I4_5e-4_A_debug_22:00:29
loading FlyingThingsDataset...
..................................................found 13085 samples in ../datasets/flyingthings (dset=TRAIN, subset=all, version=ad)
loading occluders...
..................................................found 7856 occluders in ../datasets/flyingthings (dset=TRAIN, subset=all, version=al)
not using augs in val
loading FlyingThingsDataset...
..................................................found 2542 samples in ../datasets/flyingthings (dset=TEST, subset=all, version=ad)
loading occluders...
..................................................found 1631 occluders in ../datasets/flyingthings (dset=TEST, subset=all, version=al)
Traceback (most recent call last):
File "/home/xingzhenghao/PycharmProjects/pips/train.py", line 421, in <module>
Fire(main)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/xingzhenghao/PycharmProjects/pips/train.py", line 324, in main
total_loss, metrics = run_model(model, sample, device, I, horz_flip, vert_flip, sw_t, is_train=True)
File "/home/xingzhenghao/PycharmProjects/pips/train.py", line 82, in run_model
preds, preds_anim, vis_e, stats = model(trajs_g[:,0], rgbs, coords_init=None, iters=I, trajs_g=trajs_g, vis_g=vis_g, valids=valids, sw=sw, is_train=is_train)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/xingzhenghao/anaconda3/envs/pips/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xingzhenghao/PycharmProjects/pips/nets/pips.py", line 503, in forward
fcp = torch.zeros((B,S,N,H8,W8), dtype=torch.float32, device=device) # B,S,N,H8,W8
RuntimeError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 23.68 GiB total capacity; 20.31 GiB already allocated; 250.75 MiB free; 20.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
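For a sense of why this allocation is so large and why reducing N helps (as suggested in the reply below): the tensor at nets/pips.py line 503 grows linearly with the number of particles N. A back-of-the-envelope sketch, where H8 and W8 (the feature resolution, input size divided by 8) are placeholder values rather than the repo's actual crop size:

```python
# Rough size of the tensor that fails to allocate:
#   fcp = torch.zeros((B, S, N, H8, W8), dtype=torch.float32, device=device)
# H8/W8 below are placeholders for illustration, not PIPs' real feature size.
B, S = 1, 8              # batch size and frames per clip (per the log above)
H8, W8 = 48, 64          # hypothetical feature-map height/width (input / 8)

for n in (768, 384, 256, 128):
    mib = B * S * n * H8 * W8 * 4 / 2**20   # float32 = 4 bytes per element
    print(f"N={n}: {mib:.0f} MiB for this tensor alone")
```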
OK great, these issues look solvable.
> I always see this warning: warning: sampling failed.

This is OK! As mentioned in the readme, you can probably just ignore it.
The current reference model was trained on very big GPUs, with 80G of memory. 80G is not necessary to train a good model, but it did help me avoid issues like the ones you're facing here.
There are a few things that you can do to reduce memory:
- horz_flip=False
- vert_flip=False
- N=128 (or any number really, but smaller than 768)

I think if you choose N=256, you will be able to keep the flips True and train on your 4 2080s with B=1. Due to the flips, you will get an effective batch size of 4, and each GPU will process 256 particles (see the sketch below).
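To make the "effective batch size of 4" concrete: with both flips enabled, each training step processes the original clip plus horizontally, vertically, and doubly flipped copies, which nn.DataParallel then splits across the GPUs. A minimal sketch of the idea (a hypothetical helper, not the repo's run_model, and it only flips the RGB tensor; the real code also flips the ground-truth trajectories):

```python
import torch

def flip_augment(rgbs, horz_flip=True, vert_flip=True):
    # rgbs: (B, S, C, H, W). Concatenate flipped copies along the batch dim,
    # so B=1 with both flips becomes an effective batch of 4.
    batch = [rgbs]
    if horz_flip:
        batch.append(torch.flip(rgbs, dims=[-1]))       # mirror width
    if vert_flip:
        batch.append(torch.flip(rgbs, dims=[-2]))       # mirror height
    if horz_flip and vert_flip:
        batch.append(torch.flip(rgbs, dims=[-2, -1]))   # both axes
    return torch.cat(batch, dim=0)

rgbs = torch.zeros(1, 8, 3, 368, 496)   # B=1, S=8; the resolution is illustrative
print(flip_augment(rgbs).shape)         # torch.Size([4, 8, 3, 368, 496])
```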
Thank you very much for your kind assistance, Dr. Harley! It works now.
Hi, Dr. Harley. May I know how many A100 GPUs you used here? Thanks!
8 GPUs for the best model.
Hi Dr. Harley,
I have met some problems when training PIPs with the default settings. I always see this warning: warning: sampling failed. I tried to train on (1) a single RTX 3090 GPU (24GB) with python train.py, and (2) four RTX 2080 GPUs (11GB each) with python train.py --horz_flip=True --vert_flip=True --device_ids=[0,1,2,3], but both failed with CUDA out of memory. I tried to change the Batch Size to 1, but it seems that did not help.
Thank you very much for your support!
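As a side note for anyone hitting the same wall: before launching train.py it can be worth confirming how much memory each GPU actually has free, in case another process is already holding part of the 24GB (or 11GB) cards. A small diagnostic sketch, assuming a PyTorch version that provides torch.cuda.mem_get_info:

```python
import torch

# Print free/total memory for every visible GPU before starting training.
for i in range(torch.cuda.device_count()):
    free_b, total_b = torch.cuda.mem_get_info(i)   # (free, total) in bytes
    print(f"cuda:{i} {torch.cuda.get_device_name(i)}: "
          f"{free_b / 2**30:.1f} GiB free / {total_b / 2**30:.1f} GiB total")
```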