Closed: rogerhcheng closed this issue 3 years ago
It was my fault: I was running the train_TSM_Something_v1.sh script as-is. The original batch size is 48; when I dropped it down to 36, training started working fine. It seems my GPU doesn't have quite as much memory as the author's GPU.
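For anyone who wants to keep the paper's effective batch size of 48 on a smaller card, gradient accumulation is another option besides simply lowering the batch size. Below is a minimal, self-contained sketch of the idea only; the tiny linear model and random tensors are placeholders for the TSM model and the Something-Something V1 loader that main_something.py actually builds, so none of these names come from the repository:

```python
import torch
import torch.nn as nn

# Hypothetical sketch (not the repository's code): accumulate gradients over
# two 24-sample micro-batches so each optimizer step matches a 48-sample batch.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 174).to(device)        # stand-in model; Something-V1 has 174 classes
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

micro_batch, accum_steps = 24, 2              # 24 * 2 = effective batch of 48
optimizer.zero_grad()
for step in range(8):                         # stands in for iterating a DataLoader
    inputs = torch.randn(micro_batch, 128, device=device)
    targets = torch.randint(0, 174, (micro_batch,), device=device)
    loss = criterion(model(inputs), targets) / accum_steps  # scale so summed grads match batch 48
    loss.backward()                                          # gradients accumulate until we step
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Whether this reproduces the paper's numbers exactly would still need to be verified, since BatchNorm statistics are computed per micro-batch.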
Did you ever manage to get this network to perform as accurately as the paper states?
I am trying to run the training on Something-Something V1 with an Nvidia RTX 3080 and the latest PyTorch Nvidia Docker image, and I get the out-of-memory error below. It is reproducible every time I run the training.
Any ideas? I know I am not using the exact same configuration as the original author, but I don't think I can downgrade, because the RTX 3080 doesn't support CUDA 9.0.
Thanks in advance.
pretrained_parts: finetune
group: first_conv_weight has 1 params, lr_mult: 1, decay_mult: 1
group: first_conv_bias has 0 params, lr_mult: 2, decay_mult: 0
group: normal_weight has 29 params, lr_mult: 1, decay_mult: 1
group: normal_bias has 1 params, lr_mult: 2, decay_mult: 0
group: BN scale/shift has 60 params, lr_mult: 1, decay_mult: 0
group: custom_ops has 0 params, lr_mult: 1, decay_mult: 1
group: lr5_weight has 0 params, lr_mult: 1, decay_mult: 1
group: lr10_bias has 0 params, lr_mult: 2, decay_mult: 0
100
No BN layer Freezing.
Traceback (most recent call last):
File "../main_something.py", line 442, in <module>
main()
File "../main_something.py", line 211, in main
temperature = train(train_loader, model, criterion, optimizer, epoch)
File "../main_something.py", line 273, in train
output = model(input_var, temperature)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in forward
return self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/Documents/MotionSqueeze/models.py", line 354, in forward
base_out = self.base_model(input_var, temperature)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/Documents/MotionSqueeze/resnet_TSM.py", line 430, in forward
flow_1, match_v = self.flow_computation(x, temperature=temperature)
File "/Documents/MotionSqueeze/resnet_TSM.py", line 406, in flow_computation
match = self.matching_layer(x_pre, x_post)  # (B*T-1*group, H*W, H*W)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl result = self.forward(*input, *kwargs) File "/Documents/MotionSqueeze/resnet_TSM.py", line 164, in forward corr = self.correlation_sampler(feature1, feature2) File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl result = self.forward(input, kwargs) File "/opt/conda/lib/python3.8/site-packages/spatial_correlation_sampler-0.3.0-py3.8-linux-x86_64.egg/spatial_correlation_sampler/spatial_correlation_sampler.py", line 105, in forward return SpatialCorrelationSamplerFunction.apply(input1, input2, self.kernel_size, File "/opt/conda/lib/python3.8/site-packages/spatial_correlation_sampler-0.3.0-py3.8-linux-x86_64.egg/spatial_correlation_sampler/spatial_correlation_sampler.py", line 66, in forward output = correlation.forward(input1, input2, RuntimeError: CUDA out of memory. Tried to allocate 228.00 MiB (GPU 0; 9.78 GiB total capacity; 8.19 GiB already allocated; 34.12 MiB free; 8.52 GiB reserved in total by PyTorch)