DrSleep / tensorflow-deeplab-resnet

DeepLab-ResNet rebuilt in TensorFlow
MIT License

How much GPU memory is required? #39

Closed Fansiee closed 7 years ago

Fansiee commented 7 years ago

My GPU is a TITAN X (Pascal) with 12 GiB. When I train the model with `python train.py`, I get the warning: "Ran out of memory trying to allocate 3.19GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available." Does this mean the model actually needs about 15 GiB of GPU memory? Even though the allocator ran out of memory, training keeps running, and the performance does not seem to be hurt.

Have you ever met this kind of problem?
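Incidentally, the log below shows TensorFlow creating contexts on all four GPUs even though training runs on one device. This is standard CUDA/TensorFlow behavior, not specific to this repo; restricting device visibility before launching avoids reserving memory on the other cards. A minimal sketch:

```shell
# Expose only GPU 0 to TensorFlow; the other three cards stay untouched.
export CUDA_VISIBLE_DEVICES=0
# then launch training as usual:
# python train.py
```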

```
cv@cv:~/tf/tf-deeplab-resnet$ python train.py
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:02:00.0
Total memory: 11.90GiB
Free memory: 11.61GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x54536c0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:03:00.0
Total memory: 11.90GiB
Free memory: 11.76GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x54574e0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 2 with properties:
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:82:00.0
Total memory: 11.90GiB
Free memory: 11.76GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x545b300
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 3 with properties:
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:83:00.0
Total memory: 11.90GiB
Free memory: 11.76GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: TITAN X (Pascal), pci bus id: 0000:03:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: TITAN X (Pascal), pci bus id: 0000:82:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: TITAN X (Pascal), pci bus id: 0000:83:00.0)
Restored model parameters from /home/cv/tf/tf-deeplab-resnet/deeplab_resnet_init.ckpt
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.19GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.19GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
The checkpoint has been created.
step 0 loss = 4.333, (12.162 sec/step)
step 1 loss = 3.095, (1.228 sec/step)
step 2 loss = 3.916, (0.890 sec/step)
step 3 loss = 3.019, (0.836 sec/step)
step 4 loss = 3.412, (0.824 sec/step)
step 5 loss = 2.240, (0.909 sec/step)
step 6 loss = 2.528, (0.962 sec/step)
```
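As a rough sanity check (my own back-of-envelope, not a measurement from this repo), activation tensors alone make multi-GiB allocations plausible at this batch size. The function below is a hypothetical helper that just multiplies out the size of one float32 feature map:

```python
# Back-of-envelope estimate of the memory one conv feature map occupies.
# Purely illustrative: the layer shape below is an assumption, not taken
# from the actual DeepLab-ResNet graph.
def feature_map_bytes(batch, height, width, channels, dtype_bytes=4):
    """Bytes needed for a single float32 feature-map tensor."""
    return batch * height * width * channels * dtype_bytes

# e.g. batch 10, 321x321 inputs, a hypothetical 256-channel map at full
# input resolution:
mb = feature_map_bytes(10, 321, 321, 256) / 2**20
print(f"{mb:.0f} MiB")  # prints "1006 MiB" -- about 1 GiB for one tensor
```

Summed over the dozens of layers (plus gradients and optimizer state) in ResNet-101, this is why the BFC allocator asks for chunks in the GiB range and why it can fall back to smaller chunks without failing.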

DrSleep commented 7 years ago

Yes, this warning pops up from time to time when I try to increase the batch size, but it doesn't affect the model's performance.

In my case, when running on a Titan X (12 GB), it doesn't produce any warnings with the default parameters (batch_size=10, image_size=321) and takes 11708 MB of GPU memory.
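Worth noting: by default TensorFlow's BFC allocator reserves almost all free GPU memory up front, so a reading like 11708 MB reflects what was grabbed, not necessarily what the model strictly needs. A minimal sketch of the TF 1.x-era session option (matching the API version in the logs above; this is a general TensorFlow setting, not something train.py exposes) that makes the allocation grow on demand instead:

```python
import tensorflow as tf

# Let the allocator start small and grow as needed, rather than
# reserving nearly all free GPU memory at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Pass the config when building the training session:
sess = tf.Session(config=config)
```

With allow_growth enabled, the memory reported by nvidia-smi tracks actual usage more closely, at the cost of possible fragmentation.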