MarvinTeichmann / tensorflow-fcn

An Implementation of Fully Convolutional Networks in Tensorflow.
MIT License

Out of Memory if the batch size is larger than 5 #11

Closed JackieLeeTHU11 closed 7 years ago

JackieLeeTHU11 commented 7 years ago

I use a GTX 1080 (8 GB). The images in the training dataset are 640*480. When the batch size is larger than 5, the GPU runs out of memory. The weights file is only 515 MB, so is something wrong? Second, training is very slow: with a batch size of 2, each iteration takes five seconds. Do you know the reason? Thanks! @MarvinTeichmann

MarvinTeichmann commented 7 years ago

When the batch size is larger than 5, the GPU runs out of memory. The weights file is only 515 MB, so is something wrong?

No, nothing is wrong here. Storing gradients during backprop needs a lot of memory. I usually train with a batch size of 1.
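
A rough, back-of-the-envelope illustration (my own numbers, not from the repo) of why this happens: the VGG encoder activations for a single 640x480 image already take a few hundred MB and have to be kept for the backward pass, so memory grows quickly with batch size.

    # Rough estimate of VGG-16 encoder activation memory for one 640x480 image.
    # Activations are kept for backprop, so this scales linearly with batch size.
    h, w = 480, 640
    blocks = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]  # (channels, #convs) per block

    total_bytes = 0
    for channels, n_convs in blocks:
        total_bytes += h * w * channels * 4 * n_convs  # float32 conv outputs
        h, w = h // 2, w // 2                          # 2x2 max pooling
        total_bytes += h * w * channels * 4            # pooled output

    print("~%.2f GB of encoder activations per image" % (total_bytes / 1024.0**3))
    # ~0.34 GB per image for the forward pass alone; gradients w.r.t. activations,
    # the full-resolution decoder score maps, optimizer state and cuDNN workspaces
    # come on top, so a batch of 5+ can easily exhaust 8 GB.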

Second, training is very slow: with a batch size of 2, each iteration takes five seconds.

I train images of size '1248 x 364' at a speed of '2.0 imgs/sec' using a Titan X, which should be only slightly faster than your card. Check whether you are bottlenecked by I/O (i.e. loading input images). To increase training speed on your GTX 1080 you could try using Cuda 8.0. Only Cuda 8.0 supports the new features of Pascal GPUs, so it should give you a nice speed-up on your 1080.
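
One quick way to check for an I/O bottleneck is to time data loading separately from the compute step. A minimal sketch, assuming a feed_dict-style loop; `load_next_batch`, `train_op` and the placeholders are placeholder names, not code from this repo:

    import time

    load_time, compute_time = 0.0, 0.0
    for step in range(100):
        t0 = time.time()
        batch_images, batch_labels = load_next_batch()   # your input pipeline
        t1 = time.time()
        sess.run(train_op, feed_dict={images: batch_images,
                                      labels: batch_labels})
        t2 = time.time()
        load_time += t1 - t0
        compute_time += t2 - t1

    print("loading: %.1fs  compute: %.1fs" % (load_time, compute_time))
    # If loading dominates, the GPU is starved and neither batch size
    # nor Cuda 8.0 will help much.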

JackieLeeTHU11 commented 7 years ago

Thanks for your reply! I am already using Cuda 8.0. My GPU usage fluctuates between 0 and 100% frequently, so something is probably wrong there. I also found that if I do not set allow_soft_placement=True, the code cannot place some ops on a device and throws errors, even though I set tf.device('/GPU:0'). @MarvinTeichmann

I found that ScalarSummary and HistogramSummary can only run on the CPU in this code, even though I set tf.device('/gpu:0') and allow_soft_placement=True. I think this is why training is slow and the GPU usage is unstable (jumping between 0 and 100%). Would you please help with this? Thanks a lot in advance!

    with tf.device('/gpu:0'):
        # VGG-16 encoder: five convolution blocks, each followed by max pooling
        self.conv1_1 = self._conv_layer(bgr, "conv1_1")
        self.conv1_2 = self._conv_layer(self.conv1_1, "conv1_2")
        self.pool1 = self._max_pool(self.conv1_2, 'pool1', debug)

        self.conv2_1 = self._conv_layer(self.pool1, "conv2_1")
        self.conv2_2 = self._conv_layer(self.conv2_1, "conv2_2")
        self.pool2 = self._max_pool(self.conv2_2, 'pool2', debug)

        self.conv3_1 = self._conv_layer(self.pool2, "conv3_1")
        self.conv3_2 = self._conv_layer(self.conv3_1, "conv3_2")
        self.conv3_3 = self._conv_layer(self.conv3_2, "conv3_3")
        self.pool3 = self._max_pool(self.conv3_3, 'pool3', debug)

        self.conv4_1 = self._conv_layer(self.pool3, "conv4_1")
        self.conv4_2 = self._conv_layer(self.conv4_1, "conv4_2")
        self.conv4_3 = self._conv_layer(self.conv4_2, "conv4_3")
        self.pool4 = self._max_pool(self.conv4_3, 'pool4', debug)

        self.conv5_1 = self._conv_layer(self.pool4, "conv5_1")
        self.conv5_2 = self._conv_layer(self.conv5_1, "conv5_2")
        self.conv5_3 = self._conv_layer(self.conv5_2, "conv5_3")
        self.pool5 = self._max_pool(self.conv5_3, 'pool5', debug)

        # VGG "fully connected" layers, implemented as convolutions
        self.fc6 = self._fc_layer(self.pool5, "fc6")

        if train:
            self.fc6 = tf.nn.dropout(self.fc6, 0.5)

        self.fc7 = self._fc_layer(self.fc6, "fc7")
        if train:
            self.fc7 = tf.nn.dropout(self.fc7, 0.5)

        if random_init_fc8:
            self.score_fr = self._score_layer(self.fc7, "score_fr",
                                              num_classes)
        else:
            self.score_fr = self._fc_layer(self.fc7, "score_fr",
                                           num_classes=num_classes,
                                           relu=False)

        self.pred = tf.argmax(self.score_fr, dimension=3)

        # FCN-8s decoder: upsample and fuse skip connections from pool4 and pool3
        self.upscore2 = self._upscore_layer(self.score_fr,
                                            shape=tf.shape(self.pool4),
                                            num_classes=num_classes,
                                            debug=debug, name='upscore2',
                                            ksize=4, stride=2)
        self.score_pool4 = self._score_layer(self.pool4, "score_pool4",
                                             num_classes=num_classes)
        self.fuse_pool4 = tf.add(self.upscore2, self.score_pool4)

        self.upscore4 = self._upscore_layer(self.fuse_pool4,
                                            shape=tf.shape(self.pool3),
                                            num_classes=num_classes,
                                            debug=debug, name='upscore4',
                                            ksize=4, stride=2)
        self.score_pool3 = self._score_layer(self.pool3, "score_pool3",
                                             num_classes=num_classes)
        self.fuse_pool3 = tf.add(self.upscore4, self.score_pool3)

        self.upscore32 = self._upscore_layer(self.fuse_pool3,
                                             shape=tf.shape(bgr),
                                             num_classes=num_classes,
                                             debug=debug, name='upscore32',
                                             ksize=16, stride=8)

        self.pred_up = tf.argmax(self.upscore32, dimension=3)
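
For reference, a minimal sketch of the session setup the soft-placement discussion above refers to (`allow_soft_placement` and `log_device_placement` are standard `tf.ConfigProto` fields in TF 1.x; the rest of the snippet is illustrative):

    import tensorflow as tf

    # Let ops without a GPU kernel (e.g. some summary ops) fall back to the CPU
    # instead of raising a placement error; optionally log where each op runs.
    config = tf.ConfigProto(allow_soft_placement=True,
                            log_device_placement=False)
    config.gpu_options.allow_growth = True  # allocate GPU memory as needed

    sess = tf.Session(config=config)
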
shaneahmed commented 7 years ago

If you are using the HDF file format or queue runners to read images with TensorFlow, increasing the batch size does not make things faster, because TensorFlow preloads the data before processing. This is why your GPU usage can be at 100% even with a batch size of 1. Try batch size 1, then 2, and keep incrementing; you will see that increasing the batch size does not make much difference. My suggestion would be to keep the batch size low, as there is no way to compensate for the limited amount of memory on your GPU.
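
A minimal sketch of the kind of queue-based input pipeline this refers to (TF 1.x queue-runner API; the file names and image size are placeholders):

    import tensorflow as tf

    # Background threads keep the queue filled, so the GPU rarely waits on disk.
    filename_queue = tf.train.string_input_producer(["img_0.png", "img_1.png"])
    reader = tf.WholeFileReader()
    _, value = reader.read(filename_queue)
    image = tf.image.resize_images(tf.image.decode_png(value, channels=3),
                                   [480, 640])
    image.set_shape([480, 640, 3])
    image_batch = tf.train.batch([image], batch_size=1,
                                 num_threads=4, capacity=32)

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        batch = sess.run(image_batch)  # already prefetched by the runner threads
        coord.request_stop()
        coord.join(threads)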

JackieLeeTHU11 commented 7 years ago

Thanks, and I will try your suggestions! @Shaanraza

MarvinTeichmann commented 7 years ago

My GPU usage fluctuates between 0 and 100% frequently, so something is probably wrong there.

Sounds like an I/O issue. Are you loading and preprocessing the data in parallel? If not, you can expect the GPU to sit idle while it waits for the next batch. If you are, try increasing the number of threads and possibly distributing the data across several hard drives. If the disk is the bottleneck, you can also try a faster SSD/HDD.
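
If the input pipeline is plain Python, one simple way to overlap loading with GPU work is a producer thread feeding a queue. A sketch, where `load_and_preprocess`, `train_op`, the placeholders and `max_steps` are assumed names:

    import threading
    try:
        import queue            # Python 3
    except ImportError:
        import Queue as queue   # Python 2

    batch_queue = queue.Queue(maxsize=8)

    def producer():
        while True:
            # Runs on the CPU while the GPU is busy with the previous batch.
            batch_queue.put(load_and_preprocess())

    for _ in range(4):  # several loader threads
        t = threading.Thread(target=producer)
        t.daemon = True
        t.start()

    for step in range(max_steps):
        batch_images, batch_labels = batch_queue.get()
        sess.run(train_op, feed_dict={images: batch_images,
                                      labels: batch_labels})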

ScalarSummary and HistogramSummary can only run on the CPU in this code

ScalarSummary and HistogramSummary are not processed unless a corresponding summary op (e.g. tf.merge_all_summaries) is run. You do not need to run the summary op every iteration; just run it every 100th iteration and you should be fine. I have not tried to put summary generation on the GPU, so I can't really help you with that, sorry.
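
A sketch of what that looks like in the training loop (using the older tf.* summary names mentioned above; `train_op`, `feed_dict`, `max_steps` and the log directory are assumed names):

    summary_op = tf.merge_all_summaries()                  # tf.summary.merge_all() in TF >= 1.0
    writer = tf.train.SummaryWriter('logs/', sess.graph)   # tf.summary.FileWriter in TF >= 1.0

    for step in range(max_steps):
        if step % 100 == 0:
            # Only evaluate the (CPU-placed) summary ops occasionally.
            _, summary = sess.run([train_op, summary_op], feed_dict=feed_dict)
            writer.add_summary(summary, step)
        else:
            sess.run(train_op, feed_dict=feed_dict)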

JackieLeeTHU11 commented 7 years ago

@MarvinTeichmann "You do not need to run summary_op every iteration. " Your suggestion really works and the GPU usage keeps from 90% to 100% . Thank you very much for your great work and patient explain! And I will try to loading and processing the data in parallel with different threads. I found the loss is easy to be nan if the learning rate is large than 1e-6.