ZJULearning / pixel_link

Implementation of our paper 'PixelLink: Detecting Scene Text via Instance Segmentation' in AAAI2018
MIT License

tf.contrib.slim.learning.train() relationship between batch size, steps, and epochs? #136

Closed mattroos closed 5 years ago

mattroos commented 5 years ago

I'm having trouble finding answers to this in the TF documentation or code. What is the definition of a 'step' in the context of tf.contrib.slim.learning.train()? The ICDAR2015 dataset has 1000 training images. Does a step mean that 1000 images were processed (an 'epoch,' in most people's terminology)? Or that a single batch (e.g., 3*24 in the Pixel Link paper) was processed? Or something else?

Relatedly, if I'm using a single GPU with IMG_PER_GPU set to 16, then the batch size will be 16. The pretrained model was first trained for 100 steps with a learning rate of 1e-3 and a batch size of 72. How many steps should I set with my single GPU and batch size of 16 to train on the equivalent number of images during this initial learning-rate phase of training?

mattroos commented 5 years ago

Speculating on an answer to my own question: based on the docstring for the train() function, a step is a gradient step, i.e., one update of the parameters based on the loss for a single batch. So to get training roughly equivalent to 100 steps on 3 GPUs at 24 images per GPU, using only 1 GPU at 16 images per GPU, we'd need to execute (3*24)/(1*16)*100 = 450 steps. In that case we'll have trained on the same number of samples (7,200 images) as in the 3-GPU case. Of course, results could be quite different, since we'll have made 4.5x as many gradient updates (steps), albeit with noisier gradients (in some sense, due to the smaller batch size).
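For anyone wanting to redo this arithmetic with their own setup, here is a minimal sketch in plain Python. It assumes, as speculated above, that one step is one gradient update on one batch, and that the effective batch size is the number of GPUs times the images per GPU. The function names `equivalent_steps` and `epochs_covered` are made up for illustration and are not part of pixel_link or TF-Slim.

```python
def equivalent_steps(ref_steps, ref_gpus, ref_img_per_gpu, gpus, img_per_gpu):
    """Steps needed to see as many images as the reference schedule.

    Assumes one step = one gradient update on one batch of
    (number of GPUs * images per GPU) images.
    """
    ref_batch = ref_gpus * ref_img_per_gpu   # e.g. 3 * 24 = 72
    batch = gpus * img_per_gpu               # e.g. 1 * 16 = 16
    images_seen = ref_steps * ref_batch      # total images in the reference run
    # Round up so we never train on fewer images than the reference run.
    return (images_seen + batch - 1) // batch


def epochs_covered(steps, batch_size, dataset_size):
    """Number of passes over the dataset made in the given number of steps."""
    return steps * batch_size / dataset_size


# Reference: 100 steps on 3 GPUs at 24 images each; ours: 1 GPU at 16 images.
steps = equivalent_steps(100, 3, 24, 1, 16)
print(steps)                                  # 450

# With the 1000 ICDAR2015 training images, both runs cover 7.2 epochs.
print(epochs_covered(steps, 16, 1000))        # 7.2
print(epochs_covered(100, 72, 1000))          # 7.2
```

This only matches the number of images seen. It does not account for the difference in gradient noise or the larger number of updates, so the learning rate may still need adjusting.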