faustomorales / keras-ocr

A packaged and flexible version of the CRAFT text detector and Keras CRNN recognition model.
https://keras-ocr.readthedocs.io/
MIT License

recognizer possible bug #149

Open VasilisStavrianoudakis opened 3 years ago

VasilisStavrianoudakis commented 3 years ago

Hello,

Let me start by saying thank you for this great pipeline.

I have noticed something strange in the get_batch_generator function of the recognizer: with a batch size of 2, for example, the image_generator is advanced 3 times per batch. I believe this line is the cause:

https://github.com/faustomorales/keras-ocr/blob/71fbec8c163ae035dfb89a8b936ac48385bb7482/keras_ocr/recognition.py#L362

I have also created a toy example:

import random

def gen():
    while True:
        print("Generator got called")
        yield random.random()

r_gen = gen()
batch_size = 2

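# Approach 1: zip with the generator first (the pattern used in recognition.py)
b = [sample for sample, _ in zip(r_gen, range(batch_size))]
print(b)

print("=" * 100)

# Approach 2: call next() exactly batch_size times
b = [next(r_gen) for _ in range(batch_size)]
print(b)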

The output is:

Generator got called
Generator got called
Generator got called
[0.4160141123512153, 0.8948171240884449]
===================================
Generator got called
Generator got called
[0.8689812892217589, 0.13292716281754136]

I am not sure if this is a bug. In any case, I wanted to ask you if this is the expected behavior. Maybe the second approach (without the zip) is the correct one?
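For reference, the extra call follows from how Python's built-in zip works: it pulls from its arguments left to right, so on the third iteration it takes a value from the generator before it discovers that range(batch_size) is exhausted, and that value is dropped. The sketch below is only an illustration, not the library's code; counting_gen and the log lists are hypothetical helpers that make the number of consumed items visible. It compares the original pattern with two alternatives that consume exactly batch_size items.

```python
import itertools

def counting_gen(log):
    """Hypothetical infinite generator that records every value it yields in `log`."""
    for i in itertools.count():
        log.append(i)
        yield i

batch_size = 2

# Generator first: zip pulls a third item before it sees that range is exhausted.
log_a = []
batch_a = [sample for sample, _ in zip(counting_gen(log_a), range(batch_size))]

# range first: zip stops on range before touching the generator a third time.
log_b = []
batch_b = [sample for _, sample in zip(range(batch_size), counting_gen(log_b))]

# itertools.islice: takes exactly batch_size items.
log_c = []
batch_c = list(itertools.islice(counting_gen(log_c), batch_size))

print(batch_a, len(log_a))  # [0, 1] 3
print(batch_b, len(log_b))  # [0, 1] 2
print(batch_c, len(log_c))  # [0, 1] 2
```

All three variants return the same batch; they differ only in how far the underlying generator has advanced afterwards.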

Thank you again!

VasilisStavrianoudakis commented 3 years ago

I realized that I did not explain why this may be a bug.

Let's say that you have 4 training samples: [img1, img2, img3, img4], with batch_size=2 and epochs=2. Then steps_per_epoch = len(training_data) / batch_size = 2.

Epoch 1:
Step 1/2:
batch = [img1, img2]. The image_generator is then advanced one more time, so img3 is consumed and discarded.
Step 2/2:
batch = [img4, img1]. Again the image_generator is advanced one more time, so img2 is consumed and discarded.

Because the last image consumed in Epoch 1 was img2, Epoch 2 starts at img3:

Epoch 2:
Step 1/2:
batch = [img3, img4]. One more image is consumed and discarded -> img1
Step 2/2:
batch = [img2, img3]. One more image is consumed and discarded -> img4

The main problem is that the model may not see all of the available data within one epoch (in the example above, img3 never appears in a batch during Epoch 1). The other problem is that the batches do not contain the same data from one epoch to the next.
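The walk-through above can be reproduced with a small simulation. This is a sketch under stated assumptions: itertools.cycle stands in for the image_generator and loops over the four images in a fixed order (the real generator may shuffle or augment), and simulate, take_with_zip, and take_with_islice are hypothetical names, not part of keras-ocr.

```python
import itertools

def simulate(take_batch, data, batch_size, epochs):
    """Return the batches drawn in each epoch from a generator cycling over `data`."""
    stream = itertools.cycle(data)  # assumed stand-in for image_generator
    steps_per_epoch = len(data) // batch_size
    return [
        [take_batch(stream, batch_size) for _ in range(steps_per_epoch)]
        for _ in range(epochs)
    ]

def take_with_zip(stream, n):
    # Pattern from the issue: consumes n + 1 items and drops the last one.
    return [sample for sample, _ in zip(stream, range(n))]

def take_with_islice(stream, n):
    # Alternative: consumes exactly n items.
    return list(itertools.islice(stream, n))

data = ["img1", "img2", "img3", "img4"]
zip_epochs = simulate(take_with_zip, data, batch_size=2, epochs=2)
islice_epochs = simulate(take_with_islice, data, batch_size=2, epochs=2)

for name, result in [("zip", zip_epochs), ("islice", islice_epochs)]:
    for epoch, batches in enumerate(result, start=1):
        print(name, "epoch", epoch, batches)
# zip epoch 1 [['img1', 'img2'], ['img4', 'img1']]
# zip epoch 2 [['img3', 'img4'], ['img2', 'img3']]
# islice epoch 1 [['img1', 'img2'], ['img3', 'img4']]
# islice epoch 2 [['img1', 'img2'], ['img3', 'img4']]
```

Under these assumptions the zip variant reproduces the drifting batches described above, while the islice variant covers every image once per epoch.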

Did I miss something?