Belval / TextRecognitionDataGenerator

A synthetic data generator for text recognition
MIT License

Generator is very slow #120

Open Mohamed209 opened 4 years ago

Mohamed209 commented 4 years ago

I am trying to generate images using generate-from-strings, as I have a list of strings to generate from. The issue is that it is very slow: it generates about one image per second.

Belval commented 4 years ago

That is extremely slow. Can you post the command that you used?

Also, what hardware are you using?

Mohamed209 commented 4 years ago

Here's the script I am using to generate the data. I am running it on a very powerful cloud machine with 6 CPU cores and around 50 GB of RAM: https://github.com/Mohamed209/TextRecognitionDataGenerator/blob/receipts_ocr/generate_training_lines.py

Belval commented 4 years ago

I'll try to reproduce the issue on my side and report back soon.

Belval commented 4 years ago

Okay, so quickly:

You can run py-spy to profile the function calls and see what is taking the time.
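For reference, py-spy's usual invocations are `py-spy top -- python your_script.py` for a live view of where time is being spent, or `py-spy record -o profile.svg -- python your_script.py` to record a flame graph (the script name here is a placeholder; py-spy installs with `pip install py-spy`).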

Mohamed209 commented 4 years ago

As a workaround I used parallel processing to speed up generation. I gained some speed, but it is still not optimal: at this new rate, generating my dataset of around half a million images would take about 10 to 12 hours.

```python
from joblib import Parallel, delayed
from tqdm import tqdm

# save_lines, mixed_generator and english_generator are defined earlier in the script
if __name__ == "__main__":
    print("started generating arabic lines :)")
    Parallel(n_jobs=-1)(delayed(save_lines)(img, lbl)
                        for img, lbl in tqdm(mixed_generator))
    print("started generating english lines :)")
    Parallel(n_jobs=-1)(delayed(save_lines)(img, lbl)
                        for img, lbl in tqdm(english_generator))
```
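Note that, as written, this only parallelizes save_lines (the disk writes): the generator expressions are consumed in the parent process, so the image rendering itself stays serial and only the finished (img, lbl) pairs are shipped to the workers. Splitting the string list into shards and letting each worker create and drain its own generator would likely parallelize the rendering as well.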
Belval commented 4 years ago

Your initial comment was removed/edited away. But if you really generate images with 40 words per image, a speed of 11-13 imgs/sec is not that bad.

I can try to see if there is low-hanging fruit in the code, but since the project does a lot of image manipulation, I don't know if I will get a big improvement.

Mohamed209 commented 4 years ago

@Belval I am generating around 40 characters per image, not 40 words, and that number is the worst case; many of my text samples are much shorter than 40. I feel that when the string to be generated is short, processing is much faster, but in my case a string contains 10 to 20 characters on average, so rendering the data is slow. I will investigate this more in the next few days. Here is my new full script: https://github.com/Mohamed209/TextRecognitionDataGenerator/blob/receipts_ocr/generate_training_lines.py

Belval commented 4 years ago

I see. I never benchmarked each option, so maybe try removing one of these lines at a time and measure the impact on processing time:

distorsion_type=np.random.choice(distorsion_type),
skewing_angle=np.random.choice(skewing_angle),
blur=np.random.choice(blur),
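A minimal sketch of how those options could be timed in isolation, assuming trdg's `GeneratorFromStrings` with the keyword arguments above; the sample strings, the count and the option values are only illustrative:

```python
import time

from trdg.generators import GeneratorFromStrings

strings = ["sample receipt line"] * 100  # illustrative stand-in for the real corpus


def images_per_second(**options):
    # Generate 100 images with the given options and report throughput.
    generator = GeneratorFromStrings(strings, count=100, **options)
    start = time.perf_counter()
    for _img, _lbl in generator:
        pass
    return 100 / (time.perf_counter() - start)


print("baseline       :", images_per_second())
print("distorsion_type:", images_per_second(distorsion_type=3))
print("skewing_angle  :", images_per_second(skewing_angle=15, random_skew=True))
print("blur           :", images_per_second(blur=2, random_blur=True))
```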
Mohamed209 commented 4 years ago

Ok I will try

bzamecnik commented 3 years ago

One reason why the generate() function is slow is that it reloads the TF graph/session for each text sample! It can easily be rewritten as a class which initializes its own graph/session, loads the model once, and then only uses it for predictions. This can save some 1-2 s per invocation.
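(For context: the class below is a sketch of such a rewrite. It assumes it lives inside, or alongside, trdg's handwritten text generator module and reuses that module's existing imports and private helpers such as download_model_weights, _sample_text, _split_strokes, _cumsum, _crop_white_borders and _join_images, which are not repeated here.)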

class HandwrittenGenerator:

    def __init__(self):
        # Download/locate the pretrained handwriting model and its translation table once.
        base_dir = download_model_weights()
        model_dir = os.path.join(base_dir, "handwritten_model")
        path = os.path.join(model_dir, "translation.pkl")
        with open(path, "rb") as file:
            self.translation = pickle.load(file)

        # Build the graph and restore the checkpoint once; both are reused by every generate() call.
        self.graph = tf.Graph()
        self.session = tf.compat.v1.Session(graph=self.graph)
        with self.graph.as_default(), self.session.as_default():
            saver = tf.compat.v1.train.import_meta_graph(os.path.join(model_dir, "model-29.meta"))
            saver.restore(self.session, os.path.join(model_dir, "model-29"))

    def generate(self, text, text_color="black"):
        with self.graph.as_default(), self.session.as_default():
            images = []
            colors = [ImageColor.getrgb(c) for c in text_color.split(",")]
            c1, c2 = colors[0], colors[-1]

            color = "#{:02x}{:02x}{:02x}".format(
                rnd.randint(min(c1[0], c2[0]), max(c1[0], c2[0])),
                rnd.randint(min(c1[1], c2[1]), max(c1[1], c2[1])),
                rnd.randint(min(c1[2], c2[2]), max(c1[2], c2[2])),
            )

            for word in text.split(" "):
                _, window_data, kappa_data, stroke_data, coords = _sample_text(
                    self.session, word, self.translation
                )

                strokes = np.array(stroke_data)
                strokes[:, :2] = np.cumsum(strokes[:, :2], axis=0)
                _, maxx = np.min(strokes[:, 0]), np.max(strokes[:, 0])
                miny, maxy = np.min(strokes[:, 1]), np.max(strokes[:, 1])

                fig, ax = plt.subplots(1, 1)
                fig.patch.set_visible(False)
                ax.axis("off")

                for stroke in _split_strokes(_cumsum(np.array(coords))):
                    plt.plot(stroke[:, 0], -stroke[:, 1], color=color)

                fig.patch.set_alpha(0)
                fig.patch.set_facecolor("none")

                canvas = plt.get_current_fig_manager().canvas
                canvas.draw()

                s, (width, height) = canvas.print_to_buffer()
                image = Image.frombytes("RGBA", (width, height), s)
                mask = Image.new("RGB", (width, height), (0, 0, 0))

                images.append(_crop_white_borders(image))

                plt.close()

            return _join_images(images), mask

Then call as:

# initialize once - 1-2 s
generator = HandwrittenGenerator()
for text in texts:
    # < 1 s or more per call, depending on text length
    img, mask = generator.generate(text, 'black')
    # ...

As for running this in parallel, I'm afraid it would only help when the workload is IO-dominated (which it likely is); otherwise multiple TF sessions would compete for resources. Note that in Docker, TF detects the CPU core count of the host machine, not the container's quota, which may result in too many threads competing for limited resources. The available core count can be detected and set in the session config.
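A minimal sketch of capping those thread pools explicitly, using the same TF1-style session API as above (the thread count is just an illustrative value; in a container it should come from the actual CPU quota):

```python
import tensorflow as tf

# Cap TF's thread pools to the cores actually available to the container,
# rather than the host core count that TF detects by default.
num_threads = 6  # illustrative; derive this from the container's CPU quota
config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=num_threads,
    inter_op_parallelism_threads=num_threads,
)
session = tf.compat.v1.Session(config=config)
```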

The other reason is that it calls the TF session.run() for each stroke in a loop. I'm not sure if this can be improved to run the whole prediction at once.

Another thing is that there is no batching. E.g., for many texts we could perform the steps in parallel, but the code would get more complex.