faustomorales / keras-ocr

A packaged and flexible version of the CRAFT text detector and Keras CRNN recognition model.
https://keras-ocr.readthedocs.io/
MIT License

Filling up RAM #244

Open v-artur opened 9 months ago

v-artur commented 9 months ago

I am using keras-ocr to extract the number of words, and the words themselves, from images.

My dataset has around 58,000 images. I feed them into the pipeline one by one in a for loop and append the results to two lists along the way. The images are stored on Colab's disk.

For some reason, after about 2,500-3,000 images, Colab's 12 GB of RAM is completely used up.

Because of this, after every 1000 iterations I dump the current lists to a JSON file in my Drive.

Then I restart the kernel and load the lists back from the JSON backups; reloading them barely touches the RAM.
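For reference, the reload after a restart is just this (a minimal sketch; `output_path` and the filenames/keys match the dump code below):

```python
import glob
import json

# Rebuild the two running lists from the periodic JSON dumps.
# Note: sorted() is lexicographic, so zero-pad the counters in the
# filenames if the chunks need to stay in order.
list_of_wc, list_of_words = [], []
for path in sorted(glob.glob(output_path + "kerasocr_*.json")):
    with open(path) as f:
        saved = json.load(f)
    list_of_wc.extend(saved["num_of_words"])
    list_of_words.extend(saved["the_words"])
```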

It's quite annoying to restart the kernel after every ~2.5k images, and gc.collect() does nothing.

Is it my code, or does the OCR pipeline fill up RAM despite every attempt to clear it?

Here are the important parts of the code:

```python
from time import time
import json

import cv2
import keras_ocr

pipeline = keras_ocr.pipeline.Pipeline()

t0 = time()
list_of_wc = []
list_of_words = []

output_path = "..."

for k, img in enumerate(image_df.image_name):
    if k % 500 == 0:
        print(time() - t0, k)

    img_cv = cv2.imread(r"..." + img, cv2.IMREAD_COLOR)
    image_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)

    prediction_groups = pipeline.recognize([image_rgb])

    # prediction_groups[0] holds the (word, box) tuples for the single image
    list_of_wc.append(len(prediction_groups[0]))
    list_of_words.append([word for word, box in prediction_groups[0]])

    # first of the periodic dumps (repeated every 1000 iterations)
    if k == 999:
        with open(output_path + 'kerasocr_1000.json', 'w') as f:
            json.dump({'num_of_words': list_of_wc, 'the_words': list_of_words}, f)
```
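If the leak can't be plugged in-process, one workaround that would avoid the manual restarts is to run each chunk in a fresh worker process, so that whatever TensorFlow allocates is returned to the OS when the process exits. A rough, untested sketch (`image_df`, `output_path` and the elided paths are the placeholders from the snippet above):

```python
import json
import multiprocessing as mp

def process_chunk(image_names, json_path):
    # Import the TensorFlow-heavy packages inside the worker so that
    # everything they allocate dies with the process.
    import cv2
    import keras_ocr

    pipeline = keras_ocr.pipeline.Pipeline()
    word_counts, words = [], []
    for name in image_names:
        img_cv = cv2.imread(r"..." + name, cv2.IMREAD_COLOR)
        image_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
        predictions = pipeline.recognize([image_rgb])[0]
        word_counts.append(len(predictions))
        words.append([word for word, box in predictions])
    with open(json_path, "w") as f:
        json.dump({"num_of_words": word_counts, "the_words": words}, f)

# On Linux/Colab the default "fork" start method is fine here because
# TensorFlow is only imported inside the worker, never in the parent.
names = list(image_df.image_name)
chunk_size = 1000  # small enough that a single chunk never exhausts RAM
for i in range(0, len(names), chunk_size):
    p = mp.Process(target=process_chunk,
                   args=(names[i:i + chunk_size],
                         output_path + f"kerasocr_{i + chunk_size}.json"))
    p.start()
    p.join()  # everything the chunk allocated is released here
```

The obvious cost is that the detector and recognizer weights are reloaded once per chunk, but that is seconds per 1000 images compared to a manual kernel restart.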