Open amineKammah opened 3 years ago
We are trying to implement a real-time application, but processing each image individually as it arrives would not let us benefit from Spark. We want to group a number of images before processing them so we can take advantage of distributed computing. The initial idea is to gather 100 images in a list and, once the list is full, send them to the ocr_service. If a timeout elapses before all 100 images arrive, send however many images have been collected so far.
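The count-plus-timeout batching described above could be sketched roughly like this (a minimal sketch; `poll_fn` is a hypothetical callable standing in for whatever source yields one image at a time, returning `None` when nothing is available yet):

```python
import time


def collect_batch(poll_fn, batch_size=100, timeout_s=5.0):
    """Gather up to `batch_size` images, flushing early once `timeout_s` elapses.

    `poll_fn` is a placeholder for the real image source: it returns one
    image payload (e.g. bytes) or None when no image is currently available.
    The returned list may hold fewer than `batch_size` items if the
    timeout fires first.
    """
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < batch_size and time.monotonic() < deadline:
        img = poll_fn()
        if img is not None:
            batch.append(img)
    return batch
```

The caller would then hand each returned list to the ocr_service, so a full batch of 100 is sent immediately while a partial batch is sent at the latest after `timeout_s`.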
Background: For now, the Spark job runs on top of Kubernetes. Python reads about 35 images locally; each image is processed with Tesseract via Spark, and the results are collected into a Python list that is printed to stdout. Instead of reading these images from the local disk, the goal is to read them from a Kafka topic.
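The consume-then-batch flow could look something like the sketch below. It is deliberately library-agnostic: `messages` could be a kafka-python `KafkaConsumer` (whose records expose the payload via `.value`, so you would iterate over `(m.value for m in consumer)`) or any iterable of raw bytes, and `ocr_fn` is a hypothetical stand-in for the Spark/Tesseract job that OCRs a whole batch at once:

```python
def process_stream(messages, ocr_fn, batch_size=100):
    """Consume an iterable of image payloads and OCR them in fixed-size batches.

    `messages`: any iterable of raw image payloads (e.g. Kafka message values).
    `ocr_fn`: placeholder for the distributed OCR step; takes a list of
    payloads and returns a list of results.
    A final partial batch is flushed after the stream ends.
    """
    batch = []
    results = []
    for payload in messages:
        batch.append(payload)
        if len(batch) >= batch_size:
            results.extend(ocr_fn(batch))
            batch = []
    if batch:  # flush whatever is left over
        results.extend(ocr_fn(batch))
    return results
```

Batching this way keeps the per-job overhead of Spark amortized over many images instead of paying it once per image.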
Done:
To do:
Logs: