amineKammah / ensimag-sdtd


Integrating Kafka #5

Open amineKammah opened 3 years ago

amineKammah commented 3 years ago

Background: For now, the Spark job runs on top of Kubernetes. Python reads about 35 images from the local disk, each image is processed with Tesseract through Spark, and the results are collected in a Python list that is printed to stdout. Instead of reading these images from the local disk, the goal is to read them from a Kafka topic.

Done:

To do:

Logs:

amineKammah commented 3 years ago

Setup tutorial: https://medium.com/@JinnaBalu/kafka-cluster-on-amezon-eks-cluster-5850d67ae723

amineKammah commented 3 years ago

We are trying to implement a real-time application. However, processing each image as soon as it is received would not let us benefit from Spark. We want to group a number of images before processing them, so that we can take advantage of distributed computing. The initial implementation gathers up to 100 images in a list and, once the list is full, sends them to the ocr_service. If a timeout elapses before all 100 images have arrived, it sends however many images have been collected so far.
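A rough sketch of the batching idea, not the code from the repo. It assumes the Kafka consumer pushes raw image bytes into an in-process `queue.Queue`, and that the timeout is counted from the moment a batch starts being collected. `collect_batch`, `max_size` and `timeout_s` are made-up names for illustration.

```python
import queue
import time
from typing import List


def collect_batch(source: queue.Queue, max_size: int = 100,
                  timeout_s: float = 5.0) -> List[bytes]:
    """Gather up to max_size items, or fewer if timeout_s elapses first."""
    batch: List[bytes] = []
    # Assumption: one deadline for the whole batch, started when collection begins.
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout reached: return whatever was collected
        try:
            # Block only for the time left before the deadline.
            batch.append(source.get(timeout=remaining))
        except queue.Empty:
            break  # nothing arrived before the deadline
    return batch


if __name__ == "__main__":
    # Stand-in for the Kafka consumer: 5 fake "images" already waiting.
    images = queue.Queue()
    for i in range(5):
        images.put(f"image-{i}".encode())

    full = collect_batch(images, max_size=3, timeout_s=0.2)
    partial = collect_batch(images, max_size=3, timeout_s=0.2)
    print(len(full), len(partial))
```

In the real pipeline each returned batch would be handed to the ocr_service, and an empty batch would simply be skipped. Using one deadline per batch keeps the worst-case latency bounded by `timeout_s`, at the cost of sometimes sending small batches.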