Open amineKammah opened 3 years ago
We are trying to implement a real-time application, but processing each image individually as it arrives would not let us benefit from Spark. We want to group a number of images before processing them so we can take advantage of distributed computing. The initial idea is to gather 100 images in a list and, once the list is full, send them to the ocr_service. If a timeout elapses before all 100 images arrive, send however many images have been collected so far.
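The count-plus-timeout batching described above could be sketched roughly like this (a minimal sketch; `poll_fn` is a hypothetical callable standing in for whatever source yields one image at a time, returning `None` when nothing is available yet):

```python
import time


def collect_batch(poll_fn, batch_size=100, timeout_s=5.0):
    """Gather up to `batch_size` images, flushing early once `timeout_s` elapses.

    `poll_fn` is a placeholder for the real image source: it returns one
    image payload (e.g. bytes) or None when no image is currently available.
    The returned list may hold fewer than `batch_size` items if the
    timeout fires first.
    """
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < batch_size and time.monotonic() < deadline:
        img = poll_fn()
        if img is not None:
            batch.append(img)
    return batch
```

The caller would then hand each returned list to the ocr_service, so a full batch of 100 is sent immediately while a partial batch is sent at the latest after `timeout_s`.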
Background: For now, the Spark job runs on top of Kubernetes. Python reads about 35 images locally; each image is processed with Tesseract via Spark, and the results are collected into a Python list that is printed to stdout. Instead of reading these images from the local disk, the goal is to read them from a Kafka topic.
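The consume-then-batch flow could look something like the sketch below. It is deliberately library-agnostic: `messages` could be a kafka-python `KafkaConsumer` (whose records expose the payload via `.value`, so you would iterate over `(m.value for m in consumer)`) or any iterable of raw bytes, and `ocr_fn` is a hypothetical stand-in for the Spark/Tesseract job that OCRs a whole batch at once:

```python
def process_stream(messages, ocr_fn, batch_size=100):
    """Consume an iterable of image payloads and OCR them in fixed-size batches.

    `messages`: any iterable of raw image payloads (e.g. Kafka message values).
    `ocr_fn`: placeholder for the distributed OCR step; takes a list of
    payloads and returns a list of results.
    A final partial batch is flushed after the stream ends.
    """
    batch = []
    results = []
    for payload in messages:
        batch.append(payload)
        if len(batch) >= batch_size:
            results.extend(ocr_fn(batch))
            batch = []
    if batch:  # flush whatever is left over
        results.extend(ocr_fn(batch))
    return results
```

Batching this way keeps the per-job overhead of Spark amortized over many images instead of paying it once per image.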
Done:
To do:
Logs: