amineKammah / ensimag-sdtd


Creating the data processing part #2

Open · amineKammah opened this issue 3 years ago

amineKammah commented 3 years ago

Progress log:

TO DOs:

amineKammah commented 3 years ago

Using Google's Tesseract OCR

Installation steps:

```sh
pip install pytesseract
pip install Pillow
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
```
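Once installed, a minimal usage sketch (the file name `sample.png` is illustrative, not from the project):

```python
# Minimal pytesseract sketch: run OCR on one image file.
# "sample.png" is a hypothetical path used only for illustration.
import pytesseract
from PIL import Image

image = Image.open("sample.png")
text = pytesseract.image_to_string(image)
print(text)
```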

amineKammah commented 3 years ago

Installing Kubernetes, Docker, and Spark on an EC2 Ubuntu instance:

amineKammah commented 3 years ago

Spark Image: https://hub.docker.com/repository/docker/kammahm/spark-py

amineKammah commented 3 years ago

Building the image:

```sh
sudo /usr/local/spark/bin/docker-image-tool.sh -r kammahm -p ~/ensimag-projects/sdtd/ensimag-sdtd/dockerfile build
```

amineKammah commented 3 years ago

We are dealing with images, which tend to be large, so sending them over the network introduces significant latency. Using Spark for a small number of images may therefore be very inefficient. The idea is to find the threshold (number of images) beyond which Spark brings a significant performance improvement over parallel computing on a single instance; a benchmark sketch follows below.
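One rough way to locate that threshold is to time both approaches on batches of increasing size. A minimal sketch, assuming `pytesseract` and PySpark are installed; the image paths, batch sizes, master URL, and app name are all illustrative:

```python
# Benchmark sketch: single-instance multiprocessing vs. Spark for OCR.
import time
from multiprocessing import Pool

import pytesseract
from PIL import Image
from pyspark.sql import SparkSession


def ocr(path):
    # Run Tesseract on one image and return the extracted text.
    return pytesseract.image_to_string(Image.open(path))


def bench_local(paths):
    # Parallel OCR on a single machine using a process pool.
    start = time.perf_counter()
    with Pool() as pool:
        pool.map(ocr, paths)
    return time.perf_counter() - start


def bench_spark(spark, paths):
    # Distributed OCR: ship the paths to executors and collect the text.
    start = time.perf_counter()
    spark.sparkContext.parallelize(paths).map(ocr).collect()
    return time.perf_counter() - start


if __name__ == "__main__":
    paths = [f"images/page_{i}.png" for i in range(100)]  # hypothetical dataset
    spark = SparkSession.builder.master("local[*]").appName("ocr-bench").getOrCreate()
    for n in (10, 50, 100):
        print(n, bench_local(paths[:n]), bench_spark(spark, paths[:n]))
    spark.stop()
```

With `local[*]` this only measures scheduling overhead, not network transfer; pointing the master at the cluster (with the images on storage the executors can reach) would make the comparison realistic.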

amineKammah commented 3 years ago

Understanding Partitioning in Spark: https://medium.com/parrot-prediction/partitioning-in-apache-spark-8134ad840b0
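As a companion to the article, a quick PySpark sketch of inspecting and changing the number of partitions (the counts are arbitrary):

```python
# Partitioning sketch: inspect, shrink, and grow partition counts.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("partitions").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1000), 8)  # start with 8 partitions
print(rdd.getNumPartitions())                  # 8
print(rdd.coalesce(4).getNumPartitions())      # 4: merges partitions, avoids a shuffle
print(rdd.repartition(16).getNumPartitions())  # 16: triggers a full shuffle

spark.stop()
```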

amineKammah commented 3 years ago

Spark dynamic allocation:

```
"spark.dynamicAllocation.enabled": "true"
"spark.dynamicAllocation.initialExecutors": "1"
"spark.dynamicAllocation.minExecutors": "2"
"spark.dynamicAllocation.maxExecutors": "3"
"spark.dynamicAllocation.executorIdleTimeout": "7000s"
"spark.shuffle.service.enabled": "true"
"spark.dynamicAllocation.shuffleTracking.enabled": "true"
"spark.network.timeout": "700s"
"spark.shuffle.registration.timeout": "6000s"
```
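For reference, one way these settings could be applied when building a PySpark session (the app name is illustrative; the values are copied from above). Note that `initialExecutors` is set below `minExecutors` here; Spark clamps the initial executor count up to at least `minExecutors`.

```python
# Sketch: applying the dynamic-allocation settings above via SparkSession.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sdtd-processing")  # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.initialExecutors", "1")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "3")
    .config("spark.dynamicAllocation.executorIdleTimeout", "7000s")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.network.timeout", "700s")
    .config("spark.shuffle.registration.timeout", "6000s")
    .getOrCreate()
)
```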