amineKammah / ensimag-sdtd


Creating the data processing part #2

Open · amineKammah opened this issue 3 years ago

amineKammah commented 3 years ago

Progress log:

TO DOs:

amineKammah commented 3 years ago

Using Google's Tesseract OCR

Installation steps:

```sh
pip install pytesseract
pip install Pillow
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
```
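Once installed, a minimal usage sketch (the file name `sample.png` is illustrative, not from the project):

```python
# Minimal pytesseract sketch: run OCR on one image file.
# "sample.png" is a hypothetical path used only for illustration.
import pytesseract
from PIL import Image

image = Image.open("sample.png")
text = pytesseract.image_to_string(image)
print(text)
```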

amineKammah commented 3 years ago

Installing Kubernetes, Docker, and Spark on an EC2 Ubuntu instance:

amineKammah commented 3 years ago

Spark Image: https://hub.docker.com/repository/docker/kammahm/spark-py

amineKammah commented 3 years ago

Building the image:

```sh
sudo /usr/local/spark/bin/docker-image-tool.sh -r kammahm -p ~/ensimag-projects/sdtd/ensimag-sdtd/dockerfile build
```

amineKammah commented 3 years ago

We are dealing with images, which tend to be large, so sending them over the network introduces significant latency. Using Spark for a small number of images may therefore be very inefficient. The idea is to find the threshold (number of images) beyond which Spark brings a significant performance improvement over parallel computing on a single instance; a benchmark sketch follows below.
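One rough way to locate that threshold is to time both approaches on batches of increasing size. A minimal sketch, assuming `pytesseract` and PySpark are installed; the image paths, batch sizes, master URL, and app name are all illustrative:

```python
# Benchmark sketch: single-instance multiprocessing vs. Spark for OCR.
import time
from multiprocessing import Pool

import pytesseract
from PIL import Image
from pyspark.sql import SparkSession


def ocr(path):
    # Run Tesseract on one image and return the extracted text.
    return pytesseract.image_to_string(Image.open(path))


def bench_local(paths):
    # Parallel OCR on a single machine using a process pool.
    start = time.perf_counter()
    with Pool() as pool:
        pool.map(ocr, paths)
    return time.perf_counter() - start


def bench_spark(spark, paths):
    # Distributed OCR: ship the paths to executors and collect the text.
    start = time.perf_counter()
    spark.sparkContext.parallelize(paths).map(ocr).collect()
    return time.perf_counter() - start


if __name__ == "__main__":
    paths = [f"images/page_{i}.png" for i in range(100)]  # hypothetical dataset
    spark = SparkSession.builder.master("local[*]").appName("ocr-bench").getOrCreate()
    for n in (10, 50, 100):
        print(n, bench_local(paths[:n]), bench_spark(spark, paths[:n]))
    spark.stop()
```

With `local[*]` this only measures scheduling overhead, not network transfer; pointing the master at the cluster (with the images on storage the executors can reach) would make the comparison realistic.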

amineKammah commented 3 years ago

Understanding Partitioning in Spark: https://medium.com/parrot-prediction/partitioning-in-apache-spark-8134ad840b0
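As a companion to the article, a quick PySpark sketch of inspecting and changing the number of partitions (the counts are arbitrary):

```python
# Partitioning sketch: inspect, shrink, and grow partition counts.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("partitions").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1000), 8)  # start with 8 partitions
print(rdd.getNumPartitions())                  # 8
print(rdd.coalesce(4).getNumPartitions())      # 4: merges partitions, avoids a shuffle
print(rdd.repartition(16).getNumPartitions())  # 16: triggers a full shuffle

spark.stop()
```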

amineKammah commented 3 years ago

Spark dynamic allocation:

```
"spark.dynamicAllocation.enabled": "true"
"spark.dynamicAllocation.initialExecutors": "1"
"spark.dynamicAllocation.minExecutors": "2"
"spark.dynamicAllocation.maxExecutors": "3"
"spark.dynamicAllocation.executorIdleTimeout": "7000s"
"spark.shuffle.service.enabled": "true"
"spark.dynamicAllocation.shuffleTracking.enabled": "true"
"spark.network.timeout": "700s"
"spark.shuffle.registration.timeout": "6000s"
```
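For reference, one way these settings could be applied when building a PySpark session (the app name is illustrative; the values are copied from above). Note that `initialExecutors` is set below `minExecutors` here; Spark clamps the initial executor count up to at least `minExecutors`.

```python
# Sketch: applying the dynamic-allocation settings above via SparkSession.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sdtd-processing")  # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.initialExecutors", "1")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "3")
    .config("spark.dynamicAllocation.executorIdleTimeout", "7000s")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.network.timeout", "700s")
    .config("spark.shuffle.registration.timeout", "6000s")
    .getOrCreate()
)
```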