amineKammah opened this issue 4 years ago (status: Open)
Using Google's Tesseract OCR engine
Installation steps:
pip install pytesseract
pip install Pillow
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
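Once the packages above are installed, a minimal OCR call looks like the sketch below. This uses the real `pytesseract.image_to_string` API; the image path and language code are placeholder assumptions, not part of the original notes.

```python
# Minimal pytesseract sketch (assumes the installation steps above were run;
# the image path below is a hypothetical placeholder).
from PIL import Image
import pytesseract


def extract_text(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract OCR on a single image and return the raw text."""
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


if __name__ == "__main__":
    print(extract_text("sample_page.png"))
```

In a Spark job, `extract_text` would be the function mapped over each image in the dataset.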
Installing Kubernetes, Docker and Spark on an EC2 Ubuntu instance:
Building the image: sudo /usr/local/spark/bin/docker-image-tool.sh -r kammahm -p ~/ensimag-projects/sdtd/ensimag-sdtd/dockerfile build
We are dealing with images, whose size tends to be significant, so sending them over the network introduces noticeable latency. Using Spark for a small number of images may therefore be very inefficient. The idea is to find the threshold (number of images) above which Spark brings a significant performance improvement over parallel computing on a single instance.
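The break-even reasoning above can be sketched with a simple cost model: local processing costs time proportional to the number of images, while the Spark run pays a fixed shipping/scheduling overhead but splits the work across executors. All numbers below are hypothetical assumptions for illustration, not measurements from the project.

```python
# Back-of-envelope model for the Spark break-even point described above.
# Assumed (hypothetical) parameters:
#   t_local    - seconds to OCR one image on a single machine
#   t_overhead - fixed cost of shipping images over the network + scheduling
#   workers    - number of Spark executors
def local_time(n_images: int, t_local: float = 2.0) -> float:
    # single-instance processing: strictly proportional to the workload
    return n_images * t_local


def spark_time(n_images: int, t_local: float = 2.0,
               t_overhead: float = 30.0, workers: int = 3) -> float:
    # fixed network/scheduling overhead plus the work split across executors
    return t_overhead + n_images * t_local / workers


def break_even(t_local: float = 2.0, t_overhead: float = 30.0,
               workers: int = 3) -> int:
    # smallest number of images for which the Spark run is strictly faster
    n = 1
    while spark_time(n, t_local, t_overhead, workers) >= local_time(n, t_local):
        n += 1
    return n
```

With these assumed numbers (2 s per image, 30 s overhead, 3 executors) Spark only wins past a few dozen images; the real threshold would come from benchmarking the actual cluster.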
Understanding Partitioning in Spark: https://medium.com/parrot-prediction/partitioning-in-apache-spark-8134ad840b0
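As a quick illustration of the idea covered in the article above: Spark's default `HashPartitioner` routes each key to a partition via a non-negative modulo of its hash. The toy function below mirrors that rule in plain Python (it is a sketch of the concept, not Spark's actual implementation, which uses the JVM's hash codes).

```python
# Conceptual sketch of Spark's HashPartitioner:
# partition index = nonNegativeMod(hash(key), num_partitions).
def hash_partition(key, num_partitions: int) -> int:
    # Python's % already yields a non-negative result for a positive
    # modulus, mirroring Spark's nonNegativeMod helper
    return hash(key) % num_partitions
```

For our use case the keys would be image identifiers, and the partition count controls how the OCR work is spread across executors.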
Spark dynamic allocation:
"spark.dynamicAllocation.enabled": "true"
"spark.dynamicAllocation.initialExecutors": "1"
"spark.dynamicAllocation.minExecutors": "2"
"spark.dynamicAllocation.maxExecutors": "3"
"spark.dynamicAllocation.executorIdleTimeout": "7000s"
"spark.shuffle.service.enabled": "true"
"spark.dynamicAllocation.shuffleTracking.enabled": "true"
"spark.network.timeout": "700s"
"spark.shuffle.registration.timeout": "6000s"
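One way to apply the settings above is to pass them as `--conf` flags to `spark-submit`; the sketch below does exactly that. The master URL, API server address, and application script are placeholders, not values from the project.

```shell
# Hypothetical spark-submit invocation carrying the dynamic-allocation
# settings above (<api-server> and main.py are placeholders):
/usr/local/spark/bin/spark-submit \
  --master k8s://https://<api-server>:6443 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=1 \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=3 \
  --conf spark.dynamicAllocation.executorIdleTimeout=7000s \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.network.timeout=700s \
  --conf spark.shuffle.registration.timeout=6000s \
  main.py
```

The same keys could equally go into `spark-defaults.conf` instead of the command line.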
Progress log:
TO DOs: