dinosauria123 / gcv2hocr

gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf.

Adding Multi-Threading Support for gcvocr.sh and Fix Here #41

Open UBISOFT-1 opened 2 years ago

UBISOFT-1 commented 2 years ago

I have updated gcvocr.sh so that it can be called from multi-threaded code. The source is in this gist: https://gist.github.com/UBISOFT-1/4017d641c329159f8de3d203efc919e1

The updated script is also included below to demonstrate the change.

The Problem with the Original: temp.json

The original .sh file always writes its request body to ./temp.json and deletes it afterwards. That breaks in a case like mine, where I use Python's multiprocessing.dummy library to run a Pool of concurrent threads: several calls to the script run at the same time, and each one overwrites or removes the temp.json that another call is still using.

Solution @dinosauria123

Approach 1: Let the caller choose a random name to replace temp.json

#!/bin/bash
# Updated to support multi-threaded callers.
# Usage: gcvocr.sh <image> <api-key> <unique-name>
# $3 is a random md5 hash or a uuid.uuid4() value, so that concurrent runs of
# gcvocr.sh do not overlap on a single file named temp.json.
#cd ~

# Build the request body in a per-call temporary file.
echo '{"requests":[{"image":{"content":"' > "./$3.json"
openssl base64 -in "$1" >> "./$3.json"
echo '"},
"features":{"type":"TEXT_DETECTION","maxResults":2048}}]}' >> "./$3.json"
# Send the request; the response is saved as <image>.json.
curl -k -s -H "Content-Type: application/json" "https://vision.googleapis.com/v1/images:annotate?key=$2" --data-binary "@./$3.json" > "$1.json"
rm "./$3.json"

The gist linked above shows how to generate the random string passed as $3 from Python.
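As a rough illustration of that idea, the sketch below builds the argument list for one call and gives it its own name via uuid.uuid4(). The helper name build_gcvocr_command, the file names, and the placeholder key are hypothetical and are not taken from the gist.

```python
import uuid


def build_gcvocr_command(image_path, api_key, script="./gcvocr.sh"):
    """Return an argument list for one gcvocr.sh call with its own temp name.

    build_gcvocr_command is a hypothetical helper, not part of gcv2hocr.
    """
    # uuid4().hex is 32 hexadecimal characters, so it is safe in a file name.
    unique_name = uuid.uuid4().hex
    return [script, image_path, api_key, unique_name]


# Two calls receive two different names, so their temp files cannot overlap.
first = build_gcvocr_command("page-001.png", "YOUR_API_KEY")
second = build_gcvocr_command("page-002.png", "YOUR_API_KEY")
print(first[3] != second[3])
```

Returning a list instead of a single string means the command can be passed to subprocess without shell=True, so file names containing spaces need no extra quoting.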

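Approach 2: Use $RANDOM

In bash (the default terminal shell on Ubuntu), echo $RANDOM prints a random integer between 0 and 32767. Inside the script, store it once, for example name=$RANDOM, and use $name wherever the original used $3. Do not write $RANDOM in place of each $3, because every expansion of $RANDOM produces a new value.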
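Bash documents $RANDOM as an integer between 0 and 32767, so two concurrent calls can occasionally pick the same name. The sketch below estimates how likely that is. It assumes each job draws its name independently and uniformly from 32768 values, which is an idealization of how bash seeds $RANDOM; the function name is hypothetical.

```python
def collision_probability(concurrent_jobs, name_space=32768):
    """Probability that at least two of the jobs draw the same name.

    Assumes independent, uniform draws from name_space values.
    """
    p_all_distinct = 1.0
    for i in range(concurrent_jobs):
        # The (i+1)-th job must avoid the i names already taken.
        p_all_distinct *= (name_space - i) / name_space
    return 1.0 - p_all_distinct


for jobs in (2, 8, 32):
    print(jobs, collision_probability(jobs))
```

Under these assumptions the risk stays small for a handful of threads but grows with the pool size, which is one reason a uuid.uuid4() value may be the safer choice.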

UBISOFT-1 commented 2 years ago

Function for Conversion

import os
import sys
import uuid
from functools import partial
from multiprocessing.dummy import Pool
from subprocess import call

    def main_converter_json_handler(self):
        print('[+] Main script: converting the images to JSON.')
        commands = []
        for image in self.output_image_list:
            # One unique name per command. "$RANDOM" is avoided here because
            # shell=True runs /bin/sh, which is dash on Ubuntu and does not set $RANDOM.
            random_name = uuid.uuid4().hex
            cmd = f'./gcvocr.sh "{image}" "{self.api_simple}" "{random_name}"'
            commands.append(cmd)
        pool = Pool(self.gcvocr_threads)  # n concurrent commands at a time
        # imap yields return codes in the same order as commands.
        for i, returncode in enumerate(pool.imap(partial(call, shell=True), commands)):
            if returncode != 0:
                print("%d command failed: %d" % (i, returncode))
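The method above depends on a class that is not shown here. As a standalone sketch of the same pattern, the function below runs a list of commands on a thread pool and reports the ones that fail. The name run_commands and the stand-in commands are hypothetical; in real use each entry would be a gcvocr.sh argument list.

```python
import sys
from multiprocessing.dummy import Pool  # thread-based Pool
from subprocess import call


def run_commands(commands, threads=4):
    """Run argument-list commands concurrently.

    Returns a list of (index, returncode) pairs for the commands that failed.
    """
    with Pool(threads) as pool:
        # imap preserves input order, so index i matches commands[i].
        returncodes = list(pool.imap(call, commands))
    return [(i, rc) for i, rc in enumerate(returncodes) if rc != 0]


# Stand-in commands so the sketch runs without gcvocr.sh or an API key.
demo = [
    [sys.executable, "-c", "import sys; sys.exit(0)"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
]
print(run_commands(demo, threads=2))
```

Passing each command as a list avoids shell=True, so no shell variable expansion or quoting is involved.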