
OCR_4_Forest_Service

Table of contents

General info
Technologies
Pre-Setup
Setup
Pipeline's Parameters
Template numbers
Run Pipeline
Known Issues that need to be fixed -- Future Work

General info

OCR_4_Forest_Service implements a PDF processing pipeline used to extract handwritten words from Forest Service forms. All of the code can be executed using the bash scripts under the exec/ folder, and the extracted data is stored in the Forest Service database using the files in the db/ folder.
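For orientation, here is a partial sketch of the repository layout, limited to the files and folders mentioned in this README (the location of step1_connect_db.sh under exec/ is an assumption based on the naming of the other step scripts):

OCR_4_Forest_Service/
    exec/
        step0_initialize_db.sh
        step1_connect_db.sh
        step2_generate_data.sh
        step3_save_leave.sh
    db/
        orm_scripts/
            create_report.py
    inputs/
        jsons/
    ocr_4_forest_service.py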

Technologies

Make sure you have the following installed (all of these are used by the scripts and commands in this README):

Git
bash
Python
PostgreSQL (built from source, see Pre-Setup below)

Pre-Setup

The structure of this repository assumes that you have installed postgres from source under a folder called tools at the same level as this repository. We also assume that the database folder is at that same root level and is called forest_data.
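In other words, the expected layout of the parent directory looks roughly like this (folder names are taken from the assumptions above):

parent-directory/
    OCR_4_Forest_Service/   (this repository)
    forest_data/            (database folder)
    tools/                  (postgres installed from source)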

Setup

To run this project, clone it locally using the following command:

$ git clone https://github.com/BoiseState-AdaptLab/OCR_4_Forest_Service.git

Once cloned, navigate into the OCR_4_Forest_Service directory:

$ cd OCR_4_Forest_Service/

The pipeline can be run in production mode or testing mode (-t).

After cloning the repo and cd'ing into it, the program can be run with the following steps. Run everything from inside the OCR_4_Forest_Service/ directory.

1) bash exec/step0_initialize_db.sh
   This script needs to be executed only once. WARNING: if you run it more than once, the data stored in the database up to that point will be lost.

2) bash exec/step1_connect_db.sh
   This step can be skipped if no database issues have occurred.

3) bash exec/step2_generate_data.sh -i <input file name> -v <pipeline/google> -n <1/2/3>
   This script creates the virtual environment and installs the packages needed to run the pipeline, so the first time you run it, execution will take longer because all of the packages need to be installed. It then runs the entire pipeline and stores the results in the database. You need to provide the name of the input file, the version of the program you want to run (either our pipeline or the Google Vision API), and the template number (see Template numbers below).
   NOTE: if you want to run the Google Vision API, open db/orm_scripts/create_report.py and change the name of the JSON file you want to read from on line 15. Follow the in-file comments to make this change.
   Example: bash exec/step2_generate_data.sh -i test9.jpeg -v pipeline -n 3

4) bash exec/step3_save_leave.sh
   Run this when you are done processing files and want to disconnect from the database server.
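For reference, a typical end-to-end session for a single form, using our pipeline with template number 3, might look like the following. The input file test9.jpeg is the example name used above; step1_connect_db.sh is omitted because it is only needed when database issues occur:

$ bash exec/step0_initialize_db.sh                                 # first time only -- re-running wipes the database
$ bash exec/step2_generate_data.sh -i test9.jpeg -v pipeline -n 3  # process one form and store the results
$ bash exec/step3_save_leave.sh                                    # disconnect from the database server when done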

Pipeline's Parameters

You don't have to worry about the pipeline's direct parameters: the bash scripts above take care of those for you. If you are curious, the Run Pipeline section below shows how they are set up.

Template numbers

When running the second bash script (bash exec/step2_generate_data.sh), you'll need to know which template version you are running. Use the images below as a reference.

Run Pipeline (The user doesn't need to worry about this section)

To run the production pipeline:

$ python ocr_4_forest_service.py -i <name-of-pdf-form> -json <json-coordinate-file> -temp <template-for-file-alignment>

This command will create a CSV file called test_data.csv that will be the input to the Optical Character Recognition model.

If you want to run the testing pipeline, execute the following command (the same as above, but with the -t flag added):

$ python ocr_4_forest_service.py -i <name-of-pdf-form> -json <json-coordinate-file> -temp <template-for-file-alignment> -t

Note: when running the testing pipeline, the JSON file needs to include labels for each field. We provide some examples of these files inside the inputs/jsons directory, each named after the input form.
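As an illustration only, a testing run on the test9 form might look like the following. The file name inputs/jsons/test9.json is an assumption based on the "named after the input form" convention above, and the template argument is left as a placeholder because it depends on your template files:

$ python ocr_4_forest_service.py -i test9.jpeg -json inputs/jsons/test9.json -temp <template-for-file-alignment> -t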

Known Issues that need to be fixed -- Future Work

There are some known bugs in this software that need to be fixed. This is not a comprehensive list of bugs, but rather a place to start.