A tool for preparing handwriting data for OCR.
pip install git+https://github.com/OpenPecha/ocr_handwriting_aligner.git
from pathlib import Path

from ocr_handwriting_aligner.pipeline import pipeline

# Input PDF and its transcript CSV
pdf_file_path = Path("P000015_v001_00001 - 00250.pdf")
transcript_file_path = Path("P000015_v001_transcript.csv")
image_orientation = "Portrait"

acceptable_images = pipeline(pdf_file_path, transcript_file_path, image_orientation)
print(f"Number of acceptable line images: {len(acceptable_images)}")
Important Notes:
Outputs after running the above code:
from pathlib import Path

from ocr_handwriting_aligner.parse_transcript import standardize_line_texts_to_images_csv_mapping

# CSV that maps line texts to line images
csv_file_path = Path("line_image_mapping.csv")
batch_id = "P000015"
volume_id = "v001"

standardize_line_texts_to_images_csv_mapping(csv_file_path, batch_id, volume_id)
Output after running the above code:
Standardized CSV: named in the format "{batch_id}_{volume_id}.csv" (in this case P000015_v001.csv), with the headings "image_name", "transcript", "image_url".
"image_url" refers to an S3 bucket link.
New output directory: named "{batch_id}_{volume_id}" (in this case P000015_v001). All acceptable images are copied into this directory, and you can upload them from there to the desired S3 bucket.
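Before uploading, it may be worth checking that every image listed in the standardized CSV is actually present in the output directory. The following is a minimal sketch, not part of ocr_handwriting_aligner: the helper name find_missing_images is hypothetical, and it assumes that the "image_name" column holds bare file names relative to the output directory and that the CSV is UTF-8 encoded with a header row.

```python
import csv
from pathlib import Path


def find_missing_images(csv_path: Path, image_dir: Path) -> list[str]:
    """Return the image names listed in the CSV that are not files in image_dir.

    Hypothetical helper. Assumes the "image_name" column contains file names
    relative to image_dir (an assumption, not confirmed by the library).
    """
    missing = []
    with csv_path.open(newline="", encoding="utf-8") as f:
        # DictReader uses the first row ("image_name", "transcript", "image_url") as keys
        for row in csv.DictReader(f):
            name = row["image_name"]
            if not (image_dir / name).is_file():
                missing.append(name)
    return missing
```

For the example above, this could be called as find_missing_images(Path("P000015_v001.csv"), Path("P000015_v001")); an empty list would suggest the CSV and the directory agree.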