[RFW0014] Automated Image-cropping Pipeline

Housekeeping
Owner
Summary
Is This Really Necessary?
Motivation
Named Concepts
Examples
Conceptual Design
Drawbacks
Alternatives
New Data
Adoption Window

Housekeeping

Take time to complete each section below with as much detail as is required to establish a comprehensive understanding about the underlying product specification.

ALL BELOW FIELDS ARE REQUIRED

Owner

@eroux
@ngawangtrinley

Summary

BDRC has many images that contain several pecha pages. We need to automate the image-cropping process with a custom computer vision model. This project will use Prodigy as a human-in-the-loop pipeline to create an initial training dataset, train a model, and iteratively improve it.

Is This Really Necessary?

Yes. BDRC doesn't have the staff capacity to do this work manually.

Motivation

Several years ago, BDRC received large collections of scanned images, but it hasn't been able to catalog and display them on its website because each image contains several pecha pages.

The way the pictures were taken is also inconsistent. Images have two, three, or sometimes four pages per image. The order in which the front and back sides of pecha folios are grouped is also inconsistent. For example, the front sides of several folios might appear on one image and the back sides of the same folios on another image. In one image of the front sides of folios, the order might be 1, 2, 3, while in the image of the back sides of the same folios it might be 3, 2, 1. In other cases, the order of the back sides might be 1, 2, 3.

Unless we find a way to semi-automatically crop and reorder these images, we won't be able to make these pechas accessible to readers.

Named Concepts

Prodigy: prodi.gy
Image: Refers to photographed or scanned images, which are served by BDRC using the IIIF protocol. In this project, most images are photographed, which includes a lot of background objects behind the pecha page.
Pecha page: Refers to the traditional Tibetan book format in landscape orientation. In the context of this project, several pecha page sides are captured in a single image.

Examples

1) This project will have three outcomes: 1. a human-in-the-loop pipeline built around Prodigy, 2. an image-cropping model, 3. cropped images.
2) The cropped images will be made available on the BUDA library.
3) BDRC's partners, who provided these pechas for free, will see them published, and it will relieve the pressure on BDRC to publish them.
4) N/A

Conceptual Design

Proposed process

Deploy an instance of Prodigy on an AWS server.
Describe the format of the input data needed from BDRC, for example CSV files with image URLs.
Get the data from a custom BDRC API endpoint.
Load the images in Prodigy's manual annotation UI (https://prodi.gy/docs/computer-vision#manual).
Train freelancers to do the annotations.
When there is enough data, train a model.
Pre-annotate a second batch of images with the model.
Have humans correct the pre-annotations.
Repeat until the model performs well enough.
Process all of BDRC's images that need automated cropping with the model.
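
As an illustration of the input-data step, the sketch below converts a CSV of image URLs into JSONL tasks, one JSON object per line with an "image" key, which is the shape we understand Prodigy's image tasks to take. The column name image_url, the sample URLs, and the helper names are assumptions made for the example; the real format still has to be agreed with BDRC.

```python
import csv
import io
import json


def csv_to_tasks(csv_text, url_column="image_url"):
    """Turn CSV rows into task dicts with an "image" key, one per image URL."""
    reader = csv.DictReader(io.StringIO(csv_text))
    tasks = []
    for row in reader:
        url = (row.get(url_column) or "").strip()
        if url:  # skip rows whose URL cell is empty
            tasks.append({"image": url})
    return tasks


def tasks_to_jsonl(tasks):
    """Serialize tasks as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(task) for task in tasks)


# Hypothetical CSV content; the real column name and URLs would come from BDRC.
sample_csv = (
    "image_url\n"
    "https://example.org/iiif/img1/full/max/0/default.jpg\n"
    "\n"
    "https://example.org/iiif/img2/full/max/0/default.jpg\n"
)
tasks = csv_to_tasks(sample_csv)
print(tasks_to_jsonl(tasks))
```

The resulting file could then be passed to the annotation recipe as its source, assuming the Prodigy deployment is configured to read JSONL rather than a local image directory.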

Notes

The model will be considered "good enough" when it separates all the pages of one image without leaving extra borders or cutting text on the pecha pages.
This work doesn't include the reordering of the cropped pages. We'll write scripts to reorder the cropped pages as a separate project.
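
One possible way to turn the "good enough" criterion into a number is to compare the model's boxes with the human-corrected boxes using intersection over union (IoU). This is only a sketch: the (x, y, width, height) box layout, the 0.95 threshold, and the function names are our assumptions, not an agreed acceptance test.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


def image_is_acceptable(predicted, corrected, threshold=0.95):
    """True when the page counts match and every corrected box is closely
    overlapped by some predicted box (IoU >= threshold, an assumed cutoff)."""
    if len(predicted) != len(corrected):
        return False  # a page was missed or an extra region was detected
    return all(
        max(iou(p, c) for p in predicted) >= threshold
        for c in corrected
    )


# Hypothetical boxes for one image with two pecha pages.
corrected = [(100, 100, 3000, 600), (100, 900, 3000, 600)]
predicted = [(102, 101, 2996, 598), (100, 900, 3000, 600)]
print(image_is_acceptable(predicted, corrected))
```

A high threshold penalizes both extra borders (box too large) and cut text (box too small), since either one lowers the overlap ratio.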

Common image formats (Check Elie's comment for actual images)

Three pecha pages on a dark background
Four pecha pages on a dark background
Three pecha pages on a light background
Three light pecha pages on a light background
...

Naming convention

Each image should be named with its own BDRC image ID. The output files will preserve that ID with the page number as a suffix. The order of page numbers should follow top to bottom, left to right order.

Example input: <image ID>.jpg, e.g. I8LS766730003.jpg

Example output: <image ID>_1.jpg, <image ID>_2.jpg, e.g. I8LS766730003_1.jpg, I8LS766730003_2.jpg
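
The convention above could be implemented roughly as follows. This is a minimal sketch: boxes are assumed to be (x, y, width, height) tuples in pixels, and the row_tolerance parameter, which decides when two boxes count as being on the same row, is a hypothetical value that would need tuning on real images.

```python
from pathlib import PurePosixPath


def reading_order(boxes, row_tolerance=20):
    """Sort (x, y, width, height) boxes top to bottom, then left to right
    within a row. Boxes whose top edges differ by at most row_tolerance
    pixels from the first box of a row are treated as the same row."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


def output_names(input_name, page_count):
    """Keep the image ID and extension, adding _1, _2, ... as page suffixes."""
    path = PurePosixPath(input_name)
    return [f"{path.stem}_{n}{path.suffix}" for n in range(1, page_count + 1)]


# Hypothetical detections for one image, deliberately out of order.
boxes = [(0, 400, 900, 180), (0, 10, 900, 180), (0, 205, 900, 180)]
ordered = reading_order(boxes)
names = output_names("I8LS766730003.jpg", len(ordered))
for name, box in zip(names, ordered):
    print(name, box)
```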

Drawbacks

None

Alternatives

We tested a program called Scantailor, which is semi-automated but still involves too much manual work to be viable for this project. Another drawback of Scantailor is that it can only detect up to two pages per image.
Outsourcing the work entirely to humans. This is too costly and, depending on the tool, might not be as precise.
Using the software shipped with modern scanners, which detects page boundaries. We don't think it will work with three or four pages on one image. (We didn't thoroughly test this solution.)

New Data

It could be:

Cropped images named according to the naming convention above
JSON files with the bounding box coordinates, so that we do the cropping ourselves
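
If the JSON option is chosen, a record per image might look like the one in the sketch below. The field names (image, width, height, boxes) are purely illustrative assumptions, not an agreed schema. The helper converts each box to corner coordinates and clamps them to the image bounds.

```python
import json

# Hypothetical record layout; the field names are assumptions, not an agreed schema.
record_json = """
{
  "image": "I8LS766730003.jpg",
  "width": 4000,
  "height": 3000,
  "boxes": [
    {"x": 120, "y": 150, "width": 3700, "height": 800},
    {"x": 130, "y": 1100, "width": 3700, "height": 820}
  ]
}
"""


def crop_rectangles(record):
    """Convert each box to (left, top, right, bottom), clamped to the image."""
    rects = []
    for box in record["boxes"]:
        left = max(0, box["x"])
        top = max(0, box["y"])
        right = min(record["width"], box["x"] + box["width"])
        bottom = min(record["height"], box["y"] + box["height"])
        rects.append((left, top, right, bottom))
    return rects


record = json.loads(record_json)
rects = crop_rectangles(record)
print(rects)
```

The (left, top, right, bottom) form is the one accepted by Pillow's Image.crop, so the rectangles could be applied directly if the cropping is done with that library.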

Adoption Window

We are hoping to have Prodigy deployed and the manual annotation work started in the next 2-3 weeks. We are hoping to have the final model trained before the end of the year.

OpenPecha / Requests