OpenPecha / Requests

RFWs and RFCs for all OpenPecha repositories

[RFW0014] Automated Image-cropping Pipeline #29

Open evanyerburgh opened 2 years ago

evanyerburgh commented 2 years ago

Housekeeping

Take time to complete each section below with as much detail as is required to establish a comprehensive understanding about the underlying product specification.

ALL BELOW FIELDS ARE REQUIRED

Owner

Summary

BDRC has many images that contain several pecha pages. We need to automate the image-cropping process with a custom computer vision model. This project will use Prodigy as a human-in-the-loop pipeline to create an initial training dataset, train a model, and iteratively improve it.

Is This Really Necessary?

Yes, because BDRC doesn't have the staff capacity to do this work manually.

Motivation

Several years ago, BDRC received large collections of scanned images, but it hasn't been able to catalog and display them on its website because each image contains several pecha pages.

The way the pictures were taken is also inconsistent. Images have two, three, or sometimes four pages per image. The order in which the front and back sides of pecha folios are grouped is also inconsistent. For example, the front sides of a group of folios might all appear on one image and the back sides of the same folios on another image. In an image of the front sides, the order might be 1, 2, 3, while in the image of the back sides of the same folios it might be 3, 2, 1. In other cases, the order of the back sides might also be 1, 2, 3.

Unless we find a way to semi-automatically crop and reorder these images, we won't be able to make these pechas accessible to readers.
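
As a rough illustration of the reordering problem described above, here is a minimal Python sketch that interleaves the front and back sides of the same folios once the pages have been cropped. The function name, the `backs_reversed` flag, and the page labels are all hypothetical. How the pipeline would actually detect that the back sides are in reverse order is not specified in this proposal.

```python
def reading_order(front_pages, back_pages, backs_reversed):
    """Interleave front and back sides of the same folios into reading order.

    front_pages: cropped front sides, in the order they appear on the image.
    back_pages: cropped back sides of the same folios, from the other image.
    backs_reversed: True if the back sides were photographed in reverse
        order (3, 2, 1), False if they follow the front sides (1, 2, 3).
    """
    if len(front_pages) != len(back_pages):
        raise ValueError("front and back images must show the same number of folios")
    backs = list(reversed(back_pages)) if backs_reversed else list(back_pages)
    ordered = []
    for front, back in zip(front_pages, backs):
        ordered.extend([front, back])
    return ordered


# Hypothetical labels: "1a" is the front of folio 1, "1b" its back.
pages = reading_order(["1a", "2a", "3a"], ["3b", "2b", "1b"], backs_reversed=True)
```

In this sketch the flag is supplied by a person or by some later detection step, which matches the semi-automatic approach the proposal asks for.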

Named Concepts

Examples

Conceptual Design

Proposed process

Notes

Common image formats (see Elie's comment for actual images)

  1. Three pecha pages on a dark background
  2. Four pecha pages on a dark background
  3. Three pecha pages on a light background
  4. Three light pecha pages on a light background
  5. ...

Naming convention

Each input image should be named with its own BDRC image ID. The output files will preserve that ID and add the page number as a suffix. Pages should be numbered from top to bottom, then from left to right.

Example input: <image ID>.jpg, e.g. I8LS766730003.jpg

Example output: <image ID>_1.jpg, <image ID>_2.jpg, e.g. I8LS766730003_1.jpg, I8LS766730003_2.jpg
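
The naming convention above could be sketched in Python roughly as follows. This is only an illustration: the `(x, y, width, height)` box format, the function names, and the `row_tolerance` heuristic for deciding that two pages sit in the same row are assumptions, not part of the proposal.

```python
from pathlib import PurePath


def order_pages(boxes, row_tolerance=0.5):
    """Order (x, y, width, height) boxes top to bottom, then left to right.

    Boxes whose vertical centres differ by less than `row_tolerance` times
    the height of the first box in a row are treated as the same row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1] + b[3] / 2):
        centre = box[1] + box[3] / 2
        if rows:
            first = rows[-1][0]
            first_centre = first[1] + first[3] / 2
            if abs(centre - first_centre) < row_tolerance * first[3]:
                rows[-1].append(box)
                continue
        rows.append([box])
    # Within each row, pages are numbered from left to right.
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


def output_names(input_name, page_count):
    """Build '<image ID>_<page number>' file names, numbering from 1."""
    path = PurePath(input_name)
    return [f"{path.stem}_{n}{path.suffix}" for n in range(1, page_count + 1)]


# Three pages stacked vertically, listed in arbitrary detection order.
detected = [(10, 400, 900, 150), (12, 20, 900, 150), (8, 210, 900, 150)]
ordered = order_pages(detected)
names = output_names("I8LS766730003.jpg", len(ordered))
```

Grouping into rows before sorting by x keeps two side-by-side pages in left-to-right order even when their detected top edges differ by a few pixels.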

Drawbacks

None

Alternatives

  1. We tested a program called ScanTailor, which is semi-automated but still involves too much manual work to be viable for this project. Another drawback of ScanTailor is that it can only detect up to two pages per image.

  2. Outsourcing the work entirely to human annotators. This is too costly and, depending on the tool, might not be as precise.

  3. Using the software shipped with modern scanners, which detects page boundaries. We don't think it will work with three or four pages on one image, but we didn't test this solution thoroughly.

New Data

The output could be either of the following:

  1. Cropped images named according to the naming convention above
  2. JSON files with the bounding box coordinates, so that we can do the cropping ourselves
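
For the second option, here is a minimal Python sketch of how such a JSON file might be turned into crop rectangles. The JSON layout (`image`, `boxes`, `x`, `y`, `width`, `height`) is a hypothetical schema chosen for illustration, since the proposal does not define one. The boxes are assumed to be listed already in page order.

```python
import json
from pathlib import PurePath


def crop_boxes(record_json):
    """Map each output file name to a (left, upper, right, lower) rectangle.

    Assumes a hypothetical record of the form
    {"image": "<image ID>.jpg", "boxes": [{"x", "y", "width", "height"}, ...]}.
    """
    record = json.loads(record_json)
    path = PurePath(record["image"])
    result = {}
    for number, box in enumerate(record["boxes"], start=1):
        left, upper = box["x"], box["y"]
        rectangle = (left, upper, left + box["width"], upper + box["height"])
        result[f"{path.stem}_{number}{path.suffix}"] = rectangle
    return result


example = """
{
  "image": "I8LS766730003.jpg",
  "boxes": [
    {"x": 12, "y": 20, "width": 900, "height": 150},
    {"x": 8, "y": 210, "width": 900, "height": 150}
  ]
}
"""
rectangles = crop_boxes(example)
```

The `(left, upper, right, lower)` tuple is the form accepted by Pillow's `Image.crop`, so if Pillow were used, each rectangle could be passed to it directly.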

Adaption Window

We hope to have Prodigy deployed and the manual annotation work under way within the next 2-3 weeks, and to have the final model trained before the end of the year.

eroux commented 2 years ago

Here are a few links for reference:

and a few notes just in order not to forget: