Housekeeping
Take time to complete each section below with as much detail as is required to establish a comprehensive understanding of the underlying product specification.
ALL BELOW FIELDS ARE REQUIRED
Owner
@eroux
@ngawangtrinley
Summary
BDRC has many images that contain several pecha pages. We need to automate the image-cropping process with a custom computer vision model. This project will use Prodigy as a human-in-the-loop pipeline to create an initial training dataset, train a model, and iteratively improve it.
Is This Really Necessary?
Yes, because BDRC doesn't have the human resources to do this work manually.
Motivation
Several years ago, BDRC received large collections of scanned images but has not been able to catalog and display them on its website, because each image contains several pecha pages.
The way the pictures were taken is also inconsistent. Images contain two, three, or sometimes four pages each. The order in which the front and back sides of pecha folios are grouped is also inconsistent: all the front sides of a set of folios might appear on one image and the back sides of the same folios on another. In one image the front sides might be ordered 1, 2, 3, while the back sides of the same folios are ordered 3, 2, 1; in other cases, the back sides are also ordered 1, 2, 3.
Unless we find a way to semi-automatically crop and reorder these images, we won't be able to make these pechas accessible to readers.
Named Concepts
Image: Refers to photographed or scanned images, which are served by BDRC using the IIIF protocol (illustrated after these definitions). In this project, most images are photographs, which include many background objects behind the pecha pages.
Pecha page: Refers to a page in the traditional Tibetan book format, which has a landscape orientation. In the context of this project, several pecha page sides are captured in a single image.
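To illustrate what "served using the IIIF protocol" means in practice, the sketch below builds a IIIF Image API request URL. The host and identifier are hypothetical placeholders, not real BDRC endpoints, and nothing here commits the project to using IIIF region requests for the cropping itself.

```python
# Illustration of the IIIF Image API URL pattern used to fetch an image
# (or a region of it) from a IIIF server. Host and identifier below are
# hypothetical placeholders, not real BDRC endpoints.
IIIF_TEMPLATE = "{host}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}"

def iiif_url(identifier, region="full", size="max",
             host="https://iiif.example.org/iiif/2"):
    """Build a IIIF Image API request URL for the given image identifier."""
    return IIIF_TEMPLATE.format(
        host=host,
        identifier=identifier,
        region=region,        # "full" or "x,y,w,h" in pixels
        size=size,            # "max", "w,h", or "!w,h"
        rotation=0,
        quality="default",
        format="jpg",
    )

# Full image vs. a 2000x800 pixel region starting at (100, 250):
print(iiif_url("I8LS766730003"))
print(iiif_url("I8LS766730003", region="100,250,2000,800"))
```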
Examples
1) This project will have three outcomes: 1. a human-in-the-loop pipeline built around Prodigy, 2. an image-cropping model, and 3. cropped images.
2) The cropped images will be made available on the BUDA library.
3) BDRC's partners, who provided these pechas for free, will see them published, and it will relieve the pressure on BDRC to publish them.
4) N/A
Conceptual Design
Proposed process
Deploy an instance of Prodigy on an AWS server.
Describe the format of the input data to be provided by BDRC, for example CSV files with image URLs (one possible conversion to Prodigy tasks is sketched after this list).
Annotate an initial training dataset in Prodigy, train the model, and improve it iteratively with further human-in-the-loop corrections.
Process all of BDRC's images that need automated cropping with the final model.
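As a concrete illustration of the input-format step, the sketch below converts a hypothetical CSV of image URLs (assumed column name image_url; the real layout is still to be agreed with BDRC) into the JSONL task format that Prodigy's image recipes read, where each task carries an "image" key.

```python
# A minimal sketch, assuming a CSV with a column named "image_url"
# (hypothetical). Each output line is a Prodigy image task: {"image": <url>}.
import csv
import json

def csv_to_prodigy_jsonl(csv_path, jsonl_path, url_column="image_url"):
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            task = {"image": row[url_column], "meta": {"source": "BDRC"}}
            dst.write(json.dumps(task) + "\n")

# File names here are illustrative only.
csv_to_prodigy_jsonl("bdrc_images.csv", "bdrc_images.jsonl")
```

The resulting JSONL could then be loaded into a manual image-annotation recipe, e.g. something like prodigy image.manual pecha_pages bdrc_images.jsonl --label PAGE.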
Notes
The model will be considered "good enough" when it separates all the pages of one image without leaving extra borders or cutting text on the pecha pages (one possible way to quantify this is sketched after these notes).
This work doesn't include the reordering of the cropped pages. We'll write scripts to reorder the cropped pages as a separate project.
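One possible way to make the "good enough" criterion measurable, offered only as an assumption rather than an agreed acceptance test, is to compare the model's predicted page boxes against human-annotated boxes using intersection-over-union (IoU):

```python
# An assumed quantitative check (not a decided acceptance test): all pages
# must be found, and each predicted box must closely match its annotation.
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max) in pixels."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def crop_quality_ok(predicted, annotated, threshold=0.95):
    """Rough check: same page count, and boxes (matched by sorted position)
    overlap above an assumed IoU threshold."""
    if len(predicted) != len(annotated):
        return False
    return all(iou(p, a) >= threshold
               for p, a in zip(sorted(predicted), sorted(annotated)))
```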
Common image formats (check Elie's comment for actual images)
Three pecha pages on a dark background
Four pecha pages on a dark background
Three pecha pages on a light background
Three light pecha pages on a light background
...
Naming convention
Each image should be named with its own BDRC image ID. The output files will preserve that ID with the page number as a suffix. Page numbers should follow top-to-bottom, left-to-right order.
Example input: I8LS766730003.jpg
Example output: I8LS766730003_1.jpg, I8LS766730003_2.jpg
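A minimal sketch of how the convention could be applied in code, assuming page boxes are given as (x_min, y_min, x_max, y_max) tuples in pixels; the row-grouping tolerance is an illustrative heuristic, not part of the spec:

```python
# Order page boxes top to bottom, then left to right, and name each crop
# <BDRC image ID>_<page number>.jpg.
from pathlib import Path

def output_names(image_path, boxes, row_tolerance=50):
    """Return (box, output filename) pairs in page order.

    boxes: list of (x_min, y_min, x_max, y_max) tuples in pixels.
    row_tolerance: boxes whose top edges fall within this many pixels are
    treated as one row and ordered left to right (an assumed heuristic).
    """
    image_id = Path(image_path).stem  # e.g. "I8LS766730003"
    ordered = sorted(boxes, key=lambda b: (round(b[1] / row_tolerance), b[0]))
    return [(box, f"{image_id}_{i}.jpg") for i, box in enumerate(ordered, start=1)]

# Example: two page boxes on I8LS766730003.jpg
print(output_names("I8LS766730003.jpg", [(40, 620, 1900, 980), (40, 80, 1900, 440)]))
# -> [((40, 80, 1900, 440), 'I8LS766730003_1.jpg'),
#     ((40, 620, 1900, 980), 'I8LS766730003_2.jpg')]
```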
Drawbacks
None
Alternatives
We tested a program called ScanTailor, which is semi-automated but still involves too much manual work to be viable for this project. Another drawback of ScanTailor is that it can only detect up to two pages per image.
Outsourcing the work completely to humans, but this is too costly, and depending on the tool, it might not be as precise.
Using the software that ships with modern scanners, which detects page boundaries, but we don't think it will work when there are three or four pages on one image. (We didn't thoroughly test this option.)
New Data
It could be either:
Cropped images named according to the naming convention above, or
JSON files with the bounding box coordinates so that we can do the cropping ourselves (a possible format is sketched below).
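If the JSON option is chosen, the record structure and field names below are only a hypothetical sketch of what the deliverable could look like, together with how the cropping could then be done with Pillow; the actual schema is still open.

```python
# Hypothetical JSON record (field names assumed, not agreed) and a sketch of
# cropping the source image with Pillow based on it.
from PIL import Image

example_record = {
    "image_id": "I8LS766730003",
    "width": 4000,            # dimensions the coordinates refer to
    "height": 3000,
    "boxes": [                # x_min, y_min, x_max, y_max in pixels, in page order
        {"page": 1, "x_min": 120, "y_min": 200, "x_max": 3880, "y_max": 1050},
        {"page": 2, "x_min": 120, "y_min": 1150, "x_max": 3880, "y_max": 2000},
    ],
}

def crop_from_record(record, image_path, out_dir="."):
    """Cut each annotated box out of the source image and save it as
    <image_id>_<page>.jpg, following the naming convention above."""
    with Image.open(image_path) as img:
        for box in record["boxes"]:
            crop = img.crop((box["x_min"], box["y_min"], box["x_max"], box["y_max"]))
            crop.save(f'{out_dir}/{record["image_id"]}_{box["page"]}.jpg')

# crop_from_record(example_record, "I8LS766730003.jpg")
```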
Adoption Window
We are hoping to have Prodigy deployed and the manual annotation work started within the next 2-3 weeks, and to have the final model trained before the end of the year.
Sometimes the images are very large and uncompressed, but we can work on compressed, smaller copies in Prodigy and then translate the pixel coordinates back to the full-size image (see the sketch after these notes).
The current thinking is that the images to crop would live in a dedicated S3 bucket.
We should think about a notification system ("there are images to crop", "these images have been cropped"), but this could be a separate project.
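A minimal sketch of the coordinate translation mentioned above, assuming boxes are annotated on a downscaled copy and mapped back to the original using the width and height ratios; names are illustrative only.

```python
# Map a bounding box drawn on a compressed preview back to the full-size image.
def scale_box(box, small_size, full_size):
    """Map (x_min, y_min, x_max, y_max) from the small image to the full image.

    small_size / full_size: (width, height) of the compressed and original images.
    """
    sx = full_size[0] / small_size[0]
    sy = full_size[1] / small_size[1]
    x_min, y_min, x_max, y_max = box
    return (round(x_min * sx), round(y_min * sy), round(x_max * sx), round(y_max * sy))

# A box drawn in Prodigy on a 1000x750 preview, mapped back to a 4000x3000 original:
print(scale_box((30, 55, 970, 260), (1000, 750), (4000, 3000)))
# -> (120, 220, 3880, 1040)
```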