[RFC0022] serving images for layout analysis

Work Planning

Details

- [Housekeeping](#housekeeping)
- [Named Concepts](#named-concepts)
- [Summary](#summary)
- [Reference-Level Explanation](#reference-level-explanation)
- [Alternatives](#alternatives)
  * [Rationale](#rationale)
- [Drawbacks](#drawbacks)
- [Useful References](#useful-references)
- [Unresolved questions](#unresolved-questions)
- [Parts of the system affected](#parts-of-the-system-affected)
- [Future possibilities](#future-possibilities)
- [Infrastructure](#infrastructure)
- [Testing](#testing)
- [Documentation](#documentation)
- [Version History](#version-history)
- [Recordings](#recordings)
- [Work Phases](#work-phases)

Housekeeping

[RFC0022] serving images for layout analysis

ALL BELOW FIELDS ARE REQUIRED

Named Concepts

*Explain any new concepts introduced in this request.*

Summary

The selected unique images need to be processed with the same methodology used to process the page annotation images, then served to the Prodigy recipe through a CSV file so they can be streamed to the layout_analysis instances.
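The serving side of this summary, a CSV of s3_keys turned into a stream of image tasks, can be sketched roughly as follows. This is only an illustration, not the actual recipe in prodigy-tools: the function name `stream_from_csv`, the `base_url` parameter, and the one-key-per-row CSV layout are all assumptions, and the real recipe may resolve s3_keys to images differently (for example through signed URLs). The task shape, a dictionary with an `image` field, follows the usual convention for Prodigy image tasks.

```python
import csv


def stream_from_csv(csv_file_path, base_url):
    """Yield one image task per s3_key found in the CSV.

    Assumptions (not from the RFC): the s3_key sits in the first column,
    there is no header row, and `base_url` is a hypothetical prefix under
    which the uploaded images are reachable.
    """
    prefix = base_url.rstrip("/")
    with open(csv_file_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            # Skip blank rows and rows with an empty first column.
            if not row or not row[0].strip():
                continue
            s3_key = row[0].strip()
            yield {"image": f"{prefix}/{s3_key}", "meta": {"s3_key": s3_key}}
```

Because the function is a generator, the recipe can hand it to Prodigy as a stream without loading the whole CSV into memory.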

Reference-Level Explanation

- I will receive a `.csv` file containing the repo name `OCR###` and the work_id `W######`.
- Download the repo and go through the `unique_images` folder to get the list of unique images.
- Use the work_id to get the s3_keys of all the images of that work on S3.
- Get the s3_keys of the images in the unique_images list and write them to a `.txt` file.
- Parse the `.txt` file containing the s3_keys of the selected unique images and process those images with the same methodology used for the sample images for page annotation: resize the image, compress it, and encode it using Pillow.
- Upload the processed images to an S3 bucket such as `openpecha.bdrc.io` and append the s3_key of each uploaded image to a CSV file.
- Give the `csv_file_path` to the Prodigy recipe so it can parse the CSV file into a list of s3_keys to stream on `prodigy.bdrc.io/layout_analysis/`.
- The proposed changes interact with other systems (or other parts of the system that is changed)
- The actual implementation will take place
- Known challenges can be readily overcome

*This section includes practical examples and explains how this proposal makes those examples work.*

*This section becomes the engineering specification and work plan, so it must be sufficiently detailed to facilitate that.*
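The selection steps described in this section (reading the input `.csv`, matching the unique images against the s3_keys of the work, and writing the result to a `.txt` file) can be sketched as below. This is a minimal illustration under stated assumptions, not the existing scripts: the function names are hypothetical, the input CSV is assumed to have two columns and no header, and matching on the file stem (the name without its extension) is an assumption that may not fit the real naming of the images. Downloading the repo and listing the keys on S3 are left out.

```python
import csv
from pathlib import Path, PurePosixPath


def read_work_rows(csv_path):
    """Return (repo_name, work_id) pairs from the input .csv.

    Assumed layout: repo name in column 1, work_id in column 2, no header.
    Rows with fewer than two columns (including blank lines) are skipped.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [
            (row[0].strip(), row[1].strip())
            for row in csv.reader(f)
            if len(row) >= 2
        ]


def select_unique_keys(unique_image_names, work_s3_keys):
    """Keep the s3_keys whose file stem matches a unique image name.

    Comparing stems is an assumption: it tolerates a different extension
    between the copy in `unique_images` and the object on S3.
    The order of `work_s3_keys` is preserved.
    """
    wanted = {Path(name).stem for name in unique_image_names}
    return [key for key in work_s3_keys if PurePosixPath(key).stem in wanted]


def write_keys(keys, txt_path):
    """Write one s3_key per line to the .txt file."""
    Path(txt_path).write_text("".join(f"{key}\n" for key in keys), encoding="utf-8")
```

In use, the names found in the `unique_images` folder and the list of keys for the work would be passed to `select_unique_keys`, and its result to `write_keys`; the processing and upload steps would then read that `.txt` file line by line.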

Alternatives

*Confirm that alternative approaches have been evaluated and explain those alternatives briefly.*

Rationale

- Why the currently proposed design was selected over alternatives?
- What would be the impact of going with one of the alternative approaches?
- Is the evaluation tentative, or is it recommended to use more time to evaluate different approaches?

Drawbacks

*Describe any particular caveats and drawbacks that may arise from fulfilling this particular request?*

Useful References

We already have all the scripts needed.

- What similar work have we already successfully completed?
- Is this something that has already been built by others?
- What other related learnings do we have?
- Is there useful academic literature or are there other articles related to this topic? (provide links)
- Have we built a relevant prototype previously?
- Do we have a rough mock for the UI/UX?
- Do we have a schematic for the system?

Unresolved Questions

- What is there that is unresolved (and will be resolved as part of fulfilling this request)?
- Are there other requests with same or similar problems to solve?

Parts of the System Affected

- Which parts of the current system are affected by this request?
- What other open requests are closely related with this request?
- Does this request depend on fulfillment of any other request?
- Does any other request depend on the fulfillment of this request?

Future possibilities

*How do you see the particular system or part of the system affected by this request be altered or extended in the future.*

Infrastructure

- Requires an S3 bucket, such as `openpecha.bdrc.io`, to upload the processed selected unique images to. @ngawangtrinley

Testing

The image processing has already been tested, since it is the same processing used for the page annotation images.

Documentation

*Describe the level of documentation fulfilling this request involves. Consider both end-user documentation and developer documentation.*

Version History

v0.1

Recordings

*Links to audio recordings of related discussion.*

Work Phases

[x] parse .csv file containing the Repo name OCR### and work_id W###### time estimation: 10 min time taken: 10 min
[x] download the repo, go through unique_images folder to get the list of the unique images time estimation: 10 min time taken: 10 min
[x] use the work_id to get all the s3_keys of all the images in the work_id on s3 time estimation: 1 hour time taken: 1 hour
[x] get the s3_keys of all the unique_images list and write it in a .txt text file on prodigy-tools. time estimation: 30 min time taken: 30 min
[x] #19 time estimation: 1 hour time taken:
[x] #20 time estimation: 10 min time taken:

Planning

Keep original naming and structure, and keep as first section in Work phases section

[ ] RFC completed on: Estimated time: Actual time:
[ ] RFC reviewed and approved by: Estimated time: Actual time:

Implementation

A list of checkboxes, one per PR. Each PR should have a descriptive name that clearly illustrates what the work phase is about.

[ ] PR 1 Estimated time: Actual time:
[ ] PR 2 Estimated time: Actual time:

Completion

[ ] Tested and approved by: @username @username Estimated time: Actual time:
[ ] Documentation approved @evanyerburgh Estimated time: Actual time:

OpenPecha / prodigy-tools