[RFC0022] serving images for layout analysis
ALL BELOW FIELDS ARE REQUIRED
Named Concepts
*Explain any new concepts introduced in this request.*
Summary
Selected unique images need to be processed with the same methodology used for page annotation images, then served to the Prodigy recipe through a CSV file so they can be streamed to layout_analysis instances.
Reference-Level Explanation
- I will receive a `.csv` file containing the repo name `OCR###` and the work_id `W######`.
- Download the repo and go through its `unique_images` folder to get the list of unique images.
- Use the work_id to list the s3_keys of all the images of that work on S3.
- Find the s3_keys of the images in the unique-images list and write them to a `.txt` file.
- Parse the `.txt` file of s3_keys for the selected unique images and process each image with the same methodology used for the page annotation sample images: resize the image, compress it, and encode it using Pillow.
- Upload the processed images to an S3 bucket such as `openpecha.bdrc.io` and append the s3_key of each uploaded image to a CSV file.
- Give the `csv_file_path` to the Prodigy recipe so it can parse the CSV file into a list of s3_keys to stream on `prodigy.bdrc.io/layout_analysis/`.
- The proposed changes interact with other systems (or other parts of the system that is changed)
- The actual implementation will take place
- Known challenges can be readily overcome
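The first four steps above (reading the input CSV, matching unique images to S3 keys, writing the `.txt` list) can be sketched as follows. This is a minimal illustration, not the existing scripts: the function names are hypothetical, the CSV is assumed to have two columns per row (repo name, then work_id) with no header, and each S3 key is assumed to end with the image file name found in `unique_images`.

```python
import csv
import io
from pathlib import PurePosixPath


def read_work_rows(csv_text):
    """Return (repo_name, work_id) pairs from two-column CSV text (assumed layout)."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank or incomplete rows instead of failing on them.
        if len(row) >= 2 and row[0].strip():
            rows.append((row[0].strip(), row[1].strip()))
    return rows


def match_unique_keys(unique_image_names, s3_keys):
    """Keep only the S3 keys whose file name is one of the unique images."""
    wanted = set(unique_image_names)
    return [key for key in s3_keys if PurePosixPath(key).name in wanted]


def write_key_list(keys, txt_path):
    """Write one S3 key per line to a text file."""
    with open(txt_path, "w", encoding="utf-8") as handle:
        for key in keys:
            handle.write(key + "\n")
```

Matching on the file name alone keeps the sketch independent of the bucket's prefix layout; if two images in one work could share a file name, the match would need the full relative path instead.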
*This section includes practical examples and explains how this proposal makes those examples work.*
*This section becomes the engineering specification and work plan, so it must be sufficiently detailed to facilitate that.*
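As an example of the last two steps in the list (appending each uploaded s3_key to a CSV file, then letting the recipe parse that file into a stream), here is a minimal sketch. It assumes one s3_key per CSV row in the first column; `append_processed_key`, `stream_tasks`, and the `url_for_key` callback are hypothetical names, and the `{"image": ...}` task shape is an assumption about what the layout_analysis recipe expects rather than a confirmed detail of it.

```python
import csv


def append_processed_key(csv_file_path, s3_key):
    """Append one uploaded S3 key as a new row of the CSV file."""
    with open(csv_file_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow([s3_key])


def stream_tasks(csv_file_path, url_for_key):
    """Yield one task dict per S3 key listed in the CSV file."""
    with open(csv_file_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if row and row[0].strip():
                key = row[0].strip()
                # url_for_key turns a key into something the browser can load,
                # for example a presigned URL (hypothetical helper).
                yield {"image": url_for_key(key), "meta": {"s3_key": key}}
```

Because `stream_tasks` is a generator, the recipe can start serving images before the whole CSV has been read, and the CSV can grow between runs without any change to the recipe.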
Alternatives
*Confirm that alternative approaches have been evaluated and explain those alternatives briefly.*
Rationale
- Why was the currently proposed design selected over alternatives?
- What would be the impact of going with one of the alternative approaches?
- Is the evaluation tentative, or is it recommended to use more time to evaluate different approaches?
Drawbacks
*Describe any particular caveats and drawbacks that may arise from fulfilling this particular request.*
Useful References
We already have all the scripts needed.
- What similar work have we already successfully completed?
- Is this something that has already been built by others?
- What other related learnings do we have?
- Is there useful academic literature, or are there other articles, related to this topic? (provide links)
- Have we built a relevant prototype previously?
- Do we have a rough mock for the UI/UX?
- Do we have a schematic for the system?
Unresolved Questions
- What is there that is unresolved (and will be resolved as part of fulfilling this request)?
- Are there other requests with same or similar problems to solve?
Parts of the System Affected
- Which parts of the current system are affected by this request?
- What other open requests are closely related with this request?
- Does this request depend on fulfillment of any other request?
- Does any other request depend on the fulfillment of this request?
Future possibilities
*How do you see the system, or the part of the system affected by this request, being altered or extended in the future?*
Infrastructure
- Requires an S3 bucket, such as `openpecha.bdrc.io`, to hold the processed selected unique images. @ngawangtrinley
Testing
The image-processing code was already tested when it was used to process images for page annotation.
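For reference, a check of the shared processing step could look like the sketch below. It is not the existing page annotation code: `process_image`, the 1000 px bound, and the JPEG quality of 75 are hypothetical, and "encode" is interpreted here as JPEG encoding plus a base64 string, which is an assumption.

```python
import base64
import io

from PIL import Image


def process_image(image_bytes, max_size=(1000, 1000), quality=75):
    """Resize, JPEG-compress, and base64-encode one image (assumed settings)."""
    with Image.open(io.BytesIO(image_bytes)) as img:
        img = img.convert("RGB")   # JPEG cannot store alpha or palette modes
        img.thumbnail(max_size)    # shrinks in place and keeps the aspect ratio
        buffer = io.BytesIO()
        img.save(buffer, format="JPEG", quality=quality, optimize=True)
    jpeg_bytes = buffer.getvalue()
    return jpeg_bytes, base64.b64encode(jpeg_bytes).decode("ascii")
```

A test can then feed in a synthetic image and assert that the output is a JPEG no larger than the bound, without touching S3.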
Documentation
*Describe the level of documentation fulfilling this request involves. Consider both end-user documentation and developer documentation.*
Version History
v0.1
Recordings
*Links to audio recordings of related discussion.*
Work Phases
[x] parse .csv file containing the Repo name OCR### and work_id W######
time estimation: 10 min
time taken: 10 min
[x] download the repo, go through unique_images folder to get the list of the unique images
time estimation: 10 min
time taken: 10 min
[x] use the work_id to get all the s3_keys of all the images in the work_id on s3
time estimation: 1 hour
time taken: 1 hour
[x] get the s3_keys of all the unique_images list and write it in a .txt text file on prodigy-tools.
time estimation: 30 min
time taken: 30 min
[x] #19
time estimation: 1 hour
time taken:
[x] #20
time estimation: 10 min
time taken:
Planning
Keep original naming and structure, and keep as first section in Work phases section
[ ] RFC completed on:
Estimated time:
Actual time:
[ ] RFC reviewed and approved by:
Estimated time:
Actual time:
Implementation
A list of checkboxes, one per PR. Each PR should have a descriptive name that clearly illustrates what the work phase is about.
[ ] PR 1
Estimated time:
Actual time:
[ ] PR 2
Estimated time:
Actual time:
Completion
[ ] Tested and approved by: @username @username
Estimated time:
Actual time:
[ ] Documentation approved @evanyerburgh
Estimated time:
Actual time: