
OpenPecha / OCR-Pipelines


fix directory on s3 #9

Closed eroux closed 1 year ago

eroux commented 1 year ago

From what I see in https://github.com/OpenPecha/OCR-Pipelines/blob/main/ocr_pipelines/upload.py, the S3 prefix is wrong. It should look like this:

s3://ocr.bdrc.io/Works/05/W1PD289/{service}/{batch}/

where {service} is vision for Google Vision, and {batch} is a unique identifier for the OCR batch of a specific work.
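
For illustration, the prefix layout described above could be built with a small helper like the one below. This is only a sketch, not the code in `upload.py`: the function name `ocr_batch_prefix` and the `hash_dir` parameter are hypothetical, and the two-character directory (`05` in the example) is passed in as-is because the issue does not say how it is derived. The batch name `batch001` is just a sample value.

```python
BUCKET = "ocr.bdrc.io"  # bucket name taken from the example path in the issue


def ocr_batch_prefix(work_id: str, hash_dir: str, service: str, batch: str) -> str:
    """Return the key prefix (without the bucket) for one OCR batch of a work.

    hash_dir is the two-character directory under Works/ (e.g. "05");
    how it is computed is not stated in the issue, so it is an input here.
    """
    return f"Works/{hash_dir}/{work_id}/{service}/{batch}/"


prefix = ocr_batch_prefix("W1PD289", "05", "vision", "batch001")
print(f"s3://{BUCKET}/{prefix}")
```

Keeping the bucket separate from the key prefix matches how S3 clients usually take them as two distinct arguments.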

10zinten commented 1 year ago

@eroux can you review my PR for this?

10zinten commented 1 year ago

@eroux since our old script uploads both the OCR output and the images to s3://ocr.bdrc.io, should we do the same here?

The directories were:

images: s3://ocr.bdrc.io/Works/05/W1PD289/{service}/{batch}/images
OCR outputs: s3://ocr.bdrc.io/Works/05/W1PD289/{service}/{batch}/ocr_outputs

eroux commented 1 year ago

Looking at a random place on S3, I see the OCR output in output/, not ocr_outputs/, as in s3://ocr.bdrc.io/Works/02/W1EE36/vision/batch001/output/, so let's keep it that way.

eroux commented 1 year ago

But yes, we should absolutely keep the exact same file organization; I can't think of any reason to change it.
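
To make the agreed layout concrete, here is a minimal sketch of how object keys for the two subdirectories (`images/` and `output/`) might be derived from a batch prefix. The helper names `image_key` and `output_key` and the file names in the example are hypothetical and are not taken from the repository.

```python
def image_key(batch_prefix: str, filename: str) -> str:
    """Key for a source image, stored under <batch prefix>/images/."""
    return f"{batch_prefix.rstrip('/')}/images/{filename}"


def output_key(batch_prefix: str, filename: str) -> str:
    """Key for an OCR result, stored under <batch prefix>/output/ (not ocr_outputs/)."""
    return f"{batch_prefix.rstrip('/')}/output/{filename}"


# Example batch prefix from the thread; the file names below are made up.
batch_prefix = "Works/02/W1EE36/vision/batch001/"
print(image_key(batch_prefix, "page0001.jpg"))
print(output_key(batch_prefix, "page0001.json"))
```

Keys built this way could then be passed, together with the bucket name, to an S3 client call such as boto3's `upload_file(Filename, Bucket, Key)`.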