Closed by eroux 1 year ago
@eroux since our old script uploads both the OCR output and the images to s3://ocr.bdrc.io, should we do the same here?
The directories would be:
s3://ocr.bdrc.io/Works/05/W1PD289/{service}/{batch}/images
s3://ocr.bdrc.io/Works/05/W1PD289/{service}/{batch}/ocr_outputs
Looking at a random place on s3, I see the OCR output in output/, not ocr_outputs/, as in s3://ocr.bdrc.io/Works/02/W1EE36/vision/batch001/output/, so let's keep it that way.
But yes, we should absolutely keep the exact same file organization; I can't think of any reason to change it.
From what I see in https://github.com/OpenPecha/OCR-Pipelines/blob/main/ocr_pipelines/upload.py, the s3 prefix is wrong. It should be like
with {service}=vision for Google Vision, and {batch} being a unique identifier for the OCR batch of a specific work.
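For illustration, here is a minimal Python sketch of the layout discussed in this thread: images under images/ and OCR output under output/, both below Works/{prefix}/{work}/{service}/{batch}/. The class OcrBatchLocation and its method names are hypothetical and are not taken from OCR-Pipelines. The two-character directory under Works/ (05, 02 in the paths above) is passed in as a plain argument, because how it is derived is not stated here.

```python
from dataclasses import dataclass

# Bucket name as it appears in the paths quoted in this thread.
BUCKET = "ocr.bdrc.io"


@dataclass(frozen=True)
class OcrBatchLocation:
    """Hypothetical helper describing where one OCR batch lives on S3."""

    work_prefix: str  # two-character directory under Works/, e.g. "05" (derivation not covered here)
    work_id: str      # e.g. "W1PD289"
    service: str      # e.g. "vision" for Google Vision
    batch: str        # unique identifier of the OCR batch for this work

    @property
    def base_key(self) -> str:
        # Works/{prefix}/{work}/{service}/{batch}
        return f"Works/{self.work_prefix}/{self.work_id}/{self.service}/{self.batch}"

    def images_key(self, filename: str = "") -> str:
        # Key of an image file, or of the images/ directory if no filename is given.
        return f"{self.base_key}/images/{filename}"

    def output_key(self, filename: str = "") -> str:
        # Key of an OCR output file; the directory is output/, not ocr_outputs/.
        return f"{self.base_key}/output/{filename}"

    @staticmethod
    def to_uri(key: str) -> str:
        # Turn an object key into an s3:// URI for logging or display.
        return f"s3://{BUCKET}/{key}"


loc = OcrBatchLocation("02", "W1EE36", "vision", "batch001")
print(OcrBatchLocation.to_uri(loc.output_key()))
```

An uploader would then use loc.images_key(name) and loc.output_key(name) as object keys, so both kinds of files stay under the same batch directory.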