aws-samples / amazon-textract-transformer-pipeline

Post-process Amazon Textract results with Hugging Face transformer models for document understanding
MIT No Attribution

[Enhancement] Explicit steps to bring previously-Textracted data #17

Open · athewsey opened 2 years ago

athewsey commented 2 years ago

In some cases, users may already have run their corpus through Amazon Textract and want to get started with the sample without incurring the cost of re-processing all documents.

Although nothing in the model training code itself prevents this today, the notebook walkthrough steps often make assumptions about S3 structure. More explicit guidance could greatly reduce the notebook debugging currently required to use pre-Textracted data.

Context

Although the model training itself has a pretty broad interface for accepting JSON-lines manifests like:

```json
{
    "source-ref": "s3://.../.../wherever-your-page-thumbnail-image-is.png",  // images_prefix = "s3://.../..."
    "textract-ref": "s3://.../.../corresponding-textract-result.json",  // textract_prefix = "s3://.../..."
    "page-num": 2,  // 1-based number of this page in the textract-ref result
    "labels": { "some-smgt-": "-bbox-compatible-label" }
}
```

By contrast, the notebook sections for preparing/curating the dataset and visualizing results often make more explicit assumptions like: