aws-samples / amazon-textract-transformer-pipeline

Post-process Amazon Textract results with Hugging Face transformer models for document understanding
MIT No Attribution

[Enhancement] Explicit steps to bring previously-Textracted data #17

Open · athewsey opened 2 years ago

athewsey commented 2 years ago

In some cases, users may already have run their corpus through Amazon Textract and want to get started with the sample without incurring the cost of re-processing all documents.

Although nothing in the model training code itself prevents this today, the notebook walkthrough steps often make assumptions about S3 structure. More explicit guidance could greatly reduce the notebook debugging currently required to use pre-Textracted data.

Context

Although the model training itself has a pretty broad interface for accepting JSON-lines manifests like:

```json
{
    "source-ref": "s3://.../.../wherever-your-page-thumbnail-image-is.png",  // images_prefix = "s3://.../..."
    "textract-ref": "s3://.../.../corresponding-textract-result.json",  // textract_prefix = "s3://.../..."
    "page-num": 2,  // 1-based number of this page in the textract-ref result
    "labels": { "some-smgt-": "-bbox-compatible-label" }
}
```

By contrast, the notebook sections for preparing/curating the dataset and visualizing results often make more explicit assumptions like: