aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0

(JS) Option to group checkboxes by proximity #183

Open athewsey opened 4 months ago

In real-world forms, checkboxes and other selection elements are usually grouped, as in the example below (from the Textract try-it-out console sample document):

[image: grouped checkboxes in the Textract console sample document]

In Textract Key-Value Forms results, these items generally appear as un-grouped K-V pairs, one per checkbox: for example, a key such as "Conventional" whose value is a selection element.

As of today, Textract doesn't do any grouping of these selection element fields, and also doesn't give us any mapping to a predicted overall group label (e.g. "Mortgage Applied for:" versus "Authorization Type:"). I received a request from a customer for TRP (JS) to try and help more with this.

Since we don't really do ML within TRP itself, we can't get too fancy here... But I think it should be feasible to provide a way to access and iterate "selection groups" of form fields whose values are selection elements, by basic proximity heuristics?

Something along the lines of, e.g.:

for (const group of page.form.iterSelectionGroups({
  // Whatever *optional* heuristic grouping parameters make sense:
  vDistTol: 0.6,
  hDistTol: 2.4,
})) {
  // Can loop through the Form Fields:
  group.listFields();
  // Maybe some other convenience methods?:
  group.listSelectedNames(); // e.g. ["Conventional"]
  group.listUnselectedNames(); // e.g. ["VA", "Other (explain):", "FHA", "USDA/Rural Housing Service"]

  // This will *not* be feasible:
  // group.name == "Mortgage Applied For:"
}

Tagging the label/name of the group wouldn't really be possible without a feasible ML model, which I don't think we're looking to introduce in TRP at this time. While I think we could get okay performance on grouping the checkboxes from heuristics alone, identifying the label would be much less likely to work well.

Interested to hear feedback from others on what kind of API & accessors you'd find most helpful for this feature.