aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

[Doc] BoundingBox coordinate unit and scale are unclear #319

Open oonisim opened 7 months ago

oonisim commented 7 months ago

The documentation for class textractor.entities.bbox.BoundingBox(x: float, y: float, width: float, height: float, spatial_object=None) says:

Represents the bounding box of an object in the format of a dataclass with (x, y, width, height). By default BoundingBox is set to work with denormalized co-ordinates: x: (0, docwidth), y: (0, docheight). Use the as_normalized_dict function to obtain BoundingBox with normalized co-ordinates: x: (0, 1), y: (0, 1)


Problem

The definitions of docwidth and docheight are not clear.

Clarification

Do the pages in a Document object use x: (0, 1) and y: (0, 1) by default, or x: (0, width_in_pixels) and y: (0, height_in_pixels)? In other words, what do docwidth and docheight refer to in "By default BoundingBox is set to work with denormalized co-ordinates: x: (0, docwidth), y: (0, docheight)."?

With the code below, it appears to use (0, 1), but I am not sure where this is clearly documented or whether it is guaranteed to stay that way in the future.

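from textractor.data.constants import TextractFeatures

# `extractor` is a Textractor instance and FILEPATH is the input document path (both defined elsewhere).
document = extractor.analyze_document(
    file_source=str(FILEPATH),
    features=[
        TextractFeatures.LAYOUT,
        TextractFeatures.FORMS,
        TextractFeatures.TABLES,
    ],
    save_image=True,  # Needed to use the images property and visualize the document.
)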

bbox = document.pages[0].words[0].bbox
print(bbox)
-----
x: 0.40578076243400574, y: 0.14519663155078888, width: 0.08256930857896805, height: 0.009907064028084278

If docheight were in pixels, y would range over (0, 2339), but the values above are clearly not in pixels.

print(f"page height:{document.pages[0].height}, document page 0 image height:{document.images[0].height}")
-----
page height:1.0, document page 0 image height:2339

AWS Textract Documentation

The AWS documentation of BoundingBox is clear that the unit/scale is a ratio of the page width/height.

  • Height – The height of the bounding box as a ratio of the overall document page height.
  • Left – The X coordinate of the top-left point of the bounding box as a ratio of the overall document page width.
  • Top – The Y coordinate of the top-left point of the bounding box as a ratio of the overall document page height.
  • Width – The width of the bounding box as a ratio of the overall document page width.

Each BoundingBox property has a value between 0 and 1. The value is a ratio of the overall image width (applies to Left and Width) or height (applies to Height and Top). For example, if the input image is 700 x 200 pixels, and the top-left coordinate of the bounding box is (350,50) pixels, the API returns a Left value of 0.5 (350/700) and a Top value of 0.25 (50/200).
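The arithmetic in that example can be reproduced with a short sketch. The helper name `normalize_bbox` and the box size of 70 x 20 pixels are assumptions made for illustration; only the 700 x 200 image and the (350, 50) top-left corner come from the AWS text.

```python
def normalize_bbox(left_px, top_px, width_px, height_px, image_width, image_height):
    """Convert a pixel-space box into ratios of the image size (hypothetical helper)."""
    return {
        "Left": left_px / image_width,      # ratio of the overall image width
        "Top": top_px / image_height,       # ratio of the overall image height
        "Width": width_px / image_width,
        "Height": height_px / image_height,
    }


# 700 x 200 image with the top-left corner at (350, 50), as in the AWS example.
# The 70 x 20 box size is an assumed value, not part of the AWS example.
box = normalize_bbox(350, 50, 70, 20, image_width=700, image_height=200)
print(box["Left"], box["Top"])  # 0.5 0.25
```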


Belval commented 7 months ago

With the code below, it appears it is using (0,1) but not sure where it is clearly documented and guaranteed to be so in the future.

Correct, the default BoundingBox representation uses floating-point numbers in [0, 1]. The BoundingBox class does have a handful of functions to provide denormalized coordinates if needed.

In terms of parsing, everything is done with normalized coordinates: https://github.com/aws-samples/amazon-textract-textractor/blob/master/textractor/parsers/response_parser.py#L272
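For readers who just need pixel values, here is a minimal sketch of the conversion done by hand. It deliberately does not call any BoundingBox method; `to_pixels` is a hypothetical helper, and the image width of 1654 is an assumed value (only the height of 2339 appears in this thread).

```python
def to_pixels(x, y, width, height, image_width, image_height):
    """Scale normalized [0, 1] coordinates to pixel units (hypothetical helper)."""
    return (
        x * image_width,
        y * image_height,
        width * image_width,
        height * image_height,
    )


# Normalized values printed for document.pages[0].words[0].bbox earlier in the thread.
x, y = 0.40578076243400574, 0.14519663155078888
w, h = 0.08256930857896805, 0.009907064028084278

# image_height comes from document.images[0].height; image_width is an assumption.
left_px, top_px, width_px, height_px = to_pixels(
    x, y, w, h, image_width=1654, image_height=2339
)
print(round(left_px), round(top_px), round(width_px), round(height_px))
```

With a PIL image as returned by `document.images[0]`, the `width` and `height` attributes would supply the two scale factors.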

Hopefully this answers your question.

oonisim commented 7 months ago

Thanks for the follow up. Does it mean the sentence:

By default BoundingBox is set to work with denormalized co-ordinates: x: (0, docwidth), y: (0, docheight). Use the as_normalized_dict function to obtain BoundingBox with normalized co-ordinates: x: (0, 1), y: (0, 1)

should be:

By default BoundingBox is set to work with normalized co-ordinates: x: (0, 1), y: (0, 1). Use appropriate functions to work with the denormalized coordinates.

Is that the correct understanding? I suppose this would be more consistent with the Textract documentation.