aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

heuristic_line_break_threshold, along with other heuristic constants not doing anything #294

Open kostabasis opened 10 months ago

kostabasis commented 10 months ago

I noticed that even when testing extreme values of heuristic_line_break_threshold, heuristic_overlap_ratio, and heuristic_h_tolerance, the output did not change. This led me to examine how they are used in the library: heuristic_line_break_threshold is never referenced outside of the class parameter declaration, and the other two are referenced but still have no visible effect. Were these features simply released prematurely, or am I doing something wrong?

This is my basic logic for getting output:

`from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.text_linearization_config import TextLinearizationConfig

extractor = Textractor(profile_name="default")

# png_path is the path to the input image, defined elsewhere
document = extractor.analyze_document(
    file_source=png_path,
    features=[TextractFeatures.TABLES, TextractFeatures.SIGNATURES, TextractFeatures.LAYOUT],
    save_image=True,
)

config = TextLinearizationConfig(
    title_prefix="# ",
    section_header_prefix="## ",
    add_prefixes_and_suffixes_in_text=True,
    heuristic_line_break_threshold=0.9,
    heuristic_overlap_ratio=0.5,
    table_linearization_format="markdown",
)
print(document.get_text(config=config))`
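One way to confirm that a parameter has no effect is to render the same input under several values and compare the results. The sketch below is a generic helper and not part of Textractor: `parameter_has_effect`, `render`, and `fake_render` are hypothetical names, and the stub renderer only stands in for a real call such as `document.get_text(config=TextLinearizationConfig(**kwargs))`.

```python
from typing import Any, Callable, Dict, Iterable


def parameter_has_effect(
    render: Callable[..., str],
    base_kwargs: Dict[str, Any],
    name: str,
    values: Iterable[Any],
) -> bool:
    """Return True if varying `name` over `values` changes the output of render()."""
    outputs = set()
    for value in values:
        kwargs = dict(base_kwargs)  # copy so the caller's dict is left untouched
        kwargs[name] = value
        outputs.add(render(**kwargs))
    # More than one distinct output means the parameter influenced the result.
    return len(outputs) > 1


# Hypothetical stub standing in for the real rendering call (assumption for illustration).
def fake_render(title_prefix: str = "", heuristic_line_break_threshold: float = 0.9) -> str:
    # Deliberately ignores heuristic_line_break_threshold, mimicking the reported behavior.
    return f"{title_prefix}Title\nbody"


if __name__ == "__main__":
    base = {"title_prefix": "# "}
    print(parameter_has_effect(fake_render, base, "heuristic_line_break_threshold", [0.0, 0.5, 100.0]))
    print(parameter_has_effect(fake_render, base, "title_prefix", ["# ", "## "]))
```

With a real document, `render` could be a small lambda that builds the config from the keyword arguments and returns the linearized text. Comparing whole output strings keeps the check independent of what the parameter is supposed to do.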

Belval commented 10 months ago

Good catch, this should be addressed by #298

kostabasis commented 9 months ago

Hey, is there any chance this is already fixed in the PR? I tested the current PR locally with the commands below, and heuristic_line_break_threshold still doesn't seem to be used anywhere.

git clone git@github.com:aws-samples/amazon-textract-textractor.git
cd amazon-textract-textractor
git checkout origin version-1.7.0
pip install -e .
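
To check whether an identifier is referenced in a given checkout, the source tree can be scanned directly. This is a minimal sketch using only the standard library; `find_references` is a hypothetical helper, not something the repository provides, and it does plain substring matching rather than real Python parsing.

```python
from pathlib import Path
from typing import List, Tuple


def find_references(root: Path, identifier: str) -> List[Tuple[str, int, str]]:
    """Return (relative path, line number, stripped line) for each .py line containing identifier."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if identifier in line:  # substring match only, an intentional simplification
                hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Scans the current directory; run it from the root of the clone.
    for hit in find_references(Path("."), "heuristic_line_break_threshold"):
        print(hit)
```

Pointing `root` at `Path(textractor.__file__).parent` instead searches the copy of the package that Python actually imports, which helps rule out a stale install shadowing the local checkout.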

Belval commented 9 months ago

See https://github.com/aws-samples/amazon-textract-textractor/pull/298/commits/e9da3b0438598b3e2f99f810d1893d1ff65c2125

kostabasis commented 9 months ago

I see, thank you. It was an issue on my end.