Issue with Markdown output (textractprettyprinter)

aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.

Apache License 2.0

408 stars, 145 forks

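    from textractcaller.t_call import call_textract, Textract_Features
    from textractprettyprinter.t_pretty_print import get_text_from_layout_json

    input_document = 's3://.../MY_FILE.pdf'

    # Request both LAYOUT and TABLES in the same Textract call
    textract_json = call_textract(
        input_document=input_document,
        features=[Textract_Features.LAYOUT, Textract_Features.TABLES],
    )

    # Linearize the layout as Markdown, skipping page headers and footers
    layout = get_text_from_layout_json(
        textract_json=textract_json,
        generate_markdown=True,
        exclude_page_header=True,
        exclude_page_footer=True,
        save_txt_path="./output",
    )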

I'm having the same issue. It looks reproducible whenever both LAYOUT and TABLES are extracted from a file that contains tables: for some reason the tables are printed at the end of the output instead of being linearized in their correct position. My code:

    from textractor import Textractor
    from textractor.data.constants import TextractFeatures
    from textractor.data.text_linearization_config import TextLinearizationConfig

    extractor = Textractor(profile_name="default")

    # png_path is the local path to the input image
    document = extractor.analyze_document(
        file_source=png_path,
        features=[
            TextractFeatures.LAYOUT,
            TextractFeatures.TABLES,
            TextractFeatures.SIGNATURES,
        ],
        save_image=True,
    )

    config_html = TextLinearizationConfig(
        add_prefixes_and_suffixes_in_text=True,
        table_tabulate_format="html",
    )

    # Pass the config defined above (the name `config` is never defined)
    print(document.get_text(config=config_html))
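To check for the symptom described above without reading every output by eye, a small heuristic like the following might help. It is only a sketch: `tables_only_at_end` is a hypothetical helper, not part of textractor or textractprettyprinter, and it assumes tables show up either as Markdown rows starting with `|` or as HTML `<table>` blocks. A document whose only table really is the last element will also be flagged, so treat a `True` result as a hint to look closer.

```python
def tables_only_at_end(text: str) -> bool:
    """Return True if the text has table content and no prose follows the first table.

    Heuristic only: Markdown table rows are lines starting with '|',
    HTML tables are everything between '<table' and '</table>'.
    """
    seen_table = False
    in_html_table = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        lower = line.lower()
        if in_html_table:
            # Still inside a multi-line HTML table; look for its end
            if "</table>" in lower:
                in_html_table = False
            continue
        if lower.startswith("<table"):
            seen_table = True
            # The table may open and close on the same line
            in_html_table = "</table>" not in lower
            continue
        if line.startswith("|"):
            seen_table = True
            continue
        if seen_table:
            # Prose after a table: tables are interleaved with the text
            return False
    return seen_table


if __name__ == "__main__":
    interleaved = "Intro\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\nClosing paragraph"
    appended = "Intro\nClosing paragraph\n\n| a | b |\n|---|---|\n| 1 | 2 |"
    print(tables_only_at_end(interleaved))
    print(tables_only_at_end(appended))
```

Running it on the string returned by `document.get_text(...)` or `get_text_from_layout_json(...)` for a page where the table sits mid-page should make the misplacement easy to confirm across several files.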

Issue with Markdown output (textractprettyprinter) #274