aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

Issue with Markdown output (textractprettyprinter) #274

Open jpbalarini opened 8 months ago

jpbalarini commented 8 months ago

There's an issue when I generate the text in Markdown format. For some reason, every list has its text duplicated: first as plain text, and then again with the proper Markdown formatting.

Here's how I'm generating my Markdown file:

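from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

input_document = 's3://.../MY_FILE.pdf'

# Call Textract with both the LAYOUT and TABLES features enabled
textract_json = call_textract(
  input_document=input_document,
  features=[Textract_Features.LAYOUT, Textract_Features.TABLES]
)
# Linearize the layout as Markdown, skipping page headers and footers
layout = get_text_from_layout_json(
  textract_json=textract_json,
  generate_markdown=True,
  exclude_page_header=True,
  exclude_page_footer=True,
  save_txt_path="./output"
)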

Example:

[Screenshot 2023-11-13 at 16 12 39]

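Until the duplication is fixed in the library, one possible workaround is to post-process the generated text. The sketch below is not part of textractprettyprinter: `drop_plaintext_list_duplicates` is a hypothetical helper, and it assumes that each duplicated item shows up as a plain line whose text is identical to a Markdown list item (`- `, `* `, `+ `, or `1. `) elsewhere in the same output. If the real output differs from that shape, the matching rule would need adjusting.

```python
import re

# Matches a Markdown list item ("- ", "* ", "+ " or "1. ") and captures its text.
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+(.*\S)\s*$")


def drop_plaintext_list_duplicates(markdown: str) -> str:
    """Remove plain lines whose text also appears as a Markdown list item.

    Hypothetical workaround: assumes the duplicate is an exact textual copy
    of the list item, only without the list marker.
    """
    lines = markdown.splitlines()

    # First pass: collect the text of every Markdown list item.
    list_texts = set()
    for line in lines:
        match = LIST_ITEM.match(line)
        if match:
            list_texts.add(match.group(1))

    # Second pass: keep list items, drop plain lines that repeat one of them.
    kept = []
    for line in lines:
        is_list_item = LIST_ITEM.match(line) is not None
        if not is_list_item and line.strip() in list_texts:
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "Intro\nFirst item\nSecond item\n- First item\n- Second item\nOutro"
cleaned = drop_plaintext_list_duplicates(sample)
print(cleaned)
```

Note that this would also drop a legitimate paragraph that happens to match a list item word for word, so it is only a stopgap for inspecting the output, not a fix for the underlying issue.
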
kostabasis commented 6 months ago

I'm having the same issue. It looks like this is reproducible in files that contain tables when you extract both LAYOUT and TABLES. For some reason, the tables are printed at the end of the output rather than being linearized in their correct position. My code:

from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.text_linearization_config import TextLinearizationConfig

extractor = Textractor(profile_name="default")

document = extractor.analyze_document(
    file_source=png_path,
    features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES, TextractFeatures.SIGNATURES],
    save_image=True,
)

config_html = TextLinearizationConfig(
    add_prefixes_and_suffixes_in_text=True,
    table_tabulate_format="html",
)

# Pass the config defined above (the original snippet referenced an undefined `config`)
print(document.get_text(config=config_html))