aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

Issue with Markdown output (textractprettyprinter) #274

Open jpbalarini opened 1 year ago

jpbalarini commented 1 year ago

There's an issue when I get the text in Markdown format: every list has its text duplicated, first as plain text and then again with the proper Markdown formatting.

Here's how I'm generating my Markdown file:

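from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

input_document = 's3://.../MY_FILE.pdf'

# Request LAYOUT and TABLES so the response contains layout and table blocks
textract_json = call_textract(
  input_document=input_document, features=[Textract_Features.LAYOUT, Textract_Features.TABLES]
)
# Linearize the layout into Markdown, skipping page headers and footers
layout = get_text_from_layout_json(
  textract_json=textract_json,
  generate_markdown=True,
  exclude_page_header=True,
  exclude_page_footer=True,
  save_txt_path="./output"
)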
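
For anyone trying to confirm the duplication without eyeballing the file, here is a minimal sketch of a check over the generated Markdown text. It is not part of textractprettyprinter; `find_duplicated_list_text` is a hypothetical helper, and it assumes list items are rendered with a `-`, `*`, `+`, or numbered marker and that the plain-text copy sits on a line of its own.

```python
import re
from typing import List

# Assumption: list items start with "-", "*", "+", or a number followed by "." or ")".
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+(.*\S)\s*$")


def find_duplicated_list_text(markdown: str) -> List[str]:
    """Return list-item texts that also appear as a plain (non-list) line."""
    plain_lines = set()
    item_texts = []
    for line in markdown.splitlines():
        match = LIST_ITEM.match(line)
        if match:
            # Keep only the item text, without the list marker
            item_texts.append(match.group(1))
        elif line.strip():
            plain_lines.add(line.strip())
    # A list item counts as duplicated if the same text exists as plain text
    return [text for text in item_texts if text in plain_lines]
```

Running it on the text of one of the files written under `./output` should return the list items that were also emitted as plain text; an empty result would suggest the duplication does not take the form assumed here.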

Example:

[Image: Screenshot 2023-11-13 at 16 12 39]

kostabasis commented 10 months ago

I'm having the same issue. It looks reproducible in files with tables when you extract both LAYOUT and TABLES: for some reason the tables are printed at the end of the output rather than being linearized in their correct position. My code:

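from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.text_linearization_config import TextLinearizationConfig

extractor = Textractor(profile_name="default")

# png_path is the path to the input image (defined elsewhere)
document = extractor.analyze_document(
    file_source=png_path,
    features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES, TextractFeatures.SIGNATURES],
    save_image=True,
)

# Render tables as HTML and add layout prefixes/suffixes to the text
config_html = TextLinearizationConfig(
    add_prefixes_and_suffixes_in_text=True,
    table_tabulate_format="html",
)

# Linearize the document using the config defined above
print(document.get_text(config=config_html))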