Open jpbalarini opened 1 year ago
I'm having the same issue. It looks reproducible in files that contain tables when you extract both LAYOUT and TABLES. For some reason, the tables are printed at the end of the output rather than linearized correctly. My code:
```python
from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.text_linearization_config import TextLinearizationConfig

extractor = Textractor(profile_name="default")

# png_path is the path to the input image (defined elsewhere)
document = extractor.analyze_document(
    file_source=png_path,
    features=[
        TextractFeatures.LAYOUT,
        TextractFeatures.TABLES,
        TextractFeatures.SIGNATURES,
    ],
    save_image=True,
)

config_html = TextLinearizationConfig(
    add_prefixes_and_suffixes_in_text=True,
    table_tabulate_format="html",
)

# Linearize the document using the HTML table config defined above
print(document.get_text(config=config_html))
```
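To make the "tables end up at the end" symptom easier to confirm, here is a minimal sketch that reports where table markup appears in the linearized text. It assumes the HTML table format emits a `<table` opening tag for each table. The helper name `table_positions` and the sample string are hypothetical and not part of textractor.

```python
import re


def table_positions(text: str, marker: str = "<table") -> list[float]:
    """Return each marker's offset as a fraction of the text length.

    0.0 means the very start of the text; values close to 1.0 mean the
    marker sits near the end.
    """
    if not text:
        return []
    return [m.start() / len(text) for m in re.finditer(re.escape(marker), text)]


# Made-up sample standing in for document.get_text(config=config_html)
sample = "Intro paragraph.\nMore body text.\n<table><tr><td>1</td></tr></table>"
print(table_positions(sample))
```

If every value is clustered near 1.0 while the tables sit in the middle of the page, that would be consistent with the behavior described above.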
There's an issue when I get the text in Markdown format. For some reason, all the lists duplicate their text: first as plain text and then again with the proper Markdown formatting.
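A rough way to check for the duplication described above is to look for list items whose text also appears as a standalone plain line. This is only a diagnostic sketch on the generated Markdown string. The helper name `duplicated_list_items` and the sample input are hypothetical, and it assumes the duplicated text matches the list item text exactly.

```python
import re

# Matches "-", "*", "+" bullets and "1." / "1)" numbered items
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+(.*\S)\s*$")


def duplicated_list_items(markdown: str) -> list[str]:
    """Return list-item texts that also occur as plain (non-list) lines."""
    lines = [line.strip() for line in markdown.splitlines()]
    # Every non-empty line that is not itself a list item
    plain = {line for line in lines if line and not LIST_ITEM.match(line)}
    dupes = []
    for line in lines:
        match = LIST_ITEM.match(line)
        if match and match.group(1) in plain:
            dupes.append(match.group(1))
    return dupes


# Made-up sample standing in for the generated Markdown
sample = "First item\nSecond item\n- First item\n- Second item\n- Third item"
print(duplicated_list_items(sample))
```

An empty result on a document that has lists would suggest the duplication is not an exact repeat, so the check would need to be loosened.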
Here's how I'm generating my Markdown file:
Example: