Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

fix: parsing pdf error - new_cells as str has no "copy" #3130

Closed christinestraub closed 1 month ago

christinestraub commented 1 month ago

Closes #3119.
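
For context, the error in the title is the generic Python failure you get when `.copy()` is called on a `str` where a list was expected. The sketch below reproduces that failure on a bare string and shows one defensive way to handle it. The helper name `safe_copy_cells`, the cell dictionary keys, and the choice to treat a non-list value as "no cells" are all assumptions made for illustration. This is not the actual change made in this PR.

```python
from typing import Any, Dict, List, Union

Cell = Dict[str, Any]


def safe_copy_cells(new_cells: Union[List[Cell], str, None]) -> List[Cell]:
    """Hypothetical helper: copy a list of table cells, tolerating non-list input.

    Lists have a .copy() method; str does not, so calling new_cells.copy()
    on a str raises AttributeError.
    """
    if not isinstance(new_cells, list):
        # Assumption: a non-list value (for example "") means no cells were found.
        return []
    # Copy each cell so callers can mutate the result without touching the input.
    return [dict(cell) for cell in new_cells]


# The failure mode named in the PR title, reproduced on a bare string.
try:
    "".copy()  # type: ignore[attr-defined]
except AttributeError as exc:
    error_message = str(exc)

# Illustrative cell keys only; the real cell schema may differ.
cells = safe_copy_cells([{"row": 0, "column": 0, "text": "a"}])
empty = safe_copy_cells("")
```

Returning an empty list keeps downstream table code on a single type; raising a `TypeError` with a clear message would be an equally reasonable choice if a string here should be treated as a bug upstream.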

Testing

Parsing the provided PDF should be successful.

testing_brochure_2.pdf

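from unstructured.partition.pdf import partition_pdf

filename = "testing_brochure_2.pdf"
# Open the PDF in binary mode and pass it as a file-like object.
with open(filename, "rb") as pdf_content:
    elements = partition_pdf(
        file=pdf_content,
        infer_table_structure=True,  # also extract table structure
        extract_image_block_types=["Image", "Table"],
        chunking_strategy="by_title",
        max_characters=1000,
        new_after_n_chars=3000,
        combine_text_under_n_chars=1000,
    )
# Print every returned element, separated by blank lines.
print("\n\n".join(str(el) for el in elements))

cragwolfe commented 1 month ago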

Please add the file to CI, or to an ingest CI test (which has the added benefit that the outputs are browsable).

Per the Slack conversation, this can be addressed in a separate PR.