Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

BUG - PPTX doesn't recognize text within slide notes #3256

Closed veredmm closed 1 week ago

veredmm commented 1 week ago

When parsing PPTX/PPT files, the text inside a slide's notes does not appear in the elements output.

Unstructured 0.14.3, python-pptx 0.6.23

Reproduce:

Parse the example file and notice that the notes text on slide 2 does not appear in the elements array.

test_remark.pptx

scanny commented 1 week ago

Thanks for reporting @veredmm, I'll have a look.

scanny commented 1 week ago

@veredmm partition_pptx() only includes slide notes when so instructed with an include_slide_notes=True argument. The default is False.

This file partitions as expected on my machine when using that argument.
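
For reference, a minimal sketch of that call. The helper name pptx_texts is hypothetical and only wraps partition_pptx() for illustration; it assumes unstructured is installed with PPTX support.

```python
def pptx_texts(path, include_slide_notes=True):
    """Return the text of every element partitioned from a PPTX file.

    pptx_texts is a hypothetical helper for illustration, not part of unstructured.
    """
    # Imported inside the function so that defining the helper does not
    # require the library to be importable.
    from unstructured.partition.pptx import partition_pptx

    # Slide notes are skipped unless include_slide_notes=True is passed.
    elements = partition_pptx(
        filename=path,
        include_slide_notes=include_slide_notes,
    )
    return [element.text for element in elements]
```

Calling pptx_texts("test_remark.pptx") and then pptx_texts("test_remark.pptx", include_slide_notes=False) should make the difference visible: only the first result should contain the notes text from slide 2.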