Unstructured-IO / unstructured-api

Apache License 2.0

feat/add include_slide_notes parameter #455

Closed mackurzawa closed 4 weeks ago

mackurzawa commented 1 month ago

Description

Testing

# Using the default value (True) returns an additional NarrativeText element that contains the notes
curl -X 'POST' 'http://localhost:8000/general/v0/general' -H 'accept: application/json' -H 'Content-Type: multipart/form-data' -F 'files=@sample-docs/notes.pptx' -F 'output_format="text/csv"'

# Explicit include_slide_notes=True returns an additional NarrativeText element that contains the notes
curl -X 'POST' 'http://localhost:8000/general/v0/general' -H 'accept: application/json' -H 'Content-Type: multipart/form-data' -F 'files=@sample-docs/notes.pptx' -F 'output_format="text/csv"' -F 'include_slide_notes=True'

# Explicit include_slide_notes=False returns no NarrativeText element for the notes
curl -X 'POST' 'http://localhost:8000/general/v0/general' -H 'accept: application/json' -H 'Content-Type: multipart/form-data' -F 'files=@sample-docs/notes.pptx' -F 'output_format="text/csv"' -F 'include_slide_notes=False'

The same checks apply to the file notes.ppt.
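For anyone who wants to run the whole matrix in one go, here is a rough sketch that generates the six curl commands (two files, three parameter states). It only builds and prints the commands and does not call the API. The path sample-docs/notes.ppt is an assumption based on where notes.pptx lives, and build_curl_args / build_matrix are hypothetical helper names, not part of this PR.

```python
import itertools
import shlex

URL = "http://localhost:8000/general/v0/general"
# Assumption: notes.ppt sits next to notes.pptx in sample-docs/.
FILES = ["sample-docs/notes.pptx", "sample-docs/notes.ppt"]
# None means "omit the parameter" so the server default applies.
NOTES_VALUES = [None, "True", "False"]


def build_curl_args(path, include_slide_notes=None):
    """Build the curl argument list for one test case (hypothetical helper)."""
    args = [
        "curl", "-X", "POST", URL,
        "-H", "accept: application/json",
        "-H", "Content-Type: multipart/form-data",
        "-F", f"files=@{path}",
        "-F", 'output_format="text/csv"',
    ]
    if include_slide_notes is not None:
        args += ["-F", f"include_slide_notes={include_slide_notes}"]
    return args


def build_matrix():
    """Return one argument list per (file, include_slide_notes) combination."""
    return [
        build_curl_args(path, value)
        for path, value in itertools.product(FILES, NOTES_VALUES)
    ]


if __name__ == "__main__":
    # Dry run: print shell-quoted commands instead of executing them.
    for args in build_matrix():
        print(shlex.join(args))
```

The printed lines can be pasted into a shell once the API is running locally.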

mackurzawa commented 1 month ago

Hey @awalker4, do you think it is worth mentioning this change in the changelog as a breaking change, since the default behavior is different? Here’s why I’m wondering: if a user calls the API with a PowerPoint file that includes slide notes, they will now receive additional elements for those notes. The previous elements remain unchanged, but their order might be different. I think it’s pretty rare for a user to care about element order, but it still might be worth noting.
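To make the concern concrete, here is a minimal sketch of how a client could compare two responses without depending on element order. It assumes each element in the JSON response is a dict with "type" and "text" keys; the sample data is made up for illustration, and element_key / added_elements / missing_elements are hypothetical helpers, not part of the API.

```python
from collections import Counter


def element_key(element):
    """Identify an element by its type and text (assumed response keys)."""
    return (element.get("type"), element.get("text"))


def added_elements(before, after):
    """Return elements in `after` that are not accounted for in `before`, ignoring order."""
    remaining = Counter(element_key(e) for e in before)
    extra = []
    for element in after:
        key = element_key(element)
        if remaining[key] > 0:
            remaining[key] -= 1  # matched an element from the old response
        else:
            extra.append(element)
    return extra


def missing_elements(before, after):
    """Return elements in `before` that no longer appear in `after`, ignoring order."""
    return added_elements(after, before)


# Made-up responses: the same slide without and with its notes.
without_notes = [
    {"type": "Title", "text": "Slide 1"},
    {"type": "ListItem", "text": "First point"},
]
with_notes = [
    {"type": "Title", "text": "Slide 1"},
    {"type": "NarrativeText", "text": "Speaker notes"},
    {"type": "ListItem", "text": "First point"},
]

added = added_elements(without_notes, with_notes)
missing = missing_elements(without_notes, with_notes)
```

A client that compares responses this way is unaffected by the reordering and only sees the new notes elements; a client that indexes elements by position would be affected.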