Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.66k stars, 707 forks

fix: revert dropping of filename extension for some connectors #3109

Closed: ryannikolaidis closed this 4 months ago

ryannikolaidis commented 4 months ago

The V2 refactor of the ingest code drops the original file extension from filenames. Because the connector upgrade is incomplete, some connectors currently drop the original extension and others keep it. Whether we want to drop the extension at all is still TBD.

This PR reverts only that change in the V2 ingest code, so that the original file extension is preserved downstream.
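For illustration only, here is a minimal sketch of the two naming behaviors being discussed. It is not the actual ingest code: the function names are hypothetical, and the `.json` output suffix is an assumption made for the example.

```python
from pathlib import PurePosixPath


def name_dropping_extension(source: str, suffix: str = ".json") -> str:
    """Hypothetical helper: replace the original extension with the output suffix."""
    path = PurePosixPath(source)
    return str(path.with_suffix(suffix))


def name_preserving_extension(source: str, suffix: str = ".json") -> str:
    """Hypothetical helper: append the output suffix, keeping the original extension."""
    path = PurePosixPath(source)
    return str(path.with_name(path.name + suffix))


# Dropping the extension loses the source file type in the name.
print(name_dropping_extension("docs/report.pdf"))    # docs/report.json
# Preserving it keeps the source file type visible downstream.
print(name_preserving_extension("docs/report.pdf"))  # docs/report.pdf.json
```

Under these assumptions, one visible side effect of dropping the extension is that `docs/report.pdf` and `docs/report.docx` would map to the same name, while the preserving variant keeps them distinct.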

Testing

CI passes with filenames updated via the Ingest Test Fixtures Update workflow.