aryn-ai / sycamore

🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data.
https://sycamore.readthedocs.io
Apache License 2.0

Do the uploading ourselves for textract so we can avoid repeated uploads #173

Open eric-anderson opened 8 months ago

eric-anderson commented 8 months ago

The Textract upload support relies on the default Textract code, which uploads the file from scratch under a new GUID each time. It would be better to upload the file under a content-based hash so that we can avoid repeated uploads and duplicate storage of the same data.
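A minimal sketch of the idea, not the Sycamore implementation: derive the S3 key from a SHA-256 digest of the file contents and upload only when that key is not already present. The helper names (`content_key`, `upload_if_missing`) and the `textract-uploads/` prefix are hypothetical; `s3_client` is assumed to be a boto3 S3 client, of which only `list_objects_v2` and `put_object` are used.

```python
import hashlib
from typing import Tuple


def content_key(data: bytes, prefix: str = "textract-uploads/", suffix: str = "") -> str:
    """Build an S3 key from the SHA-256 of the file contents (hypothetical layout)."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{prefix}{digest}{suffix}"


def upload_if_missing(s3_client, bucket: str, data: bytes, suffix: str = "") -> Tuple[str, bool]:
    """Upload data under its content-based key unless that key already exists.

    Returns (key, uploaded), where uploaded is False if the upload was skipped.
    """
    key = content_key(data, suffix=suffix)
    # List by prefix and compare exactly, so a longer key sharing the prefix
    # is not mistaken for this object.
    resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    exists = any(obj["Key"] == key for obj in resp.get("Contents", []))
    if exists:
        return key, False
    s3_client.put_object(Bucket=bucket, Key=key, Body=data)
    return key, True
```

Identical bytes always map to the same key, so a second call with the same file skips the upload and stores nothing new. The returned key could then be passed to Textract as an S3 object location instead of re-sending the document.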

The implementation here may interact with how we could cache Textract results; see https://github.com/aryn-ai/quickstart/issues/3

HenryL27 commented 1 month ago

Can we close this now that we have caching?