aryn-ai / sycamore

🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data.
https://sycamore.readthedocs.io
Apache License 2.0

aryn-sdk iter over content not lines #593

Closed HenryL27 closed 1 month ago

HenryL27 commented 1 month ago

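The requests iter_lines() method does O(n^2) string building: it keeps prepending the pending partial line to each new chunk, which is bad when you have very big lines (such as with extract_images=True). This PR iterates over the raw content instead and splits it on newlines ourselves, which (I'm like 99% sure) makes it linear.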

FYI, splitting bytes that end with a newline leaves a trailing empty element:

byteses = b"hello\nworld\n"
byteses.split(b'\n') # -> [b'hello', b'world', b'']
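
For illustration, here is a minimal sketch of the general idea, not the code from this PR. It assumes the response arrives as an iterable of byte chunks (for example from requests' iter_content()); the helper name iter_lines_from_chunks is hypothetical. Pieces of an unfinished line are collected in a list and joined once when the newline shows up, so each byte is copied a bounded number of times rather than once per chunk.

```python
from typing import Iterable, Iterator, List


def iter_lines_from_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield newline-delimited lines from a stream of byte chunks.

    Hypothetical helper for illustration; not the aryn-sdk implementation.
    """
    partial: List[bytes] = []  # pieces of the line that is still incomplete
    for chunk in chunks:
        pieces = chunk.split(b"\n")
        if len(pieces) == 1:
            # No newline in this chunk: stash it without concatenating.
            partial.append(chunk)
            continue
        # The first piece finishes the pending line; join exactly once.
        partial.append(pieces[0])
        yield b"".join(partial)
        # Middle pieces are complete lines on their own.
        for piece in pieces[1:-1]:
            yield piece
        # The last piece starts the next line (b"" if the chunk ended in \n).
        partial = [pieces[-1]]
    tail = b"".join(partial)
    if tail:
        # Emit trailing data that was not newline-terminated.
        yield tail


# Example with chunk boundaries that fall in the middle of lines.
lines = list(iter_lines_from_chunks([b"hel", b"lo\nwor", b"ld\n"]))
```

With a real response, the chunks could come from something like resp.iter_content(chunk_size=None) on a streamed request. Note that the trailing empty element from the split is dropped here rather than yielded as an empty line.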

Also updated the test cases, since we changed how APS sorts elements. While I was at it, I pretty-printed the JSON files to simplify the mocking code.

adapted from #589