Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

[Document Intelligence] Stream response for files with large text content to prevent OOM event #37750

Open kevinkupski opened 2 weeks ago

kevinkupski commented 2 weeks ago

Is your feature request related to a problem? Please describe. We see high memory usage in production (which leads to out-of-memory errors) when users upload files with a lot of textual content to our app, which uses Document Intelligence. For a test file with ~200,000 characters, about 240 MB of memory is allocated when calling poller.result(), but the relevant content we extract (strings) is only about 10 MB.
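
For reference, a rough sketch of how such a measurement can be reproduced, assuming `poller` was obtained from a begin_analyze_document call; tracemalloc only tracks Python-level allocations, so the numbers are approximations:

```python
import tracemalloc

tracemalloc.start()
result = poller.result()                                  # full AnalyzeResult object graph
_, full_peak = tracemalloc.get_traced_memory()            # peak while deserializing the response

texts = [p.content for p in (result.paragraphs or [])]    # just the strings we actually need
text_bytes = sum(len(t.encode("utf-8")) for t in texts)

print(f"AnalyzeResult peak: ~{full_peak / 1e6:.0f} MB, extracted text: ~{text_bytes / 1e6:.0f} MB")
```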

It looks like the relevant code for this is located here. Does anybody have an approach or idea for limiting memory usage?

Describe the solution you'd like We'd like to reduce the data held in memory. It looks like the API does not provide this, but we would like to stream the result from Document Intelligence and process it chunk by chunk, perhaps as JSON Lines or any other streamable data format.
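
Purely as an illustration of the ask (no such endpoint exists today): if the result could be delivered as JSON Lines with, say, one paragraph object per line, it could be consumed incrementally without building the full result in memory. The field name "content" mirrors the current response shape and is an assumption here.

```python
import json
from typing import Iterable, Iterator

def iter_paragraph_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield paragraph text from a hypothetical JSON Lines result stream."""
    for line in lines:
        if line.strip():
            yield json.loads(line)["content"]
```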

Describe alternatives you've considered Alternatively, we only require the paragraphs field and could discard the rest of the response to reduce its size, similar to a select on the fields of the response. This would not scale as well as the streaming approach, but it might improve our current situation somewhat.
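
A minimal client-side approximation of that "select" idea, assuming the current SDK: keep only the paragraph text and drop the rest of the result as soon as possible. Note this reduces steady-state memory but not the peak reached while poller.result() deserializes the response.

```python
result = poller.result()
paragraphs = [p.content for p in (result.paragraphs or [])]
del result  # allow the large AnalyzeResult object graph to be garbage-collected
```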

xiangyan99 commented 2 weeks ago

Thanks for the feedback, we’ll investigate asap.

bojunehsu commented 10 hours ago

There is currently no built-in mechanism to address this. A few possible workarounds:

kevinkupski commented 10 hours ago

@bojunehsu thank you for the feedback and the hint about json_stream. Will have a look into that. 👍
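
For context, a minimal sketch of how a streaming JSON parser such as json-stream might be applied here, assuming the analyze result JSON is fetched directly from the REST endpoint (the Operation-Location URL from the initial analyze call) as a streaming HTTP response; the URL, key header, field order, and process() handler are assumptions for illustration only:

```python
import requests
import json_stream.requests  # requests integration from the json-stream package

result_url = "<operation-location-url>"                    # placeholder: Analyze Result GET endpoint
headers = {"Ocp-Apim-Subscription-Key": "<key>"}           # placeholder credentials

with requests.get(result_url, headers=headers, stream=True) as response:
    response.raise_for_status()
    data = json_stream.requests.load(response)             # lazy, single-pass view of the JSON
    # Visit only the fields we need; the rest is skipped without being kept in memory.
    for paragraph in data["analyzeResult"]["paragraphs"]:
        process(paragraph["content"])                      # process() is a hypothetical handler
```

Because json-stream is single-pass, fields must be read in the order they appear in the response; anything not accessed is simply skipped rather than materialized.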