
[Azure AI Services] Document Analysis SDK Response - Memory Issue #41109

Open mehmetaltuntas opened 5 months ago

mehmetaltuntas commented 5 months ago

Library name and version

Azure.AI.FormRecognizer 4.1.0

Query/Question

Hi,

I am using the Azure.AI.FormRecognizer SDK 4.1.0 with the prebuilt-read model. I call AnalyzeDocumentAsync, receive an Operation object, and then loop through the result to pull out content, tables, etc., as sketched below.
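
A minimal sketch of this call pattern (not from the original report; the endpoint, key, and file path are placeholders):

using System;
using System.IO;
using Azure;
using Azure.AI.FormRecognizer.DocumentAnalysis;

var client = new DocumentAnalysisClient(
    new Uri("<endpoint>"), new AzureKeyCredential("<api-key>"));

await using FileStream documentStream = File.OpenRead("<file-path>");

// The entire analysis is materialized as a single AnalyzeResult in memory;
// for large documents this is where the big allocations show up.
AnalyzeDocumentOperation operation = await client.AnalyzeDocumentAsync(
    WaitUntil.Completed, "prebuilt-read", documentStream);
AnalyzeResult result = operation.Value;

Console.WriteLine($"Extracted {result.Content.Length} characters from {result.Pages.Count} pages.");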

When I send a file of 10 MB or more and wait for completion via WaitUntil.Completed, memory usage climbs to as much as 1.2 GB once the response arrives. This causes an out-of-memory exception in a container with limited memory; I deploy to Azure Container Apps as a job/console app.

I also tried calling the REST API directly, but I cannot even receive the response result object in Postman.

I would like to know the best approach for using the SDK, and for handling the size of the response object, so that it doesn't cause memory issues.

Thanks

Environment

.NET SDK:
  Version:  7.0.203
  Commit:   5b005c19f5

Runtime Environment:
  OS Name:     Windows
  OS Version:  10.0.22621
  OS Platform: Windows
  RID:         win10-x64
  Base Path:   C:\Program Files\dotnet\sdk\7.0.203\

Host:
  Version:      7.0.5
  Architecture: x64
  Commit:       8042d61b17

.NET SDKs installed:
  7.0.203 [C:\Program Files\dotnet\sdk]

jsquire commented 5 months ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

mehmetaltuntas commented 5 months ago

Does Form Recognizer have a feature to drop the analyzed document result into a blob, so that the large result object is never held by the caller? That is what causes the memory exception in the first place.

kinelski commented 5 months ago

Hello @mehmetaltuntas,

Have you tried using the Pages analyze option? It allows you to specify a subset of pages of your document to be analyzed. You should be able to reduce memory allocation by analyzing the document in batches.

Usage example:

var options = new AnalyzeDocumentOptions() {
    Pages = { "1-50" }
};
var operation = client.AnalyzeDocument(WaitUntil.Completed, <model-id>, <document>, options);
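
Building on that example, here is a sketch of the batching loop (not from this thread; it assumes totalPages is known in advance, reuses the client and document stream from earlier, and ProcessBatch is a hypothetical helper that consumes one batch):

const int batchSize = 50;
for (int first = 1; first <= totalPages; first += batchSize)
{
    int last = Math.Min(first + batchSize - 1, totalPages);
    var options = new AnalyzeDocumentOptions { Pages = { $"{first}-{last}" } };

    var operation = await client.AnalyzeDocumentAsync(
        WaitUntil.Completed, "<model-id>", documentStream, options);

    // Process each batch as it completes so the previous AnalyzeResult
    // can be collected before the next request is made.
    ProcessBatch(operation.Value);

    documentStream.Position = 0; // rewind before reusing the stream
}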

Please let me know if this solves your problem.

github-actions[bot] commented 5 months ago

Hi @mehmetaltuntas. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario, please respond to the question asked above. This will help us address your issue more accurately.

mehmetaltuntas commented 5 months ago


Thanks for that, @kinelski - I will give it a go. It does force the caller to add logic to reassemble the completed result for the whole document, and while it could also help with chunking, it may not be the best fit for that.

BlackGad commented 2 months ago

It seems this version of the client does not support paging for DOCX files:

Invalid argument.
Status: 400 (Bad Request)
ErrorCode: InvalidArgument

Content:
{"error":{"code":"InvalidArgument","message":"Invalid argument.","innererror":{"code":"InvalidParameter","message":"The parameter pages is invalid: The page range is unsupported."}}}
BlackGad commented 2 months ago

Does Form Recognizer have a feature to drop the analyzed document result into a blob, so that the large result object is never held by the caller? That is what causes the memory exception in the first place.

I'm also very interested in such a feature. Unfortunately, the library currently doesn't support streaming responses, so the ability to directly save results to blob storage would be a great workaround for us as well.