This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
Describe the bug
When sending a large TIF file (around 500 MB, 661 pages) to the Read API with the Form Recognizer SDK library (coordinates: com.azure:azure-ai-formrecognizer:4.1.5), I see very high heap usage while the SDK processes the result of DocumentAnalysisAsyncClient#beginAnalyzeDocument, before my own subscriber callback is invoked. The problem occurs in com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.toAnalyzeResultOperation, specifically when polygon points are generated in the toDocumentWords method.
Exception or Stack Trace
java.lang.OutOfMemoryError: Java heap space
at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.toPolygonPoints(Transforms.java:661) ~[?:?]
at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.lambda$toDocumentWords$20(Transforms.java:803) ~[?:?]
at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms$$Lambda$1639/0x00000281420ce140.apply(Unknown Source) ~[?:?]
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) ~[?:?]
at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.toDocumentWords(Transforms.java:809) ~[?:?]
at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.lambda$toAnalyzeResultOperation$3(Transforms.java:158) ~[?:?]
at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms$$Lambda$1638/0x00000281420cce88.apply(Unknown Source) ~[?:?]
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) ~[?:?]
at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.lambda$toAnalyzeResultOperation$4(Transforms.java:161) ~[?:?]
at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms$$Lambda$1636/0x0000028142041208.apply(Unknown Source) ~[?:?]
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) ~[?:?]
at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.toAnalyzeResultOperation(Transforms.java:167) ~[?:?]
at com.azure.ai.formrecognizer.documentanalysis.DocumentAnalysisAsyncClient.lambda$beginAnalyzeDocument$12(DocumentAnalysisAsyncClient.java:433) ~[?:?]
at com.azure.ai.formrecognizer.documentanalysis.DocumentAnalysisAsyncClient$$Lambda$1633/0x00000281420c7a88.apply(Unknown Source) ~[?:?]
To Reproduce
Steps to reproduce the behavior:
Using the prebuilt-read model, send a large TIF file (600+ pages) for OCR. This should trigger the problem.
Expected behavior
A callback that is invoked after each page has been processed, so that we can stream the results to disk instead of holding everything in memory until the final object becomes available.
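To illustrate the shape of the requested behavior, here is a minimal, self-contained sketch. It does not use the Azure SDK, and nothing in it is an existing SDK API: `PageResult`, `analyzePages`, and `streamPages` are hypothetical names invented for this example, and the page producer is simulated. It only shows the idea of handing each page to a callback and writing it to disk immediately, so that no full result list is ever built.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

public class PageStreamingSketch {

    /** Hypothetical per-page result; NOT a type from the Azure SDK. */
    static final class PageResult {
        final int pageNumber;
        final String text;

        PageResult(int pageNumber, String text) {
            this.pageNumber = pageNumber;
            this.text = text;
        }
    }

    /**
     * Hypothetical stand-in for the requested hook: produces pages one at a
     * time and passes each to the callback instead of collecting them.
     */
    static void analyzePages(int pageCount, Consumer<PageResult> onPage) {
        for (int i = 1; i <= pageCount; i++) {
            // Simulated page; a real implementation would convert one page
            // of the service response here.
            onPage.accept(new PageResult(i, "text of page " + i));
        }
    }

    /** Writes each page to disk as it arrives; returns the number of pages written. */
    static int streamPages(int pageCount, Path output) throws IOException {
        int[] written = {0};
        try (BufferedWriter writer = Files.newBufferedWriter(output)) {
            analyzePages(pageCount, page -> {
                try {
                    writer.write(page.pageNumber + "\t" + page.text);
                    writer.newLine();
                    written[0]++;
                } catch (IOException e) {
                    // Consumer cannot throw checked exceptions, so wrap it.
                    throw new UncheckedIOException(e);
                }
            });
        }
        return written[0];
    }

    public static void main(String[] args) throws IOException {
        Path output = Files.createTempFile("ocr-pages", ".tsv");
        int pages = streamPages(661, output);
        System.out.println("Wrote " + pages + " pages to " + output);
    }
}
```

With a hook of this kind, only the page currently being handled needs to stay reachable, so peak heap usage would depend on the largest page rather than on the whole document.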
Screenshots
N/A
Setup (please complete the following information):
Information Checklist
Kindly make sure that you have added all of the information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.
Additional context
N/A