Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.

[BUG] #42511

Open kpentaris opened 3 hours ago

kpentaris commented 3 hours ago

Describe the bug

When sending a large TIF file (around 500 MB, 661 pages) to the Read API with the Form Recognizer SDK (coordinates: com.azure:azure-ai-formrecognizer:4.1.5), I get very high heap usage while the SDK processes the result of DocumentAnalysisAsyncClient#beginAnalyzeDocument, before my own subscriber callback is entered. In this case, the problem occurs in com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.toAnalyzeResultOperation, specifically when polygon points are generated in the toDocumentWords method.

Exception or Stack Trace

java.lang.OutOfMemoryError: Java heap space
    at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.toPolygonPoints(Transforms.java:661) ~[?:?]
    at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.lambda$toDocumentWords$20(Transforms.java:803) ~[?:?]
    at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms$$Lambda$1639/0x00000281420ce140.apply(Unknown Source) ~[?:?]
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) ~[?:?]
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
    at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.toDocumentWords(Transforms.java:809) ~[?:?]
    at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.lambda$toAnalyzeResultOperation$3(Transforms.java:158) ~[?:?]
    at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms$$Lambda$1638/0x00000281420cce88.apply(Unknown Source) ~[?:?]
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) ~[?:?]
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
    at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.lambda$toAnalyzeResultOperation$4(Transforms.java:161) ~[?:?]
    at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms$$Lambda$1636/0x0000028142041208.apply(Unknown Source) ~[?:?]
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) ~[?:?]
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
    at com.azure.ai.formrecognizer.documentanalysis.implementation.util.Transforms.toAnalyzeResultOperation(Transforms.java:167) ~[?:?]
    at com.azure.ai.formrecognizer.documentanalysis.DocumentAnalysisAsyncClient.lambda$beginAnalyzeDocument$12(DocumentAnalysisAsyncClient.java:433) ~[?:?]
    at com.azure.ai.formrecognizer.documentanalysis.DocumentAnalysisAsyncClient$$Lambda$1633/0x00000281420c7a88.apply(Unknown Source) ~[?:?]

To Reproduce

Steps to reproduce the behavior: using the prebuilt-read model, send a large TIF file (600+ pages) for OCR. This should trigger the problem.

Code Snippet

CompletableFuture<AnalyzeResult> cf = new CompletableFuture<>();
Duration timeout = Duration.ofSeconds(120);

// Async client with every HTTP timeout set to 120 seconds.
DocumentAnalysisAsyncClient clientAsync = new DocumentAnalysisClientBuilder()
    .credential(new AzureKeyCredential("my-key"))
    .endpoint("https://westeurope.api.cognitive.microsoft.com")
    .clientOptions(new HttpClientOptions()
        .setConnectTimeout(timeout)
        .setReadTimeout(timeout)
        .setResponseTimeout(timeout)
        .setWriteTimeout(timeout)
        .setConnectionIdleTimeout(timeout))
    .buildAsyncClient();

// inputStream and contentLength refer to the TIF file being uploaded.
// The OutOfMemoryError is thrown inside the SDK, before cf::complete is reached.
clientAsync.beginAnalyzeDocument("prebuilt-read", BinaryData.fromStream(inputStream, contentLength))
    .log(Loggers.getLogger("FormRecognizerEngineLogger"), Level.INFO, false, SignalType.ON_SUBSCRIBE, SignalType.ON_COMPLETE, SignalType.ON_ERROR)
    .flatMap(AsyncPollResponse::getFinalResult)
    .subscribe(cf::complete);

Expected behavior

A callback that is invoked after each page has been processed, so that we can stream the results to disk instead of having to load everything into memory before we get access to the final object.
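To illustrate the kind of bounded-memory processing requested here, the following is a minimal sketch using only the JDK. It is not part of the SDK and not a confirmed fix: PagedOcrSketch, pageRanges, streamToDisk, and the analyzeRange function are hypothetical names. It assumes the document can be analyzed in page ranges (for example, by passing each range to a pages option on the analyze call, which should be verified against the SDK version in use) and that each batch of results is written to disk before the next batch is requested.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper for illustration only; not part of the Azure SDK. */
final class PagedOcrSketch {

    /** Splits pages 1..totalPages into inclusive ranges such as "1-100". */
    static List<String> pageRanges(int totalPages, int batchSize) {
        if (totalPages < 1 || batchSize < 1) {
            throw new IllegalArgumentException("totalPages and batchSize must be positive");
        }
        List<String> ranges = new ArrayList<>();
        for (int start = 1; start <= totalPages; start += batchSize) {
            int end = Math.min(start + batchSize - 1, totalPages);
            ranges.add(start == end ? String.valueOf(start) : start + "-" + end);
        }
        return ranges;
    }

    /**
     * Calls analyzeRange once per range and appends its output to a file,
     * so only one batch of results is held in memory at a time.
     * analyzeRange stands in for a real service call (an assumption).
     */
    static void streamToDisk(List<String> ranges,
                             Function<String, List<String>> analyzeRange,
                             Path out) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(out)) {
            for (String range : ranges) {
                for (String line : analyzeRange.apply(range)) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // 661 pages in batches of 100 pages each.
        List<String> ranges = pageRanges(661, 100);
        System.out.println(ranges);

        Path out = Files.createTempFile("ocr-", ".txt");
        // Fake analyzer: returns one line of text per requested range.
        streamToDisk(ranges, range -> List.of("text for pages " + range), out);
        System.out.println("Wrote results to " + out);
    }
}
```

In a real integration, analyzeRange would wrap a blocking or reactive call for a single page range, and the batch size would be tuned so that one batch of results fits comfortably in the available heap.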


github-actions[bot] commented 3 hours ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.