Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.

[BUG] UndeliverableException when creating resource group and network security group in heavy load #33056

Open wangwenbj opened 1 year ago

wangwenbj commented 1 year ago

Describe the bug

We encountered the following errors under heavy load when creating resource groups and network security groups with the new version of the Azure Java SDK. The HTTP client is OkHttpClient. This issue did not happen with the old RxJava version, though.

Exception or Stack Trace

Exception in thread "RxCachedThreadScheduler-141" io.reactivex.rxjava3.exceptions.UndeliverableException: The exception could not be delivered to the consumer because it has already canceled/disposed the flow or the exception has nowhere to go to begin with. Further reading: https://github.com/ReactiveX/RxJava/wiki/What's-different-in-2.0#error-handling | reactor.core.Exceptions$ReactiveException: java.lang.InterruptedException
    at io.reactivex.rxjava3.plugins.RxJavaPlugins.onError(RxJavaPlugins.java:372)
    at io.reactivex.rxjava3.internal.operators.single.SingleFromCallable.subscribeActual(SingleFromCallable.java:49)
    at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
    at io.reactivex.rxjava3.internal.operators.single.SingleResumeNext.subscribeActual(SingleResumeNext.java:39)
    at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
    at io.reactivex.rxjava3.internal.operators.single.SingleSubscribeOn$SubscribeOnObserver.run(SingleSubscribeOn.java:89)
    at io.reactivex.rxjava3.core.Scheduler$DisposeTask.run(Scheduler.java:644)
    at io.reactivex.rxjava3.internal.schedulers.ScheduledRunnable.run(ScheduledRunnable.java:65)
    at io.reactivex.rxjava3.internal.schedulers.ScheduledRunnable.call(ScheduledRunnable.java:56)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

To Reproduce

This issue cannot be reproduced easily. It happens every now and then in our production environment, and we have nowhere to catch and handle it.

We sometimes encounter this issue during large-scale resource group creation. I have reproduced it only once locally, by provisioning 100 resource groups in parallel.

Code Snippet

ResourceGroup.DefinitionStages.WithCreate creator = this.azureResoureManager.resourceGroups().define(resourceGroupName)
        .withRegion(region);
return ReactorToRxV3Interop.monoToSingle(creator.createAsync());

Expected behavior

No exception happens, or if one does, we have a way to catch it inside the reactor chain.

Screenshots

API error. No screenshots.

Additional context

This part of the log is what we catch in our customized OkHttp interceptor. However, after the exception is thrown, the upper chain lost track of this exception, which caused the chain to never stop.

2023-01-11T17:05:44.011Z [trace_id=9492315ecd8cdf9e9db291d40c42e57b] [transaction_id=1e99ae844e81ce79] ERROR [gement.azure.com/...] .i.i.AzureResilienceInterceptorImpl.logRetryInfoForError:506 - Exception: java.io.IOException: Canceled
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at com.vmware.horizon.sg.clouddriver.impl.azure.internal.interceptor.AzureResilienceInterceptorImpl.intercept(AzureResilienceInterceptorImpl.java:117)
    at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:87)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at com.vmware.horizon.sg.clouddriver.impl.azure.internal.DynamicThrottleInterceptor.intercept(DynamicThrottleInterceptor.java:80)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.logging.HttpLoggingInterceptor.intercept(HttpLoggingInterceptor.kt:221)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

joshfree commented 1 year ago

Thank you for reaching out to us via this github issue, @wangwenbj. @weidongxu-microsoft will be able to help route your issue further. Please note that if this problem requires immediate attention, please refer to Azure support plan details here: https://github.com/Azure/azure-sdk-for-java/blob/main/SUPPORT.md#support

weidongxu-microsoft commented 1 year ago

@wangwenbj

What is the version of the SDK? What is the version of azure-core-http-okhttp?

Also, may I ask why you chose OkHttpClient over NettyClient?

wangwenbj commented 1 year ago

Hi Weidong,

Below is what we are using:

com.azure.resourcemanager azure-resourcemanager 2.19.0
    com.azure azure-core-http-netty
com.azure azure-identity 1.5.4
    com.azure azure-core-http-netty
com.azure azure-core-http-okhttp 1.11.1

Let me know if you have any questions.

Best regards, Wen

wangwenbj commented 1 year ago

Hi Weidong,

Any updates? We used the OkHttp client with the previous Azure SDK and have implementations that adapt to Azure quota limits, which are pretty hard to port to Netty.

Best regards, Wen

XiaofeiCao commented 1 year ago

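Hi @wangwenbj, I've tried creating 100 resource groups multiple times but was not able to reproduce the issue...

You can refer to this doc for throttling control: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/resourcemanager/docs/THROTTLING.md

P.S. You don't have to write your own ReactorToRxV3Interop. There's official support for converting a Mono to an RxJava 3 Single: https://projectreactor.io/docs/adapter/release/api/reactor/adapter/rxjava/RxJava3Adapter.html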

wangwenbj commented 1 year ago

Hi Xiaofei,

I could not easily reproduce this issue locally either, yet it keeps occurring, practically every day. Is there anything you can think of that could have caused this issue?

  1. For the Azure client, we used OkHttp clients before and have several interceptors implemented. Switching to Netty could be a huge amount of work, so please take a look at how we could handle this with OkHttpClient.
  2. Thanks for the information about the Reactor adapter; we can make this change.

Best regards, Wen

XiaofeiCao commented 1 year ago

OK, got it.

Is there anything you can think of that could have caused this issue?

I'm not sure. From the log I can't tell the root cause of the exception. And regarding your description:

after the exception is thrown, the upper chain lost track of this exception, which caused the chain to never stop.

I don't quite understand. Can you elaborate on this? What do you mean by "never stop"?

wangwenbj commented 1 year ago

Hi Xiaofei,

Here's some context for this issue. Let me know if this still does not answer your questions.

  1. Logged error. The same UndeliverableException trace was logged twice at 17:05:44.010, the first time prefixed with Exception in thread "RxCachedThreadScheduler-141"; it is shown once below.

Jan 11, 2023 @ 17:05:44.010

io.reactivex.rxjava3.exceptions.UndeliverableException: The exception could not be delivered to the consumer because it has already canceled/disposed the flow or the exception has nowhere to go to begin with. Further reading: https://github.com/ReactiveX/RxJava/wiki/What's-different-in-2.0#error-handling | reactor.core.Exceptions$ReactiveException: java.lang.InterruptedException
    at io.reactivex.rxjava3.plugins.RxJavaPlugins.onError(RxJavaPlugins.java:372)
    at io.reactivex.rxjava3.internal.operators.single.SingleFromCallable.subscribeActual(SingleFromCallable.java:49)
    at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
    at io.reactivex.rxjava3.internal.operators.single.SingleResumeNext.subscribeActual(SingleResumeNext.java:39)
    at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
    at io.reactivex.rxjava3.internal.operators.single.SingleSubscribeOn$SubscribeOnObserver.run(SingleSubscribeOn.java:89)
    at io.reactivex.rxjava3.core.Scheduler$DisposeTask.run(Scheduler.java:644)
    at io.reactivex.rxjava3.internal.schedulers.ScheduledRunnable.run(ScheduledRunnable.java:65)
    at io.reactivex.rxjava3.internal.schedulers.ScheduledRunnable.call(ScheduledRunnable.java:56)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: reactor.core.Exceptions$ReactiveException: java.lang.InterruptedException
    at reactor.core.Exceptions.propagate(Exceptions.java:396)
    at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:91)
    at reactor.core.publisher.Mono.block(Mono.java:1742)
    at com.azure.resourcemanager.resources.implementation.DeploymentsClientImpl.checkExistence(DeploymentsClientImpl.java:7569)
    at com.azure.resourcemanager.resources.implementation.DeploymentsImpl.checkExistence(DeploymentsImpl.java:102)
    at com.vmware.horizon.sg.clouddriver.impl.azure.v2.operator.DeploymentOperator.isDeploymentExist(DeploymentOperator.java:46)
    at com.vmware.horizon.sg.clouddriver.impl.azure.v2.CloudDriverAzureV2.lambda$isDeploymentExist$29(CloudDriverAzureV2.java:794)
    at io.reactivex.rxjava3.internal.operators.single.SingleFromCallable.subscribeActual(SingleFromCallable.java:43)
    ... 12 more
Caused by: java.lang.InterruptedException
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(Unknown Source)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source)
    at java.base/java.util.concurrent.CountDownLatch.await(Unknown Source)
    at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:87)
    ... 18 more

Jan 11, 2023 @ 17:05:44.011

[trace_id=9492315ecd8cdf9e9db291d40c42e57b] [transaction_id=1e99ae844e81ce79] ERROR [gement.azure.com/...] .i.i.AzureResilienceInterceptorImpl.logRetryInfoForError:506 - Exception: java.io.IOException: Canceled
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at com.vmware.horizon.sg.clouddriver.impl.azure.internal.interceptor.AzureResilienceInterceptorImpl.intercept(AzureResilienceInterceptorImpl.java:117)
    at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:87)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at com.vmware.horizon.sg.clouddriver.impl.azure.internal.DynamicThrottleInterceptor.intercept(DynamicThrottleInterceptor.java:80)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.logging.HttpLoggingInterceptor.intercept(HttpLoggingInterceptor.kt:221)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

Jan 11, 2023 @ 17:05:44.011

returnCode=uioe HEAD https://management.azure.com/subscriptions/da5d81e1-1138-4026-a898-ce9a1ad280d1/resourcegroups/vmw-hcs-63bed6164b4f5924004bd0bd-63bee7b53d8ff07a670fdc16/providers/Microsoft.Resources/deployments/vmw-hcs-63bed6164b4f5924004bd0bd-63bee7b53d8ff07a670fdc16-nsg?api-version=2021-01-01

Jan 11, 2023 @ 17:05:44.011

[trace_id=9492315ecd8cdf9e9db291d40c42e57b] [transaction_id=1e99ae844e81ce79] INFO [gement.azure.com/...] okhttp3.OkHttpClient.log:133 - --> HEAD https://management.azure.com/subscriptions/da5d81e1-1138-4026-a898-ce9a1ad280d1/resourcegroups/vmw-hcs-63bed6164b4f5924004bd0bd-63bee7b53d8ff07a670fdc16/providers/Microsoft.Resources/deployments/vmw-hcs-63bed6164b4f5924004bd0bd-63bee7b53d8ff07a670fdc16-nsg?api-version=2021-01-01

Jan 11, 2023 @ 17:05:44.011

[trace_id=9492315ecd8cdf9e9db291d40c42e57b] [transaction_id=1e99ae844e81ce79] INFO [gement.azure.com/...] okhttp3.OkHttpClient.log:133 - <-- HTTP FAILED: java.io.IOException: Canceled

  2. How to reproduce it. Provision multiple resource groups and then create NSGs, in parallel, under heavy load.

  3. How the flow works when creating the NSG. After creating the NSG, we create a key vault, then run some network (subnet) queries, and then create a key vault if required.

  4. Can you please share a sample of code.

Create resource group:

public Single createAsync(String resourceGroupName, String region, Map<String, String> tags) {
    ResourceGroup.DefinitionStages.WithCreate creator = this.azure.resourceGroups().define(resourceGroupName)
            .withRegion(region);
    if (MapUtils.isNotEmpty(tags)) {
        creator.withTags(tags);
    }
    return ReactorToRxV3Interop.monoToSingle(creator.createAsync());
}

Check whether the NSG exists:

public Single getSecurityGroupByName(String securityGroupName, String resourceGroupName) {
    return ReactorToRxV3Interop.monoToSingle(this.networkSecurityGroups()
            .getByResourceGroupAsync(resourceGroupName, securityGroupName)
            .filter(Objects::nonNull));
}

Create:

public Single createAsyncByArmTemplate(String deploymentName, String resourceGroupName, ARMTemplateBuilder.Template armTemplate) throws IOException {
    Mono deploymentMono = this.deployments().define(deploymentName)
            .withExistingResourceGroup(resourceGroupName)
            .withTemplate(armTemplate.template)
            .withParameters(armTemplate.parameters)
            .withMode(DeploymentMode.INCREMENTAL)
            .createAsync();
    return ReactorToRxV3Interop.monoToSingle(deploymentMono);
}

Reactor to RxJava interop helpers:

public class ReactorToRxV3Interop {

    public static <T> Single<T> monoToSingle(Mono<T> singleSource) {
        return Single.fromPublisher(singleSource);
    }

    public static <T> Observable<T> fluxToObservable(Flux<T> fluxSource) {
        return Flowable.fromPublisher(fluxSource).toObservable();
    }

    public static <T> Completable monoToCompletable(Mono<T> monoSource) {
        return new CompletableFromPublisher<>(monoSource);
    }

}

  5. Can you please share with us the name of the new and old SDK.

    Old SDK:

    com.microsoft.azure azure 1.41.3

    New SDK:

    com.azure.resourcemanager azure-resourcemanager 2.19.0
        com.azure azure-core-http-netty

Best regards, Wen

XiaofeiCao commented 1 year ago

Thanks @wangwenbj

I saw that a blocking get operation was canceled in DynamicThrottleInterceptor (Exception: java.io.IOException: Canceled), and an InterruptedException was thrown. You may want some special error handling here, as described in the RxJava 3 error-handling notes (https://github.com/ReactiveX/RxJava/wiki/What's-different-in-2.0#error-handling):

In addition, some 3rd party libraries/code throw when they get interrupted by a cancel/dispose call which leads to an undeliverable exception most of the time. Internal changes in 2.0.6 now consistently cancel or dispose a Subscription/Disposable before cancelling/disposing a task or worker (which causes the interrupt on the target thread).

// in some library
try {
    doSomethingBlockingly();
} catch (InterruptedException ex) {
    // check if the interrupt is due to cancellation
    // if so, no need to signal the InterruptedException
    if (!disposable.isDisposed()) {
        observer.onError(ex);
    }
}

If the library/code already did this, the undeliverable InterruptedExceptions should stop now. If this pattern was not employed before, we encourage updating the code/library in question.
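The quoted pattern can be tried out without RxJava. What follows is only a minimal sketch using java.util.concurrent: `CancellableBlockingTask` and all of its methods are hypothetical names invented for illustration, not RxJava or Azure SDK types, and a latch wait stands in for a blocking call such as the `Mono.block()` that appears in the reported trace. It shows the one idea from the quote: set the "disposed" flag before interrupting, and only report the InterruptedException when the flag is not set.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration class; not an RxJava or Azure SDK type.
final class CancellableBlockingTask {
    private final AtomicBoolean disposed = new AtomicBoolean(false);
    private final List<Throwable> reportedErrors = new CopyOnWriteArrayList<>();
    private final Thread worker = new Thread(this::run, "blocking-worker");

    private void run() {
        try {
            // Stand-in for a blocking call: waits on a latch that is never released.
            new CountDownLatch(1).await();
        } catch (InterruptedException ex) {
            // Report the interrupt only if it was not caused by our own dispose().
            if (!disposed.get()) {
                reportedErrors.add(ex);
            }
        }
    }

    void start() {
        worker.start();
    }

    // Sets the flag BEFORE interrupting, so the worker can tell the two cases apart.
    void dispose() {
        disposed.set(true);
        interruptAndJoin();
    }

    // Simulates an interrupt that does not come from cancellation.
    void interruptWithoutDispose() {
        interruptAndJoin();
    }

    private void interruptAndJoin() {
        worker.interrupt();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    List<Throwable> reportedErrors() {
        return reportedErrors;
    }
}
```

After `start()` followed by `dispose()`, `reportedErrors()` stays empty; after `start()` followed by `interruptWithoutDispose()`, it contains the single InterruptedException. The ordering inside `dispose()` is the important design choice: interrupting first would leave a window in which the worker still sees the flag as unset.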

By the way, could you show me the code snippet of DynamicThrottleInterceptor, please?

wangwenbj commented 1 year ago

Thanks, Xiaofei,

I checked the call sequence; we did not create any Rx objects there, so there is no way our code could have generated this issue.

As for the DynamicThrottleInterceptor, I modified the code to show just the logic. It is basically used to prevent 429 responses by delaying requests before they happen. Please check:

@Override
public @NotNull Response intercept(@NotNull Chain chain) throws IOException {
    Request request = chain.request();
    String requestMethod = request.method();
    String requestUrl = request.url().toString();

    // Record our recognized operationType for dev purpose
    Set<String> requestOperationType = getQuotaTypes(requestMethod, requestUrl);
    long delay = getQuotaDelay(requestMethod, requestUrl, clientId);
    if (delay > 0) {
        // Placeholder in this simplified snippet
        throw new Exception();
    }

    Response response = chain.proceed(request); // Call Azure.
    return response;
}

Best regards,

Wen

XiaofeiCao commented 1 year ago

OK, our track1 library uses RxJava (1.x) and your code uses RxJava 3. Error handling has changed since RxJava 2, especially for undeliverable errors:

One important design requirement for 2.x is that no Throwable errors should be swallowed. This means errors that can't be emitted because the downstream's lifecycle already reached its terminal state or the downstream cancelled a sequence which was about to emit an error.

My best guess is that this error also occurred with the old RxJava but was swallowed. You can try adding a global error handler in RxJava 3 that treats each exception according to whether it represents a likely bug or an ignorable application/network state, as described in the error-handling section:

RxJavaPlugins.setErrorHandler(e -> {
    if (e instanceof UndeliverableException) {
        e = e.getCause();
    }
    if ((e instanceof IOException) || (e instanceof SocketException)) {
        // fine, irrelevant network problem or API that throws on cancellation
        return;
    }
    if (e instanceof InterruptedException) {
        // fine, some blocking code was interrupted by a dispose call
        return;
    }
    if ((e instanceof NullPointerException) || (e instanceof IllegalArgumentException)) {
        // that's likely a bug in the application
        Thread.currentThread().getUncaughtExceptionHandler()
            .uncaughtException(Thread.currentThread(), e);
        return;
    }
    if (e instanceof IllegalStateException) {
        // that's a bug in RxJava or in a custom operator
        Thread.currentThread().getUncaughtExceptionHandler()
            .uncaughtException(Thread.currentThread(), e);
        return;
    }
    Log.warning("Undeliverable exception received, not sure what to do", e);
});

wangwenbj commented 1 year ago

Thanks, Xiaofei,

I tried the RxJavaPlugins global error handler. However, when the UndeliverableException occurs, the global handler cannot stop the Rx chain, so it does not help with the timeout. What I really need is to stop the blocking wait when the error happens and handle the error accordingly. Also, with the previous Azure SDK we used the same code for this logic and the same code for converting RxJava to RxJava 3, and it worked fine for years.

Considering the above, could you please continue investigating the SDK itself with OkHttpClient? We are blocked on our side. I will try switching the client to Netty, which could be a lot of effort and take some time. Until the issue is finally eliminated, could you please keep helping us with it? I think we could focus on the following:

  1. The Azure Java SDK Reactor implementation vs. the previous one, over OkHttpClient
  2. OkHttpClient vs. Netty: what is the difference between the two clients?

Let me know if you have any questions.

Best regards, Wen


XiaofeiCao commented 1 year ago

Sure. Could you confirm what the code on line 80 of DynamicThrottleInterceptor is? I assume the exception originates here?

at com.vmware.horizon.sg.clouddriver.impl.azure.internal.DynamicThrottleInterceptor.intercept(DynamicThrottleInterceptor.java:80)

wangwenbj commented 1 year ago

Really appreciated, Xiaofei,

Line 80 is: Response response = chain.proceed(request); // Call Azure. It performs the call to the Azure service.

Best regards, Wen


XiaofeiCao commented 1 year ago

Hi @wangwenbj, I saw a very similar situation where the chain stalled when a non-IOException was thrown in an interceptor: https://github.com/square/retrofit/issues/3453

I wonder if this is the case here? What did you do with the exception after you logged it in your custom interceptor (AzureResilienceInterceptorImpl.logRetryInfoForError)? Did you wrap it in some other non-IOException?

wangwenbj commented 1 year ago

Thanks, Xiaofei,

We do throw an exception that extends RuntimeException instead of IOException in the interceptor. Let me make the change and see if it helps. Also, this implementation has been there for a long time, so I am curious why this issue did not happen with the old RxJava version of the SDK?

Best regards, Wen


XiaofeiCao commented 1 year ago

Hi @wangwenbj

why this issue is not happening in the old rxjava version SDK?

I'm not sure. Are you using the same version of Okhttp3 as before?

My other speculation is that the Rxjava->Rxjava3 adaptor that you used before behaves differently than the Reactor->Rxjava3 adaptor you are using now. This is pure speculation...

The general good practice (from the official OkHttp docs) is not to throw your own exceptions in interceptors, whether they are IOExceptions or not. Instead, if you want to signal a failure, return a synthetic HTTP response:

 @Throws(IOException::class)
 override fun intercept(chain: Interceptor.Chain): Response {
   if (myConfig.isInvalid()) {
     return Response.Builder()
         .request(chain.request())
         .protocol(Protocol.HTTP_1_1)
         .code(400)
         .message("client config invalid")
         .body("client config invalid".toResponseBody(null))
         .build()
   }

   return chain.proceed(chain.request())
 }

wangwenbj commented 1 year ago

Anyway, thank you, Xiaofei,

Let's keep this issue open. We have already made the change, so let's see whether it helps. I will update the thread once we have some conclusions. It will take some time.

Best regards, Wen


wangwenbj commented 1 year ago

Hi Xiaofei,

We changed our exception to extend IOException instead of RuntimeException, and the issue still occurs. From what I can see, the issue happens before we throw any exception ourselves. The error is as follows:

2023-02-14T18:18:37.744Z [trace_id=4277b25640ee77e705a653b02f8e1f64] [transaction_id=56e2270d14efa859] ERROR [gement.azure.com/...] .i.i.AzureResilienceInterceptorImpl.logRetryInfoForError:506 - Exception: java.io.IOException: Canceled
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at com.vmware.horizon.sg.clouddriver.impl.azure.internal.interceptor.AzureResilienceInterceptorImpl.intercept(AzureResilienceInterceptorImpl.java:117)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at com.vmware.horizon.sg.clouddriver.impl.azure.internal.DynamicThrottleInterceptor.intercept(DynamicThrottleInterceptor.java:80)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.logging.HttpLoggingInterceptor.intercept(HttpLoggingInterceptor.kt:221)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

Best regards, Wen


XiaofeiCao commented 1 year ago

Thanks @wangwenbj, and does the UndeliverableException still persist?

wangwenbj commented 1 year ago

Hi Xiaofei,

The previous UndeliverableException is gone. However, it is now "IOException: Canceled" that is causing the same failure:

2023-02-14T18:18:37.744Z [trace_id=4277b25640ee77e705a653b02f8e1f64] [transaction_id=56e2270d14efa859] ERROR [gement.azure.com/...] .i.i.AzureResilienceInterceptorImpl.logRetryInfoForError:506 - Exception: java.io.IOException: Canceled
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at com.vmware.horizon.sg.clouddriver.impl.azure.internal.interceptor.AzureResilienceInterceptorImpl.intercept(AzureResilienceInterceptorImpl.java:117)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at com.vmware.horizon.sg.clouddriver.impl.azure.internal.DynamicThrottleInterceptor.intercept(DynamicThrottleInterceptor.java:80)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.logging.HttpLoggingInterceptor.intercept(HttpLoggingInterceptor.kt:221)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

Could you please help with checking? Thanks!

Wen

XiaofeiCao commented 1 year ago

I see. The IOException is the same as before, and I'm not sure where it originates. I will look into it.

May I also ask whether your application still stalls because of the exception?

wangwenbj commented 1 year ago

Hi Xiaofei,

This issue still occurs every day (though not on every run) in our stress testing. It has been marked as a blocker, and we would very much like it to be treated as a priority. Thanks!

Best regards, Wen


XiaofeiCao commented 1 year ago

Hi @wangwenbj

I've created a repo to reproduce the IOException: https://github.com/XiaofeiCao/ioexception_repro/blob/main/src/test/java/com/azure/resourcemanager/repro/ioexception/test/BatchCreateResourceGroupTests.java

I tried creating 100 resource groups, 100 times over. So far I can't reproduce the bug before hitting the rate limit.

I might need to simulate your stress test. Is there anything I need to change, like with the OkHttpClient configuration (connection pool size, thread pool size, etc)?

zlishaojiez commented 1 year ago

Hi @XiaofeiCao, the IOException is caused by the OkHttp client call timeout. Does the new Azure SDK (Reactor version) propagate the IOException that the OkHttp client passes to the Callback onFailure function into the Reactor (Mono or Flux) error chain?

package okhttp3

import java.io.IOException

interface Callback { /**

XiaofeiCao commented 1 year ago

Thanks @zlishaojiez , you mean here? https://github.com/Azure/azure-sdk-for-java/blob/00e2e72c82b9804e3b726ff5aa93465cbc3a613a/sdk/core/azure-core-http-okhttp/src/main/java/com/azure/core/http/okhttp/OkHttpAsyncHttpClient.java#L251-L261

XiaofeiCao commented 1 year ago

Hi @wangwenbj @zlishaojiez ,

A quick sync, I was able to reproduce the UndeliverableException along with the IOException: https://github.com/XiaofeiCao/ioexception_repro/tree/main/src/test/java/com/azure/resourcemanager/repro/ioexception/test/undeliverable

I'm starting to look into the UndeliverableException from here.

Meanwhile, I was able to catch the timeout InterruptedException in doOnError on the Single and the chain can be stopped properly:

        try {
            Single.fromPublisher(
                    manager.resourceGroups()
                            .define(RG_NAME)
                            .withRegion(Region.US_WEST)
                            .createAsync())
                    // doOnError on Single
                    .doOnError(throwable -> System.out.println("error encountered, type: " + throwable.getClass()))
                    .blockingSubscribe();
        } finally {
            // ensure the chain is correctly stopped
            System.out.println("finished");
        }

"error encountered" and "finished" can be successfully printed out.

However, I did observe the chain finishing successfully instead of throwing the InterruptedException.

Without the extra Single wrapping, it exited with an exception (which is the correct behavior):

manager.resourceGroups()
       .define(RG_NAME)
       .withRegion(Region.US_WEST)
       .createAsync()
       .block();

Is this what you mean by the upper chain lost track of this exception?

XiaofeiCao commented 1 year ago

Sorry, the UndeliverableException I mentioned previously is simply the documented behavior of blockingSubscribe:

If the current Single signals an error, the Throwable is routed to the global error handler via RxJavaPlugins.onError(Throwable). If the current thread is interrupted, an InterruptedException is routed to the same global error handler.

After switching to blockingGet, no UndeliverableException appears and the InterruptedIOException can correctly be thrown.

The old track1 SDK always has callTimeout set to 0 (@weidongxu-microsoft correct me if I'm wrong), which means no call timeout. This should be why you didn't experience the InterruptedIOException with the track1 SDK.
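To make the effect of a call timeout concrete, here is a JDK-only model. It is a sketch under stated assumptions, not OkHttp's or the SDK's implementation: `CallTimeoutSketch.call` is a hypothetical helper in which a timeout of zero means "wait indefinitely", and a positive timeout cancels the in-flight work and surfaces an InterruptedIOException to the caller.

```java
import java.io.InterruptedIOException;
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class CallTimeoutSketch {

    /**
     * Runs the work on a separate thread and applies an overall deadline to it.
     * A zero callTimeout disables the deadline (hypothetical helper, for illustration only).
     */
    static <T> T call(Callable<T> work, Duration callTimeout) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(work);
            if (callTimeout.isZero()) {
                // No overall deadline: block until the work completes.
                return future.get();
            }
            try {
                return future.get(callTimeout.toMillis(), TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Deadline exceeded: interrupt the in-flight work and report a timeout.
                future.cancel(true);
                throw new InterruptedIOException("timeout");
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With `Duration.ZERO`, slow work simply completes, however long it takes; with a short positive timeout, the same work is interrupted and the caller sees an InterruptedIOException.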

Let me know if this is still a blocking issue to you.

weidongxu-microsoft commented 1 year ago

I think track1 does not set callTimeout (so it is probably the default, 0), but the connect/read timeouts are set: https://github.com/Azure/autorest-clientruntime-for-java/blob/master/client-runtime/src/main/java/com/microsoft/rest/RestClient.java#L264-L265

wangwenbj commented 1 year ago

Hi Xiaofei, Weidong,

Thanks for the reply. The previous code snippet looks similar to our situation when the issue happened. Let me describe where we are so we can move forward.

  1. Track1 is the RxJava version of the Azure SDK, right? Is there a plan to update the Reactor SDK to behave like the old one?
  2. When we used the RxJava version of the Azure SDK, our interface was RxJava 3, and we translated all the RxJava 1 output into RxJava 3.
  3. Now that we use the Reactor version of the Azure SDK, we translate the Reactor interface to RxJava 3, which is compatible with our service.
  4. We do use some blocking methods to turn async calls into sync results inside some async RxJava chains. I wonder if this could cause an issue?

Best regards, Wen


XiaofeiCao commented 1 year ago

Hi wen,

Thanks for the clarification. Rxjava translators should be fine.

For 1, I don't think so, since our track1 SDK has been officially deprecated since March 2022. For 4, do you mean

chain.blockingGet()

or

chain.map(v -> {
    anotherChain.blockingGet();
    return v;
})

? The latter is not correct, since one shouldn't make blocking sync calls inside a chain. A code snippet would help us better understand your situation.
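To illustrate why blocking inside a chain is discouraged, here is a minimal sketch using only the JDK's CompletableFuture as a stand-in for an RxJava/Reactor chain. Everything in it (fetchAsync, the single-worker pool, the class name) is a hypothetical illustration, not Azure SDK or RxJava API. It assumes the callback and the inner call share one worker thread, which is the simplest way to make the stall visible.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class ComposeInsteadOfBlock {

    // Hypothetical stand-in for an async SDK call; not a real Azure API.
    static CompletableFuture<String> fetchAsync(String name, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> "value-of-" + name, pool);
    }

    // Discouraged: the callback blocks (join) on a second async call.
    // With a single worker, the inner task can never be scheduled.
    static CompletableFuture<String> blockingInside(ExecutorService pool) {
        return fetchAsync("a", pool)
                .thenApplyAsync(v -> v + "+" + fetchAsync("b", pool).join(), pool);
    }

    // Preferred: the second call is composed into the chain, so nothing blocks.
    static CompletableFuture<String> composed(ExecutorService pool) {
        return fetchAsync("a", pool)
                .thenCompose(v -> fetchAsync("b", pool).thenApply(w -> v + "+" + w));
    }

    // Single daemon worker, so a stuck callback cannot keep the JVM alive.
    static ExecutorService singleDaemonThread() {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "demo-worker");
            t.setDaemon(true);
            return t;
        });
    }

    // Waits up to one second; returns "TIMEOUT" if the chain never completes.
    static String await(CompletableFuture<String> future) {
        try {
            return future.get(1, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            return "TIMEOUT";
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("blocking inside: " + await(blockingInside(singleDaemonThread())));
        System.out.println("composed: " + await(composed(singleDaemonThread())));
    }
}
```

In this sketch the blocking variant stalls because the only worker is parked in join() while the task it waits for sits in the same queue. Real schedulers have more threads, so the effect may only show up under heavy load.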

Another thing: have you set a callTimeout on OkHttpClient (or OkHttpAsyncHttpClient)? We can't control the timeout exception since it comes directly from OkHttp itself. If this is the case, you could set the callTimeout to a higher value.

XiaofeiCao commented 1 year ago

Also, I saw from the stacktrace that there's a blockingGet in AzureResilienceInterceptorImpl:

com.vmware.horizon.sg.clouddriver.impl.azure.internal.interceptor.AzureResilienceInterceptorImpl.intercept(AzureResilienceInterceptorImpl.java:117) at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:87)

Usually it should be fine to make blocking calls in OkHttp interceptors. However, if you could share what you do with the blockingGet, it would help us better understand the situation. You could do that in my personal repo: https://github.com/XiaofeiCao/ioexception_repro, or email me if that's possible.

Further question, does this line always appear in the exception's stacktrace? Or does the exception happen somewhere else too? If so, could you share the stacktrace?

wangwenbj commented 1 year ago

Thanks Xiaofei,

For the blocking call example, your sample code is what I described in the thread. For the AzureResilienceInterceptorImpl, we have this:

response = chain.proceed(request); // Call Azure.

This is the expected REST call to Azure in the OkHttpClient interceptor. We did not use blocking calls here anyway. We didn't set any timeouts on the OkHttp client for the new Azure SDK. Correct me if I'm wrong, but I don't think we set it in the old one either. Any suggestions for the new implementation?

Also, I checked that the old SDK is still supported until the end of this month, according to https://azure.github.io/azure-sdk/releases/latest/all/java.html

Best regards, Wen


XiaofeiCao commented 1 year ago

@wangwenbj Could you show me how you set up your OkHttpClient please? Or do you leave it as default?

wangwenbj commented 1 year ago

Hi, Xiaofei,

private HttpClient buildHttpClient(CredentialAzure credentialAzure) {
    OkHttpClient.Builder okHttpClientBuilder = new OkHttpClient.Builder();
    Set<Interceptor> interceptors = azureRestClientConfig.getInterceptors();
    if (Objects.nonNull(interceptors) && !interceptors.isEmpty()) {
        for (Interceptor interceptor : interceptors) {
            okHttpClientBuilder.addInterceptor(interceptor);
        }
    }
    okHttpClientBuilder.addInterceptor(new HttpLoggingInterceptor().setLevel(HttpLoggingInterceptor.Level.BASIC))
            .addInterceptor(new DynamicThrottleInterceptor(credentialAzure.getClientId()))
            .addInterceptor(azureResilienceInterceptor);
    OkHttpClient okHttpClient = okHttpClientBuilder.build();
    OkHttpAsyncHttpClientBuilder builder = new OkHttpAsyncHttpClientBuilder(okHttpClient)
            .readTimeout(Duration.of(azureRestClientConfig.getReadTimeoutSecond(), ChronoUnit.SECONDS))
            .connectionTimeout(Duration.of(azureRestClientConfig.getConnectTimeoutSecond(), ChronoUnit.SECONDS));
    return builder.build();
}

Best regards, Wen


XiaofeiCao commented 1 year ago

Thanks @wangwenbj for your code snippet!

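I was able to reproduce your situation in my demo repo test (https://github.com/XiaofeiCao/ioexception_repro/blob/7a3c53c2990b1246a83096c11c43c01a7e39a7c4/src/test/java/com/azure/resourcemanager/repro/ioexception/test/undeliverable/CallTimeoutMockTests.java#L110).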

Exception in thread "Thread-11" reactor.core.Exceptions$ReactiveException: java.lang.InterruptedException
    at reactor.core.Exceptions.propagate(Exceptions.java:396)
    at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:91)
    at reactor.core.publisher.Mono.block(Mono.java:1742)
    at com.azure.resourcemanager.resources.implementation.DeploymentsClientImpl.checkExistence(DeploymentsClientImpl.java:7569)
    at com.azure.resourcemanager.resources.implementation.DeploymentsImpl.checkExistence(DeploymentsImpl.java:102)
    at com.azure.resourcemanager.repro.ioexception.test.undeliverable.CallTimeoutMockTests$1.run(CallTimeoutMockTests.java:129)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.InterruptedException
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1048)
    at java.base/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:230)
    at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:87)
    ... 5 more
java.io.IOException: Canceled
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at com.azure.resourcemanager.repro.ioexception.test.undeliverable.CallTimeoutMockTests.lambda$buildHttpClient$1(CallTimeoutMockTests.java:177)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

It's very similar to this issue (https://github.com/Azure/azure-sdk-for-java/issues/33829), in which the calling thread got interrupted.

The IOException: Canceled is logged in the OkHttpClient interceptor and is caused by the thread interruption. Now it's all about finding where this interruption occurred.
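The stack trace above shows the InterruptedException coming out of CountDownLatch.await, which is what the blocking caller waits on. The following JDK-only sketch shows that mechanism in isolation: a thread parked on a latch is interrupted and observes InterruptedException. The class and method names are hypothetical and nothing here touches Reactor or the Azure SDK.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

final class InterruptedWaitDemo {

    // Parks a worker on a latch that is never released, interrupts it,
    // and returns the simple class name of what the worker observed.
    static String interruptBlockedWaiter() {
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch neverReleased = new CountDownLatch(1);
        AtomicReference<String> observed = new AtomicReference<>("nothing");

        Thread waiter = new Thread(() -> {
            started.countDown();
            try {
                // Same kind of wait as in the trace: blocks until released or interrupted.
                neverReleased.await();
                observed.set("completed");
            } catch (InterruptedException e) {
                observed.set(e.getClass().getSimpleName());
            }
        }, "blocking-caller");

        waiter.start();
        try {
            started.await();    // make sure the worker is running before interrupting
            waiter.interrupt(); // stands in for whatever interrupted the real caller
            waiter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return observed.get();
    }

    public static void main(String[] args) {
        System.out.println("worker observed: " + interruptBlockedWaiter());
    }
}
```

The sketch only reproduces the symptom. It says nothing about which component interrupts the thread in the reported scenario, which is the open question in this thread.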

Does this error always occur on this line?

com.azure.resourcemanager.resources.implementation.DeploymentsClientImpl.checkExistence(DeploymentsClientImpl.java:7569)

wangwenbj commented 1 year ago

Hi Xiaofei,

It's very similar. I experience it at these calls:

  1. Create resource group
  2. Get NSG
  3. Create VM

We used the async methods and blocked outside the chain, e.g. azure.networkSecurityGroups().getByResourceGroupAsync(resourceGroupName, securityGroupName)

Do we have a workaround or fix for this kind of issue?

Best regards, Wen


XiaofeiCao commented 1 year ago

Hi @wangwenbj ,

Thanks for the information.

Sorry for not making my point clear. The demo above only simulated the error log. It's a guess at what actually happened.

I'm still trying to reproduce it under normal conditions.

XiaofeiCao commented 1 year ago

I've updated my real-time test to run 100 concurrent resource group creations and deletions.

I'll leave it running until the bug is reproduced.

Meanwhile, may I know what you do after DynamicThrottleInterceptor throws an Exception when the calculated quota delay is positive?

long delay = getQuotaDelay(requestMethod, requestUrl, clientId);
if (delay > 0) {
    throw new Exception();
}

weidongxu-microsoft commented 1 year ago

Let's make it simpler.

@XiaofeiCao , you already have the test running. Configure it to match the author's setup as closely as possible (same OkHttpClient config, same interceptor configuration, same scale, same AKS instance configuration if need be), and run it until we see the same problem.

If we reproduce it, we diagnose and fix it. If we don't see it, that does not prove there is no bug in the SDK, but at least it means a bug is unlikely.

The reason is that we apparently cannot get the code from the author's stress test, and even if we had it, it may contain too much code that does not belong to the SDK and could itself be a cause. We'd like to limit Xiaofei's reproduction to a relatively simple scenario with minimal non-SDK code, so that it focuses on reproducing an SDK bug.

@wangwenbj , if you think Xiaofei's test fails to reproduce the problem, please let him know what you'd like him to change. Both Xiaofei and I have our email addresses in our profiles, and you can email us anything you think might help diagnose the problem.

wangwenbj commented 1 year ago

Thanks, Weidong, Xiaofei,

I am trying to reproduce this issue on my side as well. I will keep you updated once I have reproduced it.

Best regards, Wen


ghost commented 1 year ago

Thank you for your feedback. This has been routed to the support team for assistance.

wangwenbj commented 1 year ago

@XiaofeiCao According to the Azure network support team, this issue seems to happen in the following sequence:

  1. We submit a request, e.g. create a resource group.
  2. The request succeeds within seconds.
  3. Using the new Azure SDK, we did not see any response for 20 minutes, and the call finally timed out on the client side.
  4. The Azure service got a client failure after 20 minutes and then refused the request.

[Screenshots: "Screenshot 2023-03-15 at 13 14 58", "Screenshot 2023-03-15 at 13 15 04"]

ghost commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @armleads-azure.

Issue Details
Author: wangwenbj
Assignees: XiaofeiCao
Labels: `question`, `ARM`, `Service Attention`, `Mgmt`, `customer-reported`, `pillar-reliability`, `needs-team-attention`
Milestone: -

navba-MSFT commented 1 year ago

@armleads-azure Could you please look into this? Thanks in advance.

CC @jennyhunter-msft @josephkwchan

XiaofeiCao commented 1 year ago

Thanks @wangwenbj , I was able to get the request log from your second screenshot. I believe it's a NetworkSecurityGroup query?

Strangely, the httpStatusCode is 404, which means the NSG is not deployed (or, less likely, the client sent the wrong URL)...

I don't know where to locate the log from your first screenshot. Are they targeting the same NetworkSecurityGroup?

XiaofeiCao commented 1 year ago

Hi @wangwenbj , would you try replacing the sync call

Single.fromCallable(() -> azureResourceManager.deployments().checkExistence(resourceGroupName, nsgName))

with the async one below, and see if the exception is thrown again?

azureResourceManager.deployments().manager().serviceClient().getDeployments().checkExistenceAsync(resourceGroupName, nsgName)

Also, avoid any sync HTTP calls in a Reactor/RxJava chain, like the one in the first code snippet (checkExistence is implemented as checkExistenceAsync().block()). I tried it in my repo and it got stuck: https://github.com/XiaofeiCao/ioexception_repro/blob/5db3fbcb4c6b03196d0b56f8555c9fa7849210b7/src/test/java/com/azure/resourcemanager/repro/ioexception/test/undeliverable/BatchCreateResourceGroupTests.java#L107

wangwenbj commented 1 year ago

Thanks, Xiaofei,

We have an Rx wrapper that acts as throttle control to prevent Azure 429 responses, with our own retry mechanism. That's why we use Single.fromCallable() and then block. This also works for all the async operations.

I could try what you suggested in my own environment, but I wonder if we could handle this with the existing Rx sync -> async flow?

Best regards, Wen


XiaofeiCao commented 1 year ago

I see.

Sync -> async is tricky in this case. If you are making a simple sync call with no IO involved, e.g. reading properties of a model's innerModel, you can safely do that.

But if IO operations are involved, I think you should always avoid it. In this case, for example, checkDeploymentExists() is implemented as checkDeploymentExistsAsync().block(), which involves an HTTP call. You should always use the async variant where one is available.

Unfortunately, in this case we don't provide an async variant in the convenience layer. You could use the serviceClient-level code instead, which is
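To illustrate why a blocking call inside a cancellable task is fragile, here is a minimal sketch that uses only java.util.concurrent. It is not RxJava or Azure SDK code: Thread.sleep stands in for the blocking HTTP call, Future.cancel(true) stands in for disposing the subscription, and the class and method names are hypothetical. The likely parallel to the reported stack trace is that the blocked thread is interrupted after the consumer has already gone away, so the InterruptedException has nowhere to be delivered.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class BlockingCancelSketch {

    /** Starts a blocking task, cancels it mid-flight, and returns what the task observed. */
    static String runAndCancel() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicReference<String> observed = new AtomicReference<>("none");

        Future<?> task = worker.submit(() -> {
            started.countDown();
            try {
                // Stand-in (assumption) for a blocking HTTP call such as someCallAsync().block().
                Thread.sleep(60_000);
                observed.set("completed");
            } catch (InterruptedException e) {
                // The caller has already cancelled, so nobody is left to receive this error.
                observed.set("interrupted");
            } finally {
                finished.countDown();
            }
        });

        try {
            started.await();                        // wait until the task is actually running
            task.cancel(true);                      // stand-in for dispose(): interrupts the worker
            finished.await(5, TimeUnit.SECONDS);    // wait for the task to record what it saw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            worker.shutdownNow();
        }
        return observed.get();
    }

    public static void main(String[] args) {
        System.out.println(runAndCancel());
    }
}
```

With a non-blocking call there is no parked thread to interrupt; cancellation simply propagates upstream through the chain.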

azureResourceManager.deployments().manager().serviceClient().getDeployments().checkExistenceAsync(resourceGroupName, nsgName)

Then wrap it using Single.fromPublisher.
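As a rough picture of what bridging a publisher to a single-value type without blocking looks like, here is a stdlib-only sketch using java.util.concurrent.Flow and CompletableFuture. It is an analogy and not the RxJava implementation: firstItem and fakeCheckExistenceAsync are hypothetical names, and the fake publisher only stands in for the Mono returned by checkExistenceAsync.

```java
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Flow;

public class PublisherBridgeSketch {

    /** Adapts a publisher to a single-value future without blocking any thread. */
    static <T> CompletableFuture<T> firstItem(Flow.Publisher<T> publisher) {
        CompletableFuture<T> result = new CompletableFuture<>();
        publisher.subscribe(new Flow.Subscriber<T>() {
            private Flow.Subscription subscription;

            @Override public void onSubscribe(Flow.Subscription s) {
                subscription = s;
                s.request(1); // ask for exactly one item
            }
            @Override public void onNext(T item) {
                result.complete(item);
                subscription.cancel();
            }
            @Override public void onError(Throwable t) {
                result.completeExceptionally(t);
            }
            @Override public void onComplete() {
                // Has no effect if a value already completed the future.
                result.completeExceptionally(new NoSuchElementException("publisher was empty"));
            }
        });
        return result;
    }

    /** Hypothetical stand-in for an async existence check: emits one value, then completes. */
    static Flow.Publisher<Boolean> fakeCheckExistenceAsync(boolean exists) {
        return subscriber -> subscriber.onSubscribe(new Flow.Subscription() {
            private boolean done;

            @Override public void request(long n) {
                if (done || n <= 0) return;
                done = true;
                subscriber.onNext(exists);
                subscriber.onComplete();
            }
            @Override public void cancel() { done = true; }
        });
    }

    public static void main(String[] args) {
        firstItem(fakeCheckExistenceAsync(true))
                .thenAccept(exists -> System.out.println("exists = " + exists));
    }
}
```

The point of the sketch is that the result is delivered by callback, so no scheduler thread sits inside a blocking call waiting for the HTTP response.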

XiaofeiCao commented 1 year ago

Hi, does the issue still persist?

You could also try

Single.fromCallable(() ->
        azureResourceManager
                .deployments()
                .checkExistence(resourceGroupName, nsgName))
        .subscribeOn(Schedulers.io())