Open barryku opened 5 years ago
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @rthorn17
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @rthorn17
Hi again @barryku similar to the report in the other repo, I'm labeling this to help the team find it.
It's getting worse, and would really appreciate some help. We noticed big jump in mid-July, and have seen tons of errors each week. We are calling these APIs hundreds of times each hour, and here are numbers of errors we encountered each week since April,
2019-04-01 | 2 2019-04-08 | 4 2019-04-15 | 3 2019-04-22 | 32 2019-04-29 | 15 2019-05-06 | 2 2019-05-13 | 11 2019-05-20 | 19 2019-05-27 | 34 2019-06-03 | 98 2019-06-10 | 35 2019-06-17 | 7 2019-06-24 | 88 2019-07-01 | 107 2019-07-08 | 272 2019-07-15 | 478 2019-07-22 | 502 2019-07-29 | 399 2019-08-05 | 892
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @mjconnection
Hi @barryku a few questions as we try and figure this out.
Hello @barryku
If possible could you add some log output related to the timeout? Sample code:
Azure azure = Azure.configure()
.withLogLevel(LogLevel.BODY_AND_HEADERS)
...
Also if possible could you describe concurrency/theading model in the project?
I have checked 1.12.0 code. I think the default timeout is 60 seconds (configure on okhttp3, while it might have a connection pool to reuse socket). However it appears that it is hard to configure it from client.
Meanwhile I will be querying Azure.virtualMachines().getByResourceGroup() on my azure resource (about 1 query per second). So far I haven't seen SocketTimeoutException.
To configure readtimeout in azure client use withReadTimeout
setter:
Azure.configure()
.withReadTimeout(120, TimeUnit.SECONDS)
@loarabia no, we have not had significant changes during the time. Neither did changes to the libraries we use. Not sure what kind of statistics you are looking for. It's our backup solution, so it goes through customer instances and do snapshot. The same is done hundreds of times each hour, and it works fine the majority of times. However, the number of exceptions have caused lots of pain to the customers since they will have to manually do the backup for the failed ones.
We send logs to Splunk, and these error counts are from Splunk query. Here's the stats of last 3 days
2019-08-11 | 96 2019-08-12 | 136 2019-08-13 | 122
Hourly stats from yesterday
2019-08-13 00:00 | 7 2019-08-13 01:00 | 34 2019-08-13 02:00 | 0 2019-08-13 03:00 | 8 2019-08-13 04:00 | 2 2019-08-13 05:00 | 11 2019-08-13 06:00 | 2 2019-08-13 07:00 | 6 2019-08-13 08:00 | 6 2019-08-13 09:00 | 0 2019-08-13 10:00 | 6 2019-08-13 11:00 | 0 2019-08-13 12:00 | 0 2019-08-13 13:00 | 1 2019-08-13 14:00 | 0 2019-08-13 15:00 | 6 2019-08-13 16:00 | 1 2019-08-13 17:00 | 12 2019-08-13 18:00 | 4 2019-08-13 19:00 | 0 2019-08-13 20:00 | 3 2019-08-13 21:00 | 6 2019-08-13 22:00 | 3 2019-08-13 23:00 | 4
Hi @barryku a few questions as we try and figure this out.
- There appears to be a jump around the beginning of June and then another around the beginning of August. Did your team happen to change any dependencies or do any larger code changes around this time period?
- Along the same line as the dependencies did any of the Azure SDKs in these projects get revved?
- You've offered a few examples of API calls that you've seen this on do you have more verbose statistics for the calls?
To configure readtimeout in azure client use
withReadTimeout
setter:Azure.configure() .withReadTimeout(120, TimeUnit.SECONDS)
Thank you. I did find the API doing that, but just not sure if it will help my situation. It also seems the timeout is about 1 min regardless of what I set it to. I disabled wifi to get the timeout, and didn't find it changed much even changing both connection and read timeout.
I am adding retry logic for now, but couldn't figure out a way to induce "socket closed" exception to verify the logic. I am on Mac, wondering if anyone knows how to do it.
Hello @barryku
If possible could you add some log output related to the timeout? Sample code:
Azure azure = Azure.configure() .withLogLevel(LogLevel.BODY_AND_HEADERS) ...
Also if possible could you describe concurrency/theading model in the project?
I have checked 1.12.0 code. I think the default timeout is 60 seconds (configure on okhttp3, while it might have a connection pool to reuse socket). However it appears that it is hard to configure it from client.
Meanwhile I will be querying Azure.virtualMachines().getByResourceGroup() on my azure resource (about 1 query per second). So far I haven't seen SocketTimeoutException.
It's our production server that I can't change these liberally. I also worked with Auzre support, and set up tdpdump on our stage server for their investigation. Unfortunately, stage server is not used much, only couple of errors in a week or two. We did captured one, but that didn't yield anything useful. You brought a good point though. I will suggest my team to externalize the Azure log setting, so it can be adjusted on-demand in the future.
Hi @barryku
I still have no luck to reproduce on my local.
Here are some tips for mock the timeout (say, for test retry logic on dev):
MockWebServer server = new MockWebServer();
server.enqueue(
new MockResponse().setBody("throttled response")
.throttleBody(10, 1, TimeUnit.MINUTES)
);
server.start();
HttpUrl baseUrl = server.url("");
System.out.println("mock server url: " + baseUrl.toString());
ApplicationTokenCredentials credentials = ApplicationTokenCredentials.fromFile(new File(CRED_FILE));
Map<String, String> endpoints = credentials.environment().endpoints();
endpoints.put(AzureEnvironment.Endpoint.RESOURCE_MANAGER.identifier(),
endpoints.get(AzureEnvironment.Endpoint.RESOURCE_MANAGER.identifier()).replace("https://management.azure.com", baseUrl.toString()));
System.out.println("azure endpoints: " + endpoints);
Azure azure = Azure.configure()
.withLogLevel(LogLevel.BODY_AND_HEADERS)
.authenticate(credentials)
.withDefaultSubscription();
VirtualMachine vm1 = azure.virtualMachines().getByResourceGroup(GROUP_NAME, VM_NAME);
MockWebServer is com.squareup.okhttp3:mockwebserver:3.4.2 Basically the code just replace azure resource management base url with the localhost under MockWebServer, which serves a really slow response.
This code only serves for dev testing. And it is not same as timeout in real network.
For retry logic, Fluent does not have in-house support for timeout exception. One possible reason could be that it be subtle to replay PUT/POST/PATCH since server might receive them more than once (GET should be fine).
Client could put in Interceptor to customize timeout retry logic. Here is a very crude sample Interceptor (which retry for GET upto max 3 times):
Azure azure = Azure.configure()
.withLogLevel(LogLevel.BODY_AND_HEADERS)
.withInterceptor(new Interceptor() {
@Override
public Response intercept(Chain chain) throws IOException {
final int MAX_RETRY = 3;
Response response = null;
int retryCount = 0;
while (response == null) {
try {
response = chain.proceed(chain.request());
} catch (SocketTimeoutException ste) {
if (++retryCount <= MAX_RETRY && "GET".equalsIgnoreCase(chain.request().method())) {
// log and retry
System.out.println("retry count " + retryCount + ", on socket timeout: " + ste);
} else {
throw ste;
}
}
}
return response;
}
})
.authenticate(credentials)
.withDefaultSubscription();
Again the code servers as sample. Do not use it directly in production.
Removing "Service Attention" label because this does not appear to be a backend service issue.
@barryku do you still experience this issue?
@yaohaizh , we added retry logic, and it's working fine so far. The exception is probably still happening.
Removing compute because this appears to be independent of compute.
Query/Question We are seeing more timeout exceptions calling various Azure APIs these days. All of them failed with socket closed then timeout exception. What can be the cause of these issues? Will increase read timeout in Azure client help?
The two most common API calls that failed are Azure.virtualMachines().getByResourceGroup() and Azure.managementLocks().deleteByIds(). We are using 1.12.0 SDK. Here is one example stack trace,
Setup (please complete the following information if applicable):
Version of the Library used: 1.12.0
Information Checklist Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report