Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License
2.36k stars 2k forks source link

Error log from compute client sdk #40081

Open mnsc opened 6 months ago

mnsc commented 6 months ago

Query/Question We are seeing error logs in production when using the resourcemanager sdk (compute client sdk) that is hard to trace back to what we are doing. It does not seem to affect our code so the sdk must do some retry or something to get this working? The error seem to indicate that some endpoint(?) returns unexpected html (white lable error?).

The log row in question with labels from grafana (our own labels redacted)

level   
ERROR

logger  
com.azure.core.implementation.MethodHandleReflectiveInvoker

message 
Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number (or 'NaN'/'INF'/'+INF'), Array, Object or token 'null', 'true' or 'false') at [Source: (byte[])"<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'><html xmlns='http://www.w3.org/1999/xhtml'><head><meta content='text/html; charset=utf-8' http-equiv='content-type'/><style type='text/css'>body {font-family:Arial; margin-left:40px; }img { border:0 none; }#content { margin-left: auto; margin-right: auto }#message h2 { font-size: 20px; font-weight: normal; color: #000000; margin: 34px 0px 0px 0px }#message p { font-size: 13p"[truncated 488 bytes]; line: 1, column: 2]

stacktrace  

thread
reactor-http-epoll-1

So the question is if this is expected and if so are there any documentation on what different log levels mean?

Why is this not a Bug or a feature Request? It's not a bug because it's about log levels and I don't think "too noisy" is a bug.

Setup (please complete the following information if applicable):

+--- com.azure.resourcemanager:azure-resourcemanager-compute:2.38.0
|    +--- com.azure.resourcemanager:azure-resourcemanager-resources:2.38.0
|    |    +--- com.azure:azure-core:1.48.0 -> 1.49.0 (*)
|    |    \--- com.azure:azure-core-management:1.13.0 (*)
|    +--- com.azure.resourcemanager:azure-resourcemanager-storage:2.38.0
|    |    +--- com.azure.resourcemanager:azure-resourcemanager-resources:2.38.0 (*)
|    |    +--- com.azure.resourcemanager:azure-resourcemanager-msi:2.38.0
|    |    |    +--- com.azure.resourcemanager:azure-resourcemanager-resources:2.38.0 (*)
|    |    |    \--- com.azure.resourcemanager:azure-resourcemanager-authorization:2.38.0
|    |    |         \--- com.azure.resourcemanager:azure-resourcemanager-resources:2.38.0 (*)
|    |    \--- com.azure.resourcemanager:azure-resourcemanager-authorization:2.38.0 (*)
|    +--- com.azure.resourcemanager:azure-resourcemanager-network:2.38.0
|    |    \--- com.azure.resourcemanager:azure-resourcemanager-resources:2.38.0 (*)
|    +--- com.azure.resourcemanager:azure-resourcemanager-msi:2.38.0 (*)
|    \--- com.azure.resourcemanager:azure-resourcemanager-authorization:2.38.0 (*)

Information Checklist Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report

github-actions[bot] commented 6 months ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

weidongxu-microsoft commented 6 months ago

You may enable the HTTP logging https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/resourcemanager#enabling-logging

It may help find which API having this probelm.

The error looks like some response is expected to be application/json, but actually is HTML. I didn't remember API in compute would behave like this. However, if auth fails (REST API called with invalid AUTH token etc.), management.azure.com may give HTML response.

github-actions[bot] commented 6 months ago

Hi @mnsc. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

mnsc commented 6 months ago

You may enable the HTTP logging

Looking at your link, do you mean the part about adding this to the resource manager?

.withLogLevel(HttpLogDetailLevel.BODY_AND_HEADERS)

If so, do you think this spring application config would achieve the same result?

spring:
  cloud:
    azure:
      client:
        http:
          logging:
            level: body_and_headers
weidongxu-microsoft commented 6 months ago

You may enable the HTTP logging

Looking at your link, do you mean the part about adding this to the resource manager?

.withLogLevel(HttpLogDetailLevel.BODY_AND_HEADERS)

If so, do you think this spring application config would achieve the same result?

spring:
  cloud:
    azure:
      client:
        http:
          logging:
            level: body_and_headers

My guess is this won't work.

On this configure, I assume spring would use reflection to set the values. I don't think the AzureResourceManager class would take that.

mnsc commented 6 months ago

So we started logging body+headers in our dev environment and after only a couple of days we caught a live one. These are the three relevant log rows with "level logger thread \n message".

INFO    com.azure.resourcemanager.compute.implementation.VirtualMachineScaleSetsClientImpl$VirtualMachineScaleSetsService.getByResourceGroup    scheduling-1    
{"az.sdk.message":"HTTP request","method":"GET","url":"https://management.azure.com/subscriptions/00000000-SUBS-CRIP-TION-ID0000000000/resourceGroups/resourcegroup-redacted-dev/providers/Microsoft.Compute/virtualMachineScaleSets/dev-vmss-name?api-version=2024-03-01","tryCount":1,"Date":"Tue, 04 Jun 2024 20:35:59 GMT","x-ms-return-client-request-id":"true","Content-Type":"application/json","x-ms-client-request-id":"8967f0c0-4559-47a6-80d1-5c6d02c91a74","Accept":"application/json","User-Agent":"azsdk-java-com.azure.resourcemanager.compute/2.39.0 (21.0.2; Linux; 6.5.0-1021-azure)","redactedHeaders":"Authorization","content-length":0}

INFO    com.azure.resourcemanager.compute.implementation.VirtualMachineScaleSetsClientImpl$VirtualMachineScaleSetsService.getByResourceGroup    reactor-http-epoll-4    
{"az.sdk.message":"HTTP response","statusCode":503,"url":"https://management.azure.com/subscriptions/00000000-SUBS-CRIP-TION-ID0000000000/resourceGroups/resourcegroup-redacted-dev/providers/Microsoft.Compute/virtualMachineScaleSets/dev-vmss-name?api-version=2024-03-01","durationMs":2507,"content-length":988,"Date":"Tue, 04 Jun 2024 20:36:01 GMT","Content-Type":"text/html","Cache-Control":"no-store","redactedHeaders":"X-Azure-ExternalError,X-MSEdge-Ref","Date":"Tue, 04 Jun 2024 20:36:01 GMT","Content-Type":"text/html","Cache-Control":"no-store","redactedHeaders":"X-Azure-ExternalError,X-MSEdge-Ref","content-length":988,"body":"<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'><html xmlns='http://www.w3.org/1999/xhtml'><head><meta content='text/html; charset=utf-8' http-equiv='content-type'/><style type='text/css'>body {font-family:Arial; margin-left:40px; }img  { border:0 none; }#content { margin-left: auto; margin-right: auto }#message h2 { font-size: 20px; font-weight: normal; color: #000000; margin: 34px 0px 0px 0px }#message p  { font-size: 13px; color: #000000; margin: 7px 0px 0px0px}#errorref { font-size: 11px; color: #737373; margin-top: 41px }</style><title>AzureResourceManager</title></head><body><div id='content'><div id='message'><h2>Our services aren't available right now</h2><p>We're working to restore all services as soon as possible. Please check back soon.</p></div><div id='errorref'><span>Ref A: CC1883B90DF94E9992AA02E71296656D Ref B: AMS231020512039 Ref C: 2024-06-04T20:35:59Z</span></div></div></body></html>"}

ERROR   com.azure.core.implementation.MethodHandleReflectiveInvoker reactor-http-epoll-4    
Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number (or 'NaN'/'INF'/'+INF'), Array, Object or token 'null', 'true' or 'false') at [Source: (byte[])"<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'><html xmlns='http://www.w3.org/1999/xhtml'><head><meta content='text/html; charset=utf-8' http-equiv='content-type'/><style type='text/css'>body {font-family:Arial; margin-left:40px; }img  { border:0 none; }#content { margin-left: auto; margin-right: auto }#message h2 { font-size: 20px; font-weight: normal; color: #000000; margin: 34px 0px 0px 0px }#message p  { font-size: 13p"[truncated 488 bytes]; line: 1, column: 2]

Looks like the scaleset client in the compute lib is not handling a statusCode 503 properly and tries to parse the body as json. But it could also be the api that should respond with a json body on 503.

weidongxu-microsoft commented 6 months ago

Thanks for the log, @mnsc

I think 503 (most of the 5xx) would be retried by the default RetryPolicy of SDK. https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/core/azure-core/src/main/java/com/azure/core/http/policy/RetryStrategy.java#L50-L52

I will double check whether the HTML (that result in JSON parse error) would affect that retry logic.

weidongxu-microsoft commented 5 months ago

I've made local test with mock 503.

I can see the RetryPolicy works as expected. It would retry for 3 times.

mnsc commented 5 months ago

Does the mock 503 have the same html-body?

weidongxu-microsoft commented 5 months ago

I tested with a short html <!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'> (content-type be text/html, same as yours) -- anyway that's not a valid JSON and de-serialization will throw exception

I can see similar logs as yours {"az.sdk.message":"HTTP response","statusCode":503, following with error. Client would continue retry for 3 more times on this API call (as the default RetryPolicy behavior).


Other than the error logs, does it cause other problem in your application?

As a SDK/client, it is hard to exclude the possibility of 503 from backend. I think as long as client does do the retry, it should be correct on SDK side. Though it may still be necessary for you to code the protection against all retry fails (the default retry delay would be 1s + 2s + 4s).

If server 503 causes problem to your application, you may raise a support ticket to have the issue report to Azure service directly.

mnsc commented 5 months ago

It doesn't cause a direct problem in our application. But it introduces noise in our logging. We would like that ERROR logs indicates some issue that suggests we maybe would want to change something, add an exception handler or add/fix som config.

In this case I think it is a bug in the sdk. I expect the sdk to handle backend unavailable 503 with retries and be able to silently handle unexpected(?) html in the body. If the client need to read som json in the body of the 503 in order to do this, and the backend doesn't send expected body, then maybe it would be considered a bug in the backend?

weidongxu-microsoft commented 5 months ago

Yes, I think it makes sense. If SDK is still retrying, it should not log an ERROR.

Do you have a call stack in log?

I am getting problem when trying to pinpoint which part of the azure-core code log this ERROR on com.azure.core.implementation.MethodHandleReflectiveInvoker

I assume this com.azure.core.implementation.MethodHandleReflectiveInvoker is from %c of log4j2 https://logging.apache.org/log4j/1.x/apidocs/org/apache/log4j/PatternLayout.html

In my testing (with mock endpoint), I actually cannot find this particular ERROR log. All I see is this:

12:05:54.507 INFO  [main] com.azure.resourcemanager.compute.implementation.VirtualMachineScaleSetsClientImpl$VirtualMachineScaleSetsService.getByResourceGroup - {"az.sdk.message":"HTTP request","method":"GET","url":"http://localhost:3000/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/vm?api-version=2024-03-01","tryCount":1,"Date":"Wed, 19 Jun 2024 04:05:54 GMT","Content-Type":"application/json","x-ms-client-request-id":"bbae5430-8039-4512-952b-bcac438579f0","Accept":"application/json","User-Agent":"azsdk-java","content-length":0}
12:05:54.825 INFO  [reactor-http-nio-1] com.azure.resourcemanager.compute.implementation.VirtualMachineScaleSetsClientImpl$VirtualMachineScaleSetsService.getByResourceGroup - {"az.sdk.message":"HTTP response","statusCode":503,"url":"http://localhost:3000/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/vm?api-version=2024-03-01","durationMs":317,"content-length":988,"Date":"Wed, 19 Jun 2024 04:05:54 GMT","Connection":"keep-alive","ETag":"W/\"3dc-NT0UQORPQdt00AAWSDDAFLRG+I0\"","Content-Type":"text/html","redactedHeaders":"Keep-Alive,X-Powered-By","content-length":988,"body":"<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'><html xmlns='http://www.w3.org/1999/xhtml'><head><meta content='text/html; charset=utf-8' http-equiv='content-type'/><style type='text/css'>body {font-family:Arial; margin-left:40px; }img  { border:0 none; }#content { margin-left: auto; margin-right: auto }#message h2 { font-size: 20px; font-weight: normal; color: #000000; margin: 34px 0px 0px 0px }#message p  { font-size: 13px; color: #000000; margin: 7px 0px 0px0px}#errorref { font-size: 11px; color: #737373; margin-top: 41px }</style><title>AzureResourceManager</title></head><body><div id='content'><div id='message'><h2>Our services aren't available right now</h2><p>We're working to restore all services as soon as possible. Please check back soon.</p></div><div id='errorref'><span>Ref A: CC1883B90DF94E9992AA02E71296656D Ref B: AMS231020512039 Ref C: 2024-06-04T20:35:59Z</span></div></div></body></html>"}
12:05:54.826 DEBUG [reactor-http-nio-1] com.azure.core.http.policy.RetryPolicy - {"az.sdk.message":"Retrying.","tryCount":0,"durationMs":823}
12:05:54.826 DEBUG [reactor-http-nio-1] com.azure.core.http.policy.RetryPolicy - {"az.sdk.message":"Retrying.","tryCount":0,"durationMs":823}
12:05:55.671 INFO  [parallel-1] com.azure.resourcemanager.compute.implementation.VirtualMachineScaleSetsClientImpl$VirtualMachineScaleSetsService.getByResourceGroup - {"az.sdk.message":"HTTP request","method":"GET","url":"http://localhost:3000/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/vm?api-version=2024-03-01","tryCount":2,"Date":"Wed, 19 Jun 2024 04:05:55 GMT","Content-Type":"application/json","x-ms-client-request-id":"bbae5430-8039-4512-952b-bcac438579f0","Accept":"application/json","User-Agent":"azsdk-java","content-length":0}
12:05:55.680 INFO  [reactor-http-nio-1] com.azure.resourcemanager.compute.implementation.VirtualMachineScaleSetsClientImpl$VirtualMachineScaleSetsService.getByResourceGroup - {"az.sdk.message":"HTTP response","statusCode":503,"url":"http://localhost:3000/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/vm?api-version=2024-03-01","durationMs":9,"content-length":988,"Date":"Wed, 19 Jun 2024 04:05:55 GMT","Connection":"keep-alive","ETag":"W/\"3dc-NT0UQORPQdt00AAWSDDAFLRG+I0\"","Content-Type":"text/html","redactedHeaders":"Keep-Alive,X-Powered-By","content-length":988,"body":"<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'><html xmlns='http://www.w3.org/1999/xhtml'><head><meta content='text/html; charset=utf-8' http-equiv='content-type'/><style type='text/css'>body {font-family:Arial; margin-left:40px; }img  { border:0 none; }#content { margin-left: auto; margin-right: auto }#message h2 { font-size: 20px; font-weight: normal; color: #000000; margin: 34px 0px 0px 0px }#message p  { font-size: 13px; color: #000000; margin: 7px 0px 0px0px}#errorref { font-size: 11px; color: #737373; margin-top: 41px }</style><title>AzureResourceManager</title></head><body><div id='content'><div id='message'><h2>Our services aren't available right now</h2><p>We're working to restore all services as soon as possible. Please check back soon.</p></div><div id='errorref'><span>Ref A: CC1883B90DF94E9992AA02E71296656D Ref B: AMS231020512039 Ref C: 2024-06-04T20:35:59Z</span></div></div></body></html>"}
12:05:55.681 DEBUG [reactor-http-nio-1] com.azure.core.http.policy.RetryPolicy - {"az.sdk.message":"Retrying.","tryCount":1,"durationMs":1647}
12:05:55.681 DEBUG [reactor-http-nio-1] com.azure.core.http.policy.RetryPolicy - {"az.sdk.message":"Retrying.","tryCount":1,"durationMs":1647}
12:05:57.336 INFO  [parallel-2] com.azure.resourcemanager.compute.implementation.VirtualMachineScaleSetsClientImpl$VirtualMachineScaleSetsService.getByResourceGroup - {"az.sdk.message":"HTTP request","method":"GET","url":"http://localhost:3000/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/vm?api-version=2024-03-01","tryCount":3,"Date":"Wed, 19 Jun 2024 04:05:57 GMT","Content-Type":"application/json","x-ms-client-request-id":"bbae5430-8039-4512-952b-bcac438579f0","Accept":"application/json","User-Agent":"azsdk-java","content-length":0}
12:05:57.341 INFO  [reactor-http-nio-1] com.azure.resourcemanager.compute.implementation.VirtualMachineScaleSetsClientImpl$VirtualMachineScaleSetsService.getByResourceGroup - {"az.sdk.message":"HTTP response","statusCode":503,"url":"http://localhost:3000/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/vm?api-version=2024-03-01","durationMs":4,"content-length":988,"Date":"Wed, 19 Jun 2024 04:05:57 GMT","Connection":"keep-alive","ETag":"W/\"3dc-NT0UQORPQdt00AAWSDDAFLRG+I0\"","Content-Type":"text/html","redactedHeaders":"Keep-Alive,X-Powered-By","content-length":988,"body":"<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'><html xmlns='http://www.w3.org/1999/xhtml'><head><meta content='text/html; charset=utf-8' http-equiv='content-type'/><style type='text/css'>body {font-family:Arial; margin-left:40px; }img  { border:0 none; }#content { margin-left: auto; margin-right: auto }#message h2 { font-size: 20px; font-weight: normal; color: #000000; margin: 34px 0px 0px 0px }#message p  { font-size: 13px; color: #000000; margin: 7px 0px 0px0px}#errorref { font-size: 11px; color: #737373; margin-top: 41px }</style><title>AzureResourceManager</title></head><body><div id='content'><div id='message'><h2>Our services aren't available right now</h2><p>We're working to restore all services as soon as possible. Please check back soon.</p></div><div id='errorref'><span>Ref A: CC1883B90DF94E9992AA02E71296656D Ref B: AMS231020512039 Ref C: 2024-06-04T20:35:59Z</span></div></div></body></html>"}
12:05:57.341 DEBUG [reactor-http-nio-1] com.azure.core.http.policy.RetryPolicy - {"az.sdk.message":"Retrying.","tryCount":2,"durationMs":3053}
12:05:57.341 DEBUG [reactor-http-nio-1] com.azure.core.http.policy.RetryPolicy - {"az.sdk.message":"Retrying.","tryCount":2,"durationMs":3053}
12:06:00.402 INFO  [parallel-3] com.azure.resourcemanager.compute.implementation.VirtualMachineScaleSetsClientImpl$VirtualMachineScaleSetsService.getByResourceGroup - {"az.sdk.message":"HTTP request","method":"GET","url":"http://localhost:3000/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/vm?api-version=2024-03-01","tryCount":4,"Date":"Wed, 19 Jun 2024 04:06:00 GMT","Content-Type":"application/json","x-ms-client-request-id":"bbae5430-8039-4512-952b-bcac438579f0","Accept":"application/json","User-Agent":"azsdk-java","content-length":0}
12:06:00.409 INFO  [reactor-http-nio-1] com.azure.resourcemanager.compute.implementation.VirtualMachineScaleSetsClientImpl$VirtualMachineScaleSetsService.getByResourceGroup - {"az.sdk.message":"HTTP response","statusCode":503,"url":"http://localhost:3000/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/vm?api-version=2024-03-01","durationMs":6,"content-length":988,"Date":"Wed, 19 Jun 2024 04:06:00 GMT","Connection":"keep-alive","ETag":"W/\"3dc-NT0UQORPQdt00AAWSDDAFLRG+I0\"","Content-Type":"text/html","redactedHeaders":"Keep-Alive,X-Powered-By","content-length":988,"body":"<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'><html xmlns='http://www.w3.org/1999/xhtml'><head><meta content='text/html; charset=utf-8' http-equiv='content-type'/><style type='text/css'>body {font-family:Arial; margin-left:40px; }img  { border:0 none; }#content { margin-left: auto; margin-right: auto }#message h2 { font-size: 20px; font-weight: normal; color: #000000; margin: 34px 0px 0px 0px }#message p  { font-size: 13px; color: #000000; margin: 7px 0px 0px0px}#errorref { font-size: 11px; color: #737373; margin-top: 41px }</style><title>AzureResourceManager</title></head><body><div id='content'><div id='message'><h2>Our services aren't available right now</h2><p>We're working to restore all services as soon as possible. Please check back soon.</p></div><div id='errorref'><span>Ref A: CC1883B90DF94E9992AA02E71296656D Ref B: AMS231020512039 Ref C: 2024-06-04T20:35:59Z</span></div></div></body></html>"}
12:06:00.410 INFO  [reactor-http-nio-1] com.azure.core.http.policy.RetryPolicy - {"az.sdk.message":"Retry attempts have been exhausted.","tryCount":3}
12:06:00.410 INFO  [reactor-http-nio-1] com.azure.core.http.policy.RetryPolicy - {"az.sdk.message":"Retry attempts have been exhausted.","tryCount":3}
12:06:00.426 WARN  [reactor-http-nio-1] com.azure.core.implementation.serializer.HttpResponseBodyDecoder - Failed to deserialize the error entity.
Unsupported text Content-Type Type: class com.azure.resourcemanager.compute.models.ApiError

There is only 1 WARN, and it only happen after all retry count exausted.