Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.

[BUG] Making EventHub/ServiceBus IT more stable #31355

Closed fangjian0423 closed 1 year ago

fangjian0423 commented 1 year ago

Context

The pipeline IT jobs fail frequently; see the jobs in run #20221004.1, run #20221007.1 and run #20220929.1.

It seems both ITs failed because access control was not active.

com.azure.storage.blob.models.BlobStorageException: If you are using a StorageSharedKeyCredential, and the server returned an error message that says 'Signature did not match' ..
reactor.core.Exceptions$ErrorCallbackNotImplemented: com.azure.messaging.servicebus.ServiceBusException: Unauthorized access. 'Send' claim(s) are required to perform this operation. Resource: 'sb://***.***/topic1'. TrackingId:3ff7c86d4b574b55b8a2f5ab7653f5ff_G26, SystemTracker:gateway7, Timestamp:2022-10-04T15:28:18
com.azure.messaging.servicebus.ServiceBusException: Unauthorized access. 'Listen' claim(s) are required to perform this operation. Resource: 'sb://***.***/topic1/subscriptions/topicsub'. TrackingId:b416725b32a64130b324c61e81f86f4f_G28, SystemTracker:gateway7, Timestamp:2022-09-29T07:24:16, errorContext[NAMESPACE: ***.***. ERROR CONTEXT: N/A, PATH: topic1/subscriptions/topicSub, REFERENCE_ID: topic1/subscriptions/topicSub_df63ff_1664436255091, LINK_CREDIT: 1]

Rerunning the ITs may succeed, but I don't think that is the best approach. We need to investigate why they keep failing.

Goal

Make the java-spring-test pipeline jobs more stable.

fangjian0423 commented 1 year ago

A produce-error scenario was added to the ITs before: https://github.com/Azure/azure-sdk-for-java/issues/31109

fangjian0423 commented 1 year ago

Access control scenarios

  1. Event Hubs: Storage ✅, Data ❌

Message production fails. We can get the error message from the error MessageChannel and add retry configuration items.

  2. Event Hubs: Storage ❌, Data ✅

Message production succeeds but consumption fails, and we can't get the error message from the error MessageChannel.

Message consumption is retried automatically by a ScheduledExecutorService. In addition, we should change the partition start position (otherwise we can't receive the messages produced earlier).

  3. Event Hubs: Storage ❌, Data ❌

Similar to scenarios 1 and 2.

  4. Event Hubs: Storage ✅, Data ✅

Event Hubs cases run successfully.

  5. Service Bus: Data ❌

Similar to scenario 1. We can add retry configuration items.

  6. Service Bus: Data ✅

Service Bus cases run successfully.
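
To make the "auto retry by ScheduledExecutorService" idea in scenario 2 concrete, here is a minimal sketch of a scheduled retry loop. It is not the binder's actual code: the class `ScheduledRetry`, the method `retry` and the parameters `maxAttempts` and `delay` are hypothetical names chosen for illustration, and the sketch retries on any exception, whereas a real implementation would probably retry only on authorization errors.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not part of the Azure SDK or the binder. */
final class ScheduledRetry {
    private ScheduledRetry() { }

    /**
     * Runs the action on the scheduler. Each time it throws, the action is
     * scheduled again after the given delay, until it succeeds or
     * maxAttempts attempts have been made.
     */
    static <T> CompletableFuture<T> retry(ScheduledExecutorService scheduler,
                                          Callable<T> action,
                                          int maxAttempts,
                                          Duration delay) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(scheduler, action, 1, maxAttempts, delay, result);
        return result;
    }

    private static <T> void attempt(ScheduledExecutorService scheduler,
                                    Callable<T> action,
                                    int attemptNo,
                                    int maxAttempts,
                                    Duration delay,
                                    CompletableFuture<T> result) {
        scheduler.execute(() -> {
            try {
                result.complete(action.call());
            } catch (Exception e) {
                if (attemptNo >= maxAttempts) {
                    // Out of attempts: surface the last failure to the caller.
                    result.completeExceptionally(e);
                } else {
                    // Try again later instead of blocking a thread while waiting.
                    scheduler.schedule(
                        () -> attempt(scheduler, action, attemptNo + 1, maxAttempts, delay, result),
                        delay.toMillis(), TimeUnit.MILLISECONDS);
                }
            }
        });
    }
}
```

A caller would pass the operation that currently fails with "Unauthorized access" as the `action` and wait on the returned future with its own timeout.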

Conclusion

  1. Add retry configuration items in the message produce phase.
  2. Change the partition start position to earliest in the message consume phase.
  3. Update the CountDownLatch await time of all cases to 600s.
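
Item 3 can be pictured with a small sketch of a latch-based wait. The class `LatchAwaitSketch` and its methods are hypothetical and only illustrate the pattern; the real ITs wire the consumer through the binder, which is not shown here.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of the latch pattern; not the actual IT code. */
final class LatchAwaitSketch {
    /** Await time from the conclusion above, in seconds. */
    static final long AWAIT_SECONDS = 600;

    private LatchAwaitSketch() { }

    /** Returns a consumer that counts the latch down once the expected payload arrives. */
    static Consumer<String> consumerFor(String expected, CountDownLatch latch) {
        return payload -> {
            if (expected.equals(payload)) {
                latch.countDown();
            }
        };
    }

    /**
     * Blocks until the latch reaches zero or the timeout elapses.
     * Returns true if the message arrived in time, false on timeout.
     */
    static boolean awaitMessage(CountDownLatch latch, long timeoutSeconds)
            throws InterruptedException {
        return latch.await(timeoutSeconds, TimeUnit.SECONDS);
    }
}
```

A test would create `new CountDownLatch(1)`, register `consumerFor(...)` as the message handler, send the message, and assert that `awaitMessage(latch, AWAIT_SECONDS)` returns true. The wait returns as soon as the message arrives, so the long timeout only costs time when a run actually fails.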

fangjian0423 commented 1 year ago

Josh confirmed that there is no lag for the permission to take effect.

Besides, Bill took a look at the pipeline log. He found that the token acquisition succeeded, so it is really strange that the permissions issue occurs.

He also offered a possible explanation: some part of the environment isn't being cleaned up and is thus being configured by some other test resource deployment.

fangjian0423 commented 1 year ago

According to the run log, the role assignment is correct.

Besides, according to the run log, the test resources are removed at the end.

Other useful information: the IT run began at 15:27:15, and the test resources were deployed successfully at 15:24:10.

fangjian0423 commented 1 year ago

Now the permissions error occurs only occasionally.

After team discussion, we decided to keep watching the pipeline. If the permissions error occurs frequently, we will try to fix it.

backwind1233 commented 1 year ago

Service Bus IT run results on 2022-10-21

| Try | Link | Status | Failed IT |
| --- | --- | --- | --- |
| try 1 | link | passed | |
| try 2 | link | passed | |
| try 3 | link | failed | ServiceBusMultiBindersIT, ServiceBusSingleBinderIT |
| try 4 | link | passed | |
| try 5 | link | failed | build from source failed, network issue |
| try 6 | link | failed | ServiceBusMultiBindersIT, ServiceBusSingleBinderIT |
| try 7 | link | failed | ServiceBusMultiBindersIT |
| try 8 | link | passed | |
| try 9 | link | failed | ServiceBusMultiBindersIT, ServiceBusSingleBinderIT, ServiceBusIT |
| try 10 | link | failed | ServiceBusMultiBindersIT, ServiceBusSingleBinderIT, ServiceBusIT |

fangjian0423 commented 1 year ago

Some useful information:

  1. When running the ITs locally without the role assignment, the error log is the same as the pipeline log: com.azure.messaging.servicebus.ServiceBusException: Unauthorized access. 'Listen' claim(s) are required to perform this operation. Resource: ...

  2. The pipeline log shows that the provisioning status of all 4 role assignments is succeeded (there are 4 Microsoft.Authorization/roleAssignments in test-resources.json): Resource Microsoft.Authorization/roleAssignments ... provisioning status is succeeded

  3. Timeline:

    • Test resources deployed successfully at 03:07:00
    • IT run began at 03:07:58
    • Permission error occurred at 03:08:05

So I suppose there is a lag before the permission takes effect.
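
The gaps in the timeline above can be worked out with `java.time`. This is only an arithmetic sketch; the class name `PropagationTimeline` and the method `gapSeconds` are made up for illustration.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Hypothetical helper that computes the gaps in the timeline above. */
final class PropagationTimeline {
    private PropagationTimeline() { }

    /** Seconds elapsed between two same-day timestamps in HH:mm:ss format. */
    static long gapSeconds(String from, String to) {
        return Duration.between(LocalTime.parse(from), LocalTime.parse(to)).getSeconds();
    }
}
```

`PropagationTimeline.gapSeconds("03:07:00", "03:07:58")` gives 58 and `PropagationTimeline.gapSeconds("03:07:00", "03:08:05")` gives 65, so the permission error appeared about one minute after the deployment reported success.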

backwind1233 commented 1 year ago

Suppose there is a lag for permission to take effect.

I think @fangjian0423 is right; we can reproduce this behavior locally.

Reproduce steps

  1. Set up Azure resources for ServiceBusSingleBinderIT.
    1. Create a Service Bus namespace.
    2. Create a queue named queue1.
    3. Create a topic named topic1.
    4. Create a subscription named topicSub in topic1.
  2. Configure application.yml and application-servicebus-binder-single.yml.
  3. Assign the role to the service principal and run ServiceBusSingleBinderIT immediately.
  4. Run ServiceBusSingleBinderIT again after 7 minutes.

Conclusion

Even though the role assignment is completed, it takes 3 to 7 minutes to take effect.
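
One way to cope with such a propagation delay in a test is to poll a cheap probe operation until it stops failing, up to a deadline. The sketch below is an assumption-laden illustration, not what the pipeline does: `RoleAssignmentWait`, `awaitAuthorized`, `probe`, `timeout` and `interval` are hypothetical names, and for simplicity it retries on any exception rather than only on authorization failures.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of the Azure SDK. */
final class RoleAssignmentWait {
    private RoleAssignmentWait() { }

    /**
     * Calls the probe repeatedly until it returns normally or the timeout
     * is used up. Sleeps for the given interval between attempts and
     * rethrows the last failure once no further attempt fits in the budget.
     */
    static <T> T awaitAuthorized(Callable<T> probe, Duration timeout, Duration interval)
            throws Exception {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                return probe.call();
            } catch (Exception e) {
                // Give up if sleeping once more would overshoot the deadline.
                if (deadline - System.nanoTime() < interval.toNanos()) {
                    throw e;
                }
                Thread.sleep(interval.toMillis());
            }
        }
    }
}
```

With the 3 to 7 minutes observed above, a caller might use something like `Duration.ofMinutes(10)` as the timeout and `Duration.ofSeconds(15)` as the interval; both values are guesses, not measured settings.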

fangjian0423 commented 1 year ago

IT results after updating the deployment sequence of the IT resources:

| Link | Status | Failed IT | Reason |
| --- | --- | --- | --- |
| try 1 link | UsGov failed | StorageQueueIT, StorageBlobIT, EventHubBinderBatchModeIT | Storage Permission error |
| try 2 link | passed | / | Warning: Public Service Bus Resource Deploy failed |
| try 3 link | All failed | EventHubsBinderRecordModeIT, EventHubsBinderBatchModeIT | Public -> EventHubsBinderRecordModeIT failed (seems consumer didn't startup, but no error logs), UsGov -> IT passed, remove test resource error, China -> EventHubs Storage Permission error |
| try 4 link | UsGov failed | EventHubsBinderBatchModeIT, StorageBlobIT, StorageQueueIT | Storage Permission error |
| try 5 link | passed | / | Warning: Public remove test resource error |
| try 6 link | China failed | StorageQueueIT | Storage Permission error |
| try 7 link | passed | / | / |
| try 8 link | passed | / | / |
| try 9 link | Public, UsGov failed | EventHubsBinderBatchModeIT | Public -> App Config Resource Deploy failed, UsGov -> Storage Permission error |

The improvement is not good enough; we need to try another way.

fangjian0423 commented 1 year ago

I found something useful in Azure RBAC Troubleshoot:

Symptom - Role assignment changes are not being detected

You recently added or updated a role assignment, but the changes are not being detected. You might see the message Status: 401 (Unauthorized).

Cause 1

Azure Resource Manager sometimes caches configurations and data to improve performance. When you assign roles or remove role assignments, it can take up to 30 minutes for changes to take effect.

Solution 1

If you are using the Azure portal, Azure PowerShell, or Azure CLI, you can force a refresh of your role assignment changes by signing out and signing in. If you are making role assignment changes with REST API calls, you can force a refresh by refreshing your access token.

If you are adding or removing a role assignment at management group scope and the role has DataActions, the access on the data plane might not be updated for several hours. This applies only to management group scope and the data plane.

fangjian0423 commented 1 year ago

I think there are 3 ways to resolve the problem (or improve the IT success rate):

  1. Add a Microsoft.Resources/deploymentScripts resource.

The script content is "start-sleep -Seconds 120": it sleeps 120s to wait for the role assignment to take effect.

I tried 9 rounds after adding the Microsoft.Resources/deploymentScripts resource: 4 passed and 5 failed (resource deploy issue). Looks good.

pros: easy, and no wasted resources. cons: Azure China Cloud doesn't support the Microsoft.Resources/deploymentScripts resource; increases IT run time.

  2. Add a dummy resource as the last deployed resource.

Similar to way 1, but using a dummy resource in place of the sleep script.

I tried deploying Service Bus and Event Hubs as the first 2 resources, but Event Hubs still sometimes hits the storage permission error; see https://github.com/Azure/azure-sdk-for-java/issues/31355#issuecomment-1286488480. So I think we need to add a dummy resource.

pros: easy. cons: wastes a resource; increases IT run time.

  3. Don't use a Service Principal and Role Assignment; use a Connection String.

Refer to Azure RBAC Troubleshoot -> Symptom - Role assignment changes are not being detected.

pros: fixes the problem completely. cons: lots of modifications.

fangjian0423 commented 1 year ago

We chose to add a dummy resource as the last deployed resource.

IT results:

| Link | Status | Failed IT | Reason | Dummy Resource Deploy Time Spent (1 randomly chosen job) |
| --- | --- | --- | --- | --- |
| try 1 link | Public failed | EventHubsBinderRecordModeIT | Consumer didn't startup, but no error log (not permission issue) | 3m54s |
| try 2 link | China failed | EventHubsBinderSyncModeIT | Consumer didn't startup, but no error log (not permission issue) | 2m19s |
| try 3 link | passed | / | / | 3m35s |
| try 4 link | China failed | / | deploy issue | 2m12s |
| try 5 link | passed | / | / | 4m33s |
| try 6 link | passed | / | / | 3m48s |
| try 7 link | passed | / | / | 3m45s |
| try 8 link | UsGov failed | / | deploy issue | 6m63s |
| try 9 link | UsGov failed | / | deploy issue | 7m59s |
| try 10 link | passed | / | / | 4m36s |

There were no permission issues in these runs. Looks good.

fangjian0423 commented 1 year ago

Closing this issue: we added a dummy resource and changed the deployment sequence of the resources to improve the IT success rate.