Closed fangjian0423 closed 1 year ago
The ITs already cover the produce error scenario, see https://github.com/Azure/azure-sdk-for-java/issues/31109
Message production fails. We can get the error message from the error MessageChannel and add retry configuration items.
Message production succeeds but consumption fails, and we can't get the error message from the error MessageChannel.
Message consumption is retried automatically by a ScheduledExecutorService. Besides, we should modify the partition start position (otherwise we can't receive messages produced earlier).
Similar to scenarios 1 and 2.
Event Hubs cases run successfully.
Similar to scenario 1. We can add retry configuration items.
Service Bus cases run successfully.
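The automatic consume retry mentioned above can be pictured with a small sketch. This is not the binder's actual implementation; `ScheduledConsumeRetry`, `consumeWithRetry`, `maxAttempts`, and `delayMillis` are hypothetical names used only for illustration. The idea is that a failed handler call is rescheduled on a `ScheduledExecutorService` after a fixed delay until it succeeds or the attempts run out.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper, not part of any Azure SDK or Spring Cloud Stream API.
final class ScheduledConsumeRetry {

    /**
     * Runs the handler for one message. On a RuntimeException the next attempt
     * is scheduled after delayMillis. The returned future completes with the
     * number of attempts used, or exceptionally once maxAttempts is exhausted.
     */
    static <T> CompletableFuture<Integer> consumeWithRetry(
            ScheduledExecutorService scheduler, T message, Consumer<T> handler,
            int maxAttempts, long delayMillis) {
        CompletableFuture<Integer> result = new CompletableFuture<>();
        attempt(scheduler, message, handler, 1, maxAttempts, delayMillis, result);
        return result;
    }

    private static <T> void attempt(
            ScheduledExecutorService scheduler, T message, Consumer<T> handler,
            int attemptNo, int maxAttempts, long delayMillis,
            CompletableFuture<Integer> result) {
        try {
            handler.accept(message);
            result.complete(attemptNo);
        } catch (RuntimeException e) {
            if (attemptNo >= maxAttempts) {
                // Give up and surface the last failure to the caller.
                result.completeExceptionally(e);
            } else {
                // Reschedule instead of blocking the current thread.
                scheduler.schedule(
                        () -> attempt(scheduler, message, handler, attemptNo + 1,
                                maxAttempts, delayMillis, result),
                        delayMillis, TimeUnit.MILLISECONDS);
            }
        }
    }
}
```

The first attempt runs on the calling thread and only the retries go through the scheduler, so the scheduler has no work to do unless an attempt fails.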
Josh confirmed that there is no lag for the permission to take effect.
Besides, Bill took a look at the pipeline log. He found that token acquisition succeeded, so it is really strange that the permissions issue occurs.
He also gave some advice: some part of the environment isn't being cleaned up and is thus being configured by some other test resource deployment.
Now the permissions error occurs occasionally.
After team discussion, we decided to keep watching the pipeline. If the permissions error occurs frequently, we will try to fix it.
ServiceBusIT test run results on 2022-10-21
4/10 passed
6/10 failed
Try | Link | Status | Failed IT |
---|---|---|---|
try 1 | link | passed | |
try 2 | link | passed | |
try 3 | link | failed | ServiceBusMultiBindersIT ServiceBusSingleBinderIT |
try 4 | link | passed | |
try 5 | link | failed | build from source failed, network issue |
try 6 | link | failed | ServiceBusMultiBindersIT ServiceBusSingleBinderIT |
try 7 | link | failed | ServiceBusMultiBindersIT |
try 8 | link | passed | |
try 9 | link | failed | ServiceBusMultiBindersIT ServiceBusSingleBinderIT ServiceBusIT |
try 10 | link | failed | ServiceBusMultiBindersIT ServiceBusSingleBinderIT ServiceBusIT |
Some useful information:
When the ITs are run locally without the role assignment, the error log is the same as the pipeline log.
com.azure.messaging.servicebus.ServiceBusException: Unauthorized access. 'Listen' claim(s) are required to perform this operation. Resource: ...
The pipeline log shows that the provisioning status of all 4 role assignments is succeeded (there are 4 Microsoft.Authorization/roleAssignments in test-resources.json).
Resource Microsoft.Authorization/roleAssignments ... provisioning status is succeeded
Timeline:
Suppose there is a lag for permission to take effect.
I think @fangjian0423 should be right; we can reproduce this behavior locally.
topicSub
in topic1
Even though the assignment is completed, it takes 3 to 7 minutes to take effect.
After updating the deployment sequence of the IT resources as below:
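To make the lag concrete, here is a minimal sketch of how a test could wait for a permission to become effective before making assertions. It is not what this thread ended up doing and not an Azure SDK API; `PermissionWait`, `waitUntil`, and the probe are hypothetical. The probe is assumed to attempt one cheap operation (for example a receive call) and return false while the service still answers with an unauthorized error.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical test helper, shown only to illustrate waiting out the lag.
final class PermissionWait {

    /**
     * Polls the probe until it returns true or the timeout elapses.
     * Returns true if the probe succeeded, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier probe, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (probe.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With a 3 to 7 minute lag, the timeout would have to be comfortably above 7 minutes, which is the same run-time cost a fixed sleep pays in the worst case.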
IT results:
Link | Status | Failed IT | Reason |
---|---|---|---|
try 1 link | UsGov failed | StorageQueueIT, StorageBlobIT, EventHubsBinderBatchModeIT | Storage Permission error |
try 2 link | passed | / | Warning: Public Service Bus Resource Deploy failed |
try 3 link | All failed | EventHubsBinderRecordModeIT, EventHubsBinderBatchModeIT | Public -> EventHubsBinderRecordModeIT failed (seems the consumer didn't start up, but no error logs), UsGov -> IT passed, remove test resource error, China -> EventHubs Storage Permission error |
try 4 link | UsGov failed | EventHubsBinderBatchModeIT, StorageBlobIT, StorageQueueIT | Storage Permission error |
try 5 link | passed | / | Warning: Public remove test resource error |
try 6 link | China failed | StorageQueueIT | Storage Permission error |
try 7 link | passed | / | / |
try 8 link | passed | / | / |
try 9 link | Public, UsGov failed | EventHubsBinderBatchModeIT | Public -> App Config Resource Deploy failed, UsGov -> Storage Permission error |
The result is not very good; we need to try another way.
Found something useful in Azure RBAC Troubleshoot:
Symptom - Role assignment changes are not being detected
You recently added or updated a role assignment, but the changes are not being detected. You might see the message Status: 401 (Unauthorized).
Cause 1
Azure Resource Manager sometimes caches configurations and data to improve performance. When you assign roles or remove role assignments, it can take up to 30 minutes for changes to take effect.
Solution 1
If you are using the Azure portal, Azure PowerShell, or Azure CLI, you can force a refresh of your role assignment changes by signing out and signing in. If you are making role assignment changes with REST API calls, you can force a refresh by refreshing your access token.
If you add or remove a role assignment at management group scope and the role has DataActions, the access on the data plane might not be updated for several hours. This applies only to management group scope and the data plane.
I think there are 3 ways to resolve the problem (or improve the IT success rate):
1. Add a Microsoft.Resources/deploymentScripts resource.
The script content is "start-sleep -Seconds 120", which sleeps 120s to wait for the role assignment to take effect.
I tried 9 rounds after adding the Microsoft.Resources/deploymentScripts resource: 4 passed and 5 failed (resource deploy issue). Looks good.
pros: easy, does not waste resources
cons: Azure China Cloud doesn't support the Microsoft.Resources/deploymentScripts resource; increases IT run time
2. Similar to way 1, but using a dummy resource in place of the sleep script.
I tried putting ServiceBus and EventHubs as the first 2 resources to deploy, but EventHubs still sometimes hits the storage permission error, see https://github.com/Azure/azure-sdk-for-java/issues/31355#issuecomment-1286488480. So I think we need to add a dummy resource.
pros: easy
cons: wastes resources, increases IT run time
3. Refer to Azure RBAC Troubleshoot -> Symptom - Role assignment changes are not being detected.
pros: will fix it completely
cons: lots of modifications
In the end we chose to add a dummy resource.
IT results:
Link | Status | Failed IT | Reason | Dummy Resource Deploy Time (1 randomly chosen job) |
---|---|---|---|---|
try 1 link | Public failed | EventHubsBinderRecordModeIT | Consumer didn't start up, but no error log (not a permission issue) | 3m54s |
try 2 link | China failed | EventHubsBinderSyncModeIT | Consumer didn't start up, but no error log (not a permission issue) | 2m19s |
try 3 link | passed | / | / | 3m35s |
try 4 link | China failed | / | deploy issue | 2m12s |
try 5 link | passed | / | / | 4m33s |
try 6 link | passed | / | / | 3m48s |
try 7 link | passed | / | / | 3m45s |
try 8 link | UsGov failed | / | deploy issue | 6m63s |
try 9 link | UsGov failed | / | deploy issue | 7m59s |
try 10 link | passed | / | / | 4m36s |
There is no permission issue; looks good.
Closing this issue: adding a dummy resource and modifying the deployment sequence of the resources improves the IT success rate.
Context
Pipeline IT jobs keep failing, see Jobs in run #20221004.1, Jobs in run #20221007.1 and Jobs in run #20220929.1.
It seems both ITs failed because the access control is not active.
Rerunning the ITs may succeed, but I don't think that is the best way. We need to investigate why they keep failing.
Goal
Make the java-spring-test pipeline jobs more stable.