Azure / azure-cli

Azure Command-Line Interface

'az network nsg list' returns an incorrect result #17910

Open glazychev-art opened 3 years ago

glazychev-art commented 3 years ago

Describe the bug

Hello, I use az aks create ... to create a cluster. After it is created successfully, I want to get information about the NSG using az network nsg list, but it returns an empty result. I also tried az resource list --resource-type Microsoft.Network/networkSecurityGroups, which does not help either. In the Portal, however, I can see that the NSG was created successfully. I need to wait a while (about 10 minutes) to get the correct result from az network nsg list. The same happens with az network vnet.

Could you please explain what needs to be triggered to get the correct behavior? Or do I just need to call an additional wait?

Additional context

I found a very similar issue, #15303, but it has gone unanswered for a long time.

Thank you.

yonzhan commented 3 years ago

network

ghost commented 3 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @aznetsuppgithub.

Issue Details
Author: glazychev-art
Assignees: kairu-ms
Labels: `Network`, `OKR3.2 Candidate`, `Service Attention`, `feature-request`
Milestone: S188

kairu-ms commented 3 years ago

The CLI calls the service API directly to get the result, so only the service team can explain why there is a delay.

mstavrev commented 3 years ago

Hi, I am the author of #15303 and can confirm this is basically the same problem I have. This problem has existed for more than nine months now (I had it much earlier, but only reported it in March, as I first had to exhaust all other possible explanations).

This is a drastic inconsistency with the data shown by Azure Portal itself, and it makes some Azure-related operations hard to automate. Can we at least get some input from the 'Service Team' on how to mitigate it? Maybe there is an API call we can make to force the API to refresh the information and return the actual, live state?

yonzhan commented 3 years ago

network service team should look into this issue.

mstavrev commented 3 years ago

How can it be escalated? Is there any time-frame for at least getting on their agenda?

kairu-ms commented 3 years ago

Add @FumingZhang from AKS team.

@FumingZhang can you help explain why there is a delay in getting the NSG and VNet values after an AKS cluster is created?

FumingZhang commented 3 years ago

Agree with @yonzhan, members of the network team should be asked to investigate the reason.

andyliuliming commented 3 years ago

@kairu-ms if the resource is created and people cannot use "az network nsg list" to list it, then it must be a latency or cache issue in the network RP. But we need someone from the network team (or the ARM team) to confirm it.

mstavrev commented 3 years ago

Today, in one of my AKS deployments, the NSG appeared in the listing after about 4:30 hours!

How come Azure Portal (which is basically just a fancy UI) displays the correct details? Doesn't it use the same APIs that are available to us programmatically?

kairu-ms commented 3 years ago

Hi @tzwlai, appreciate if you could help with this issue.

mstavrev commented 3 years ago

Any update? Or at least a time frame for resolving the problem?

kairu-ms commented 3 years ago

I've created ICM 249121407 to follow this issue.

kairu-ms commented 3 years ago

@mstavrev which region are you using? Could you add the commands to reproduce this issue, without sensitive information?

mstavrev commented 3 years ago

I am using West Europe region.

Here is an example script to illustrate the issue: aks-nsg-issue.txt

After reaching step 5, I also tried requesting the NSG list with other accounts that have access to the subscription, but it makes no difference.

Edit: There is a small issue with the script, which I wrote too quickly. It lists the NSGs in the resource group of the AKS cluster itself, but the NSG is created in a separate resource group that AKS creates automatically for the nodes and other network resources. Please run the script up to this step, but then use the name "MC_akstest-rg_aks04_westeurope" for the group being queried. You can also omit the -g option if the subscription has no NSGs other than the one created with the AKS cluster.

ghost commented 3 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @aznetsuppgithub.

Issue Details
Author: glazychev-art
Assignees: kairu-ms
Labels: `Network`, `Service Attention`
Milestone: Backlog

mstavrev commented 3 years ago

Can you provide some work-around?

It would be sufficient to know how the NSG created automatically is named, because when I use:

az network nsg show -g AKS_nodepool_RG --name The_name_of_the_NSG_I_get_from_Azure_Portal

I can read the details of the NSG, although at the same time, calling:

az network nsg list -g AKS_nodepool_RG

yields an empty list :(

kairu-ms commented 3 years ago

> Can you provide some work-around?
>
> It would be sufficient to know how the NSG created automatically is named, because when I use:
>
> az network nsg show -g AKS_nodepool_RG --name The_name_of_the_NSG_I_get_from_Azure_Portal
>
> I can read the details of the NSG, although at the same time, calling:
>
> az network nsg list -g AKS_nodepool_RG
>
> yields an empty list :(

The NSG's resource group is different; it is not the same as the AKS resource group.

mstavrev commented 3 years ago

Yes, of course I know that. That's why I've used the pseudo-name "AKS_nodepool_RG" instead of just "AKS_RG" to indicate exactly that. Believe me, I am trying to list the NSG that AKS creates automatically on the correct RG (not the one of the AKS itself).
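For readers following along, the distinction between the two resource groups can be sketched in shell. This is a minimal sketch, not a fix for the delay discussed in this thread: the function names are hypothetical, and it assumes a logged-in az CLI. The node resource group name is read from the cluster's nodeResourceGroup property rather than typed by hand.

```shell
#!/usr/bin/env bash
# Sketch only: the function names below are illustrative, not part of the az CLI.

# Print the node resource group (the "MC_..." group) of an AKS cluster.
node_resource_group() {
  local cluster_rg="$1" cluster_name="$2"
  az aks show --resource-group "$cluster_rg" --name "$cluster_name" \
    --query nodeResourceGroup --output tsv
}

# List the names of the NSGs in the cluster's node resource group.
list_cluster_nsgs() {
  local node_rg
  node_rg="$(node_resource_group "$1" "$2")" || return 1
  [ -n "$node_rg" ] || return 1
  az network nsg list --resource-group "$node_rg" --query "[].name" --output tsv
}
```

Calling list_cluster_nsgs with the cluster's own resource group and cluster name queries the right group, but the list call can still come back empty while the replication delay lasts.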

kairu-ms commented 3 years ago

This is a known replication issue that ARM is currently facing and working to improve. @jennyhunter-msft Please help update the status of the replication implementation work. Thanks.

mstavrev commented 2 years ago

Is there any update to this problem?

jennyhunter-msft commented 2 years ago

@kairu-ms - Is there a related IcM we can use for correlation and investigation? Overall, there is an ongoing ARM replication re-architecture that is highly invested to make ARM more scalable and resilient. However, without the specific details of the calls, I couldn't say how this call is currently being affected and at what points you would see improvements.

kairu-ms commented 2 years ago

> @kairu-ms - Is there a related IcM we can use for correlation and investigation? Overall, there is an ongoing ARM replication re-architecture that is highly invested to make ARM more scalable and resilient. However, without the specific details of the calls, I couldn't say how this call is currently being affected and at what points you would see improvements.

I've already created ICM 249121407 with detailed information to follow this issue.

jscheeringa commented 2 years ago

I am hoping this may help others, as well as provide some venting for myself...

I have been plagued by this issue for a long time now, I would say years!! It seriously hinders one's ability to reliably automate any sort of deployment through Azure DevOps or even with home-rolled apps, utilities or services. And probably more tech, broadly.

I have experienced this problem with ARM and with PowerShell (the AzureRM and Az modules, any version). I have experienced it making calls directly to REST endpoints. And now I have experienced it with the az CLI, after switching all code to use it because of the touted reliability. Each time I switched to new versions or technology in the hope of getting away from this issue. You create a resource and it may or may not show up immediately.

I first noticed it when trying to dynamically collect the resources in a resource group (VNet, public IP, load balancer or NSG) so that I could apply tags after creating AKS via ARM. I have even experienced this with more basic resources like storage. It doesn't matter what the initiating technology is. I have gone as far as to implement all sorts of ridiculous sleeps and retry logic that can cause deployments to take several hours instead of minutes.
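The kind of sleep-and-retry logic described above might look like the following. This is a minimal sketch of a bounded polling helper, not the author's actual code: wait_for_output is a hypothetical name. It assumes the wrapped command prints nothing when the result is empty, which holds for az list commands run with --output tsv but not with the default JSON output, where an empty list prints [].

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_output is a hypothetical helper, not an az CLI feature.

# Retry a command until it prints non-empty output or the attempts run out.
# Usage: wait_for_output MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for_output() {
  local max_attempts="$1" delay="$2" attempt=1 output
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    # Treat a failed command the same as an empty result and retry.
    output="$("$@" 2>/dev/null)" || output=""
    if [ -n "$output" ]; then
      printf '%s\n' "$output"
      return 0
    fi
    # Do not sleep after the final attempt.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, wait_for_output 20 30 az network nsg list -g AKS_nodepool_RG --query "[].name" --output tsv polls for up to roughly ten minutes. Given the delays of hours reported in this thread, a bounded wait like this fails cleanly when the budget runs out. It does not remove the delay.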

I have opened tickets with ARR support, and their response (my summation), after much back and forth, was that you can't rely on these calls, and that I should make calls to the Resource Graph instead, as it gets updated immediately after resource creation, while the older resource management path updates much more slowly. Resource updates also take time to propagate to other regions. I found this particularly troublesome with DevOps, as there is no guarantee when using hosted agents that you will end up in the same region that you are deploying to, and when moving from stage to stage your region "could" change and that region may not have had the updates.

I will say this. Graph explorer has been VERY consistent so far. And immediate. I can confirm it is more reliable, but it doesn't cover all the needs for which you would rely on these other tools or calls. I have seen delays on created resources of anywhere from 5 minutes to over 24 hours!!

An example of a graph explorer query that is guaranteed to return resources would be:

az graph query -q "Resources | project Name=name, Location=location, ResourceGroupName=resourceGroup, ResourceType=type, Tags=tags, ResourceId=id | where ResourceGroupName =~ '$($resourceGroupName)'"

But of course, even if it returns the items correctly, if you need to operate on the resource you are probably trying to leverage other commands that won't work because the resource isn't visible, except to graph.

This is a serious problem that really needs MS attention. I can appreciate the immense complexity involved, but this is a fundamental problem for anyone trying to really embrace modern development principles and DevOps.

Please put some fire on this. I would really like to be able to rely on the resources I think I have created.

mukulcho commented 2 years ago

@jscheeringa thanks for summarizing this issue. We have been experiencing the same issue since last week, and the MS PG is unable to give us a root cause other than "Slow ARM replication issue between different regions". Our code relies heavily on az resource list for DevOps automation. As you mentioned, az graph will not fix this issue, since we need to operate on the resources. Requesting MS to prioritize a resolution.

mukulcho commented 2 years ago

> Overall, there is an ongoing ARM replication re-architecture that is highly invested to make ARM more scalable and resilient.

Can you please provide details on the ARM replication re-architecture and its current status? The issue reported here is more than a year old, with no update on status or resolution.

jennyhunter-msft commented 2 years ago

We've done a lot of work over the past year to reduce these types of replication issues. In addition to our ongoing short-term efforts, as you mentioned, Azure Resource Graph should be able to reliably list these resources. The platform overall will be betting more on Resource Graph in the future, and it already powers close to 80% of portal resource list experiences today. There shouldn't be an issue with single resource reads or writes since we've added additional retry logic that verifies its existence, so please let me know if you confirm seeing issues there. To completely solve this issue involves entirely restructuring the foundation of call routing in Azure, and we've had to restart the design process multiple times since our scale of load far exceeds any of the currently available storage technologies.
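The two points above (listing through Resource Graph, then reading single resources) could be combined roughly as follows. This is a minimal sketch under assumptions: the function names are hypothetical, the resource-graph CLI extension is assumed to be installed, and the --query "data[].id" path assumes a recent extension version whose output wraps rows in a data field. Whether the single-resource read succeeds during the replication delay is the claim made above, not something this sketch verifies.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative; requires the resource-graph
# extension (az extension add --name resource-graph).

# List NSG resource IDs in a resource group through Azure Resource Graph.
graph_nsg_ids() {
  local rg="$1"
  az graph query \
    -q "Resources | where type =~ 'microsoft.network/networksecuritygroups' and resourceGroup =~ '${rg}' | project id" \
    --query "data[].id" --output tsv
}

# Read each NSG individually by ID instead of relying on the list call.
show_nsgs_via_graph() {
  local ids
  ids="$(graph_nsg_ids "$1")" || return 1
  [ -n "$ids" ] || return 1
  printf '%s\n' "$ids" | while IFS= read -r id; do
    # Redirect stdin so the command cannot consume the remaining IDs.
    az network nsg show --ids "$id" < /dev/null
  done
}
```

Calling show_nsgs_via_graph AKS_nodepool_RG prints one NSG document per ID found, and returns non-zero when Resource Graph reports no NSGs in that group.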

mukulcho commented 2 years ago

@jennyhunter-msft We fully understand the complexity and scale involved. We have more than 100 repos (with 3 versions) of the product that use the az resource list command, so changing the code to use az graph would be a huge effort. We are using az resource list to list a single resource, and it still fails intermittently. The same code worked fine for the last 2 years; now we are seeing a lot of error messages in our automation.