Azure / azure-sdk-for-go

This repository is for active development of the Azure SDK for Go. For consumers of the SDK we recommend visiting our public developer docs at:
https://docs.microsoft.com/azure/developer/go/
MIT License
1.64k stars 843 forks source link

Provide single API to create VM, NIC and Disks with cascade deletion option #20874

Closed unmarshall closed 2 months ago

unmarshall commented 1 year ago

Feature Request

We are currently migrating from an older azure GO SDK to the latest one. The one that we experiment with is as below:

github.com/Azure/azure-sdk-for-go/sdk/azcore v1.5.0
github.com/Azure/azure-sdk-for-go/sdk/azidentity v1.2.2
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v4 v4.2.1
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v2 v2.2.1

In the older SDK, NIC, disk had to be created separately and then you could reference these resources in compute.VirtualMachine when attempting to create a VM. We have seen problems where if the NIC creation fails then we have to delete all resources created so far since creation of all these resources needs to be created atomically (all or none) in our case otherwise there will be orphaned/dangling resources left behind for which another actor (orphan collector) needs to be introduced (not to mention we will have to also keep track of all the resources that are now orphaned and needs collection/deletion).

The other issue is that the deletion of these resources could also fail and that leaves us no choice w.r.t handling orphaned resources. Not to mention there are already existing issues where there is delay in distributed cache refresh which results in GET calls returning incorrect/stale results (TrackingID#2212190050001399). Given all the current issues the handling of all edge cases due to unreliable API behavior (as observed through a SDK client) makes the code more complex and brittle.

What we need is a single API which a consumer can use to create NIC, Disks and VM with an ability to have cascade deletion option for each of these resources.

In the new API armcompute.VirtualMachine.VirtualMachineProperties has armcompute.NetworkProfile which in turn has []*VirtualMachineNetworkInterfaceConfiguration. Unfortunately this configuration is not as comprehensive as armnetwork.Interface which additionally provides Tags and a way to specify PrivateIPAllocationMethod along with other configuration properties.

If armcompute.VirtualMachine.VirtualMachineProperties could be enhanced to provide complete configurations for NICs and Disks then a single call needs to be made from the consumer perspective. This ensures that if any one of these resources fails to be created then azure server side will ensure atomicity of the create request. When the call returns it will give the exact failure reasons and also ensure that these resources are completely cleaned up. In the absence of this feature, every consumer will have to build this functionality and that is too much duplication of effort and code.

tadelesh commented 1 year ago

@JeffreyRichter @jhendrixMSFT I think the customer is asking for either service API could handle cascade resource creation/deletion atomically or SDK side could provide such feature. I've asked service team to take a look. What's your opinion?

JeffreyRichter commented 1 year ago

I agree with the sentiment, and this is much better handled on the service side; not the client/SDK side especially since it needs to be rock-solid to avoid leaking of (dangling) resources.

tadelesh commented 1 year ago

@Drewm3, @TravisCragg-MSFT, @nikhilpatel909, @sandeepraichura, @hilaryw29, @GabstaMSFT, @ramankumarlive, @ushnaarshadkhan could anyone help?

sandeepraichura commented 1 year ago

thanks for reporting, We are looking into it and keep this thread updated.

TravisCragg-MSFT commented 1 year ago

Moving to a Feature Request. This will be assessed during our fall semester planning..

TravisCragg-MSFT commented 4 months ago

@unmarshall Thank you for your feedback. Unfortunately, we will not be implementing this feature at this time. Please use feedback.azure.com for further feature requests. Also, check out https://learn.microsoft.com/en-us/azure/virtual-machines/delete?tabs=portal2%2Ccli3%2Cportal4%2Cportal5, as this is kind of possible today.

unmarshall commented 4 months ago

@TravisCragg-MSFT unfortunately you have linked a feature for cascade delete which we are well aware of and also use it today. Maybe you can re-read the original issue whose main target was to prevent separate calls to first create a NIC and then reference that NIC when creating the VM. All cloud providers (AWS, GCP) provide single API to create all resources including NIC, disk and VM. It is a pity that you have decided against it especially when we have faced problems with NIC creation subsequently followed by a GET which results in inconsistent results.

In general it is recommended practice to not have to create these entities separately and i am now wondering what are the reasons you decided against it. I hope you take out the time to list down your reasons.

unmarshall commented 4 months ago

/reopen

TravisCragg-MSFT commented 3 months ago

@unmarshall my apologies, I misread this issue, and what you are looking for is a way to create VMs and associated resources that delete if the associated VM is deleted and do this in one call.

This is possible today using the "deleteOption": "Delete" property on the NIC and Disks

If the behavior described in that doc is what you are looking for, let me know and I can send this back to the Python SDK team for potential implementation.

unmarshall commented 3 months ago

Hello @TravisCragg-MSFT.

Thanks for reopening the issue. Ask is simple. Have a single API to create: NIC, Disks and VM. Creation of all these resources should be atomic - either all succeed or none. This ensures determinism and ease of consumption from the consumers.

We are well aware of the deleteOption but that is for deleting these resources. We already use this option today. This is not what i ask for in this issue.

We use the golang SDK so it would nice if you could provide such an atomic API for creation.

To look at other providers:

In AWS: You can do this by specifying all resources in a single RunInstancesInput which can then be provided to RunInstances method. This creates all resources atomically.

In GCP: You can configure all resources in compute.Instance and then use this to create an instance which atomically creates all resources.

JeffreyRichter commented 3 months ago

@TravisCragg-MSFT: This is a service issue, not an SDK issue. All of our ARM SDKs are generated directly from swagger and so the service needs to support the operation, put it in swagger and then all language SDKs pick up the operation. Furthermore, if an SDK has to make multiple calls to the service, then a failure leaves the resources in a corrupt state while the service doing the work can guarantee no service-side state corruption. For this reason, the SDK team would not even consider providing a rich function that performs the multiple operations as we can't guarantee that state would not get corrupted.

So, I want to assign this issue back to you.

unmarshall commented 3 months ago

@JeffreyRichter , @TravisCragg-MSFT - one more case that we have faced recently and repeatedly. A request was made to create a machine (which comprises of creating a NIC, followed by a request to create the VM along with OS and data disks). The NIC got created successfully but due to bad configuration the VM creation failed.

Failed to create VirtualMachine: [ResourceGroup: rg1, Name: vm1], Err: PUT https://management.azure.com/subscriptions/s1/resourceGroups/rg1/providers/Microsoft.Compute/virtualMachines/vm1\\n--------------------------------------------------------------------------------\\nRESPONSE 409: 409 Conflict\\nERROR CODE: OperationNotAllowed\\n--------------------------------------------------------------------------------\\n{\\n  \\\"error\\\": {\\n    \\\"code\\\": \\\"OperationNotAllowed\\\",\\n    \\\"message\\\": \\\"The specified disk size 20 GB is smaller than the size of the corresponding disk in the VM image: 30 GB. This is not allowed. Please choose equal or greater size or do not specify an explicit size.\\\",\\n    \\\"target\\\": \\\"osDisk.diskSizeGB\\\"\\n  }\\n}\\n--------------------------------------------------------------------------------\\n]\"","pid":"1","severity":"INFO","source":"machine.go:116"}

When the NIC got created, no VM was associated to it and that was expected. However if the VM creation fails it still goes ahead and creates an association to the NIC. This is a bit weird. When we now try to clean this up after the correcting the configuration then it does not allow us to delete the NIC which unfortunately got associated to a non-existing VM.

{"log":"    \"message\": \"Nic(s) in request is reserved for another Virtual Machine for 180 seconds. Please provide another nic(s) or retry after 180 seconds. Reserved VM: /subscriptions/s1/resourceGroups/rg1/providers/Microsoft.Compute/virtualMachines/vm1\","}

There is bug in VM association to NIC but in general if NIC and VM are created in a single atomic call, these edge cases can be avoided altogether and bring in determinism w.r.t resource creation and their correct association. Therefore i will stress again the importance of having a single atomic API to create VM and all its associated resources. What you have now is something in between, where you allow creation of VM, OS and Data disk in one single atomic call but leave out the NIC. Till now we have not even faced a single issue with disks (creation or deletion) as they can be created and deleted atomically.

JeffreyRichter commented 3 months ago

Yes, I completely agree with you that the service should offer an atomic operation that performs these multiple actions. @TravisCragg-MSFT: I think this ball is in your court now.

TravisCragg-MSFT commented 2 months ago

@unmarshall Thanks for your feedback. We will be submitting this ask for planning, but it is unlikely that this will go through at this time. Feel free to also request this at feedback.azure.com.