Azure / deployment-environments

Sample infrastructure-as-code templates to get started with Azure Deployment Environments service.
MIT License

Dev center reports deployment as failed when actually succeeded #11

Open mahomedalid opened 1 year ago

mahomedalid commented 1 year ago

This is a tricky report because it happened only once, and we have not been able to reproduce it since.

We created an environment from a catalog item that took close to 4 minutes to deploy. The dev center project reported that the deployment had failed while the deployment was still in progress in the resource group. The deployment then completed successfully, but the dev center project never updated the environment's status to successful.

Our best guess is a timeout on the dev center side; however, the error did not occur again in subsequent environment creations.

The catalog item is based on this resource group: https://github.com/Azure/aks-baseline-automation/tree/main/IaC/bicep/rg-hub

philricelf commented 3 months ago

I have seen this happen a few times now, with a similar experience: the dev portal reports a failure but the deployment is actually successful. A redeploy fixes it and shows as successful in the dev portal without any changes having been made.

ericaguthan commented 3 months ago

What was the specific failure that your environment reported? If it was a timeout error, it is definitely possible that the reported failure was due to the environment hitting the deployment runtime limit while the resources themselves continued deploying. Details on the limits for ADE deployments can be found here: https://azure.microsoft.com/en-us/pricing/details/deployment-environments/#:~:text=Azure%20Deployment%20Environments%20has%20the%20following%20limits%3A%20Runtime,1%20GB%20Enterprise%20might%20qualify%20for%20additional%20limits. and the limits can be raised via a request following the process described in the link.
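For anyone hitting the same mismatch, here is a minimal sketch of how to check what Azure Resource Manager itself recorded for the environment's resource group, independent of what the dev center shows. It assumes the Azure CLI is installed and logged in, and it shells out to `az deployment group list`. The function names and the sample data are illustrative only, and the resource group name must be supplied by the caller.

```python
import json
import subprocess


def summarize_deployments(deployments):
    """Map each ARM deployment name to its provisioning state."""
    return {d["name"]: d["properties"]["provisioningState"] for d in deployments}


def resources_actually_deployed(deployments):
    """True when at least one deployment exists and every one succeeded."""
    states = list(summarize_deployments(deployments).values())
    return bool(states) and all(state == "Succeeded" for state in states)


def list_deployments(resource_group):
    """Fetch the deployments of a resource group through the Azure CLI."""
    result = subprocess.run(
        ["az", "deployment", "group", "list",
         "--resource-group", resource_group, "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Made-up sample in the shape the CLI returns; replace with
    # list_deployments("<your-environment-resource-group>") for a real check.
    sample = [
        {"name": "main", "properties": {"provisioningState": "Succeeded"}},
        {"name": "network", "properties": {"provisioningState": "Succeeded"}},
    ]
    print(summarize_deployments(sample))
    print("ARM reports success:", resources_actually_deployed(sample))
```

If this reports success while the dev center shows the environment as failed, that would be consistent with the runtime-limit explanation above, though it does not prove it.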

philricelf commented 3 months ago

I can't remember exactly, but looking at those limits I can see how a deployment can easily go over the 10-minute deployment run time limit for the things I was testing with. I was deploying a VNet, an App Service Plan and Web Apps along with managed service identities. I was thinking it could very well be that the web apps get deployed as resources, but whatever checks are performed to confirm success rely on them being fully up, and they seem to take a little while to spin up fully post deployment. Redeployment without changes to the Bicep will of course be faster and then succeed.

I am glad you posted those limits though, as I was not aware of them, and 200 minutes per region, per subscription, for runtime deployments does not seem like a lot, at least not with how I envisaged using it (allowing devs lots of spin up, test and destroy). I will need to rethink things, as this, along with several bugs I'm finding, means I don't think I can go ahead and use this currently.
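To put rough numbers on that, here is a small back-of-the-envelope sketch. It uses only the two figures quoted in this thread (a 10-minute run time limit per deployment and 200 runtime minutes per region, per subscription). The period over which the 200 minutes apply is not stated here, the current limits may differ, and the function names are illustrative.

```python
RUN_LIMIT_MINUTES = 10        # per-deployment run time limit quoted in this thread
RUNTIME_BUDGET_MINUTES = 200  # per region, per subscription, as quoted in this thread


def exceeds_run_limit(deployment_minutes, limit=RUN_LIMIT_MINUTES):
    """True when a single deployment would run past the per-deployment limit."""
    return deployment_minutes > limit


def deployments_within_budget(deployment_minutes, regions=1,
                              budget=RUNTIME_BUDGET_MINUTES):
    """Whole deployments of a given length that fit in the runtime budget."""
    return (budget * regions) // deployment_minutes


if __name__ == "__main__":
    for minutes in (4, 10, 12):
        print(
            f"{minutes}-minute deployment: "
            f"over run limit = {exceeds_run_limit(minutes)}, "
            f"fits {deployments_within_budget(minutes)} times in one region"
        )
```

Under these assumptions, a 4-minute template fits 50 times into one region's budget and a 10-minute template only 20 times, which is why a spin up, test and destroy workflow runs through the default budget quickly.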

ericaguthan commented 3 months ago

Hi @philricelf,

As mentioned, it is a really straightforward process to extend those limits through the official request process if they seem too low for your needs. Each request is personally reviewed by the ADE team, and providing appropriate business justification for the increases you request can help us work with you to get what you need. These limits also apply at a per-region, per-subscription level, so if your use case is spread across more regions, these limits will feel higher.

Can you elaborate on what bugs you are currently facing? We'd be happy to take a look and see if we can get you unblocked! You can also reach out to adesupport@microsoft.com if there are any details related to your bugs that you'd rather not put on a GitHub issue.

philricelf commented 3 months ago

@ericaguthan - sorry for the delay in replying - been a bit full on with some deadlines. I have had to shelve the idea of using ADE for now for the use case I had, due to the issues I was encountering, some of which I have logged or commented on, like this one. I will try to list at a high level what I can remember encountering:

Deployment/deletion failures - as per issue #59 that I logged and also as mentioned above. I think for this one, the documentation should be made a bit clearer about the limits, as I can see that they can affect how you plan your approach. For example, I was using templates that deployed complete environments including vnet, subnets, managed identities, storage accounts, container apps environments and container apps, private endpoints, etc. Obviously these take longer when there are multiple resources, and certain resource types add further delays to spin up. After learning about the job time limits, I can see how ADE may need to be used as a solution to deploy into already existing base infrastructure (so vnet, identities etc. already in place in preconfigured 'dev/test' networks, and just using ADE to spin up VMs, Functions, etc.). I was thinking of potential use cases that would allow certain engineers to use ADE to spin up ready-made test platforms, potentially with quite complex configs, so they can have 'labs' to try things in, but these time limits seem to handicap what is possible a bit.

Global catalog still showing for end users after being deleted. This happened when I did the following:

  1. Initially I deployed a catalog at dev center level
  2. I realised that this was automatically available in all projects, so I deployed a catalog at project level (same source, but I had to give the catalog a different name) and deleted the global one
  3. Tried deploying resources from the dev portal and could see both catalogs, which showed both sets of definitions - confusing for the end user
  4. Tried various things like deleting the project-level one (so no catalogs) and redeploying, to see if it was caused by deploying from the same source at the same time, but no change. As there were no longer any catalogs at global level, I tried deploying the default samples one to see if that forced an update, but the old 'ghost' one still remained

Other comments/suggestions: It would be useful to have a more detailed catalog interface for end users, as when you launch an environment you are presented with limited info. Making catalog navigation front and center, and providing a launch button from there that steps into the launch details, would be a better experience.

I could not see any way for an admin to limit, in the manifest, the length of time that an environment can be running. E.g. users can set an expiry when launching and admins can add/edit this later, but I would want to be able to configure environments to have 'max lifetime' = 7 days, for example, only overridable by an admin, and it would be nice if there was a 'request extension' button for the end user.
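To make the request concrete, here is a minimal sketch of the rule being asked for. This is not an existing ADE feature or API; the names `MAX_LIFETIME` and `clamp_expiry` are hypothetical and only illustrate the intended behaviour.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical admin-configured policy, not an ADE setting.
MAX_LIFETIME = timedelta(days=7)


def clamp_expiry(created_at, requested_expiry,
                 max_lifetime=MAX_LIFETIME, is_admin=False):
    """Return the expiry that the policy would allow.

    Regular users are capped at created_at + max_lifetime;
    admins may override the cap.
    """
    latest_allowed = created_at + max_lifetime
    if is_admin or requested_expiry <= latest_allowed:
        return requested_expiry
    return latest_allowed


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    wanted = created + timedelta(days=19)
    print("user gets: ", clamp_expiry(created, wanted))
    print("admin gets:", clamp_expiry(created, wanted, is_admin=True))
```

A 'request extension' button would then amount to asking an admin to apply the override for one environment.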

I couldn't see documentation that showed how to use dynamic values in a manifest. For example, having parameter1 require input and parameter2 display different options (or not be displayed at all) based on the chosen value of parameter1. That could be a powerful feature. There may be more advanced configurations available than what I found in the documentation, of course, but if so it would be good to have this more readily available also.
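As an illustration of the behaviour described, here is a small sketch of conditional parameters. The `show_if` key and the schema shape are invented for this example and are not part of any documented ADE manifest format.

```python
# Hypothetical parameter schema: "show_if" is an invented key, not ADE syntax.
PARAMETERS = [
    {"name": "parameter1", "options": ["vm", "function"]},
    {
        "name": "parameter2",
        "options": ["small", "large"],
        "show_if": {"parameter1": "vm"},
    },
]


def visible_parameters(parameters, chosen):
    """Return the names of parameters whose show_if rule matches the chosen values."""
    visible = []
    for parameter in parameters:
        rule = parameter.get("show_if", {})
        if all(chosen.get(key) == value for key, value in rule.items()):
            visible.append(parameter["name"])
    return visible


if __name__ == "__main__":
    print(visible_parameters(PARAMETERS, {"parameter1": "vm"}))
    print(visible_parameters(PARAMETERS, {"parameter1": "function"}))
```

Here parameter2 is offered only when parameter1 is set to the value named in its rule.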

Thanks Phil