Azure / bicep-registry-modules

[AVM CI Environment Issue]: Error when checking deployment status #2865

Open cecheta opened 1 month ago

cecheta commented 1 month ago

Check for previous/existing GitHub issues

Issue Type?

Bug

Description

There is an intermittent error that can occur when checking the deployment status during CI:

  VERBOSE: 12:13:38 - Checking deployment status in 8 seconds
  VERBOSE: Resource deployment Failed.. (1/3) Retrying in 5 Seconds.. 

  VERBOSE: An error occurred while sending the request.

  VERBOSE: Deploying with deployment name [a-p-ap-b-max-t2-20240729T1207070374Z]
  VERBOSE: Setting context to subscription [***]
  VERBOSE: Using Bicep v0.29.47
  ...
...

When this occurs, a new deployment is started even though the first deployment is still in progress. The second deployment is then likely to fail because two deployments are effectively running at the same time.

Perhaps a retry could be added when checking the deployment status?
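
To illustrate the idea, here is a minimal sketch in Python rather than in the pipeline's own scripting language. It assumes the status check can be isolated into a single call and that a failed request can be told apart from a failed deployment. The names `check_status_with_retry`, `get_status`, and `TransientStatusError` are hypothetical and are not part of the AVM CI scripts. The point is that only the status request is retried, so no second deployment is started while the first one may still be running.

```python
import time


class TransientStatusError(Exception):
    """Hypothetical error: the status request itself failed (for example a
    network error), which says nothing about the deployment's real state."""


def check_status_with_retry(get_status, attempts=3, delay_seconds=5, sleep=time.sleep):
    """Call get_status() and retry only that call on a transient error.

    get_status is an assumed callable that returns the deployment state or
    raises TransientStatusError. No new deployment is started here.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return get_status()
        except TransientStatusError as error:
            last_error = error
            # Wait before the next attempt, but not after the final one.
            if attempt < attempts:
                sleep(delay_seconds)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

If all attempts fail, the caller still has to decide what to do, but at that point it is known that the status request failed repeatedly and not that the deployment itself failed.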

microsoft-github-policy-service[bot] commented 1 month ago

[!IMPORTANT] The "Needs: Triage :mag:" label must be removed once the triage process is complete!

[!TIP] For additional guidance on how to triage this issue/PR, see the BRM Issue Triage documentation.

microsoft-github-policy-service[bot] commented 1 month ago

[!WARNING] Tagging the AVM Core Team (@Azure/avm-core-team-technical-bicep) due to a module owner or contributor having not responded to this issue within 3 business days. The AVM Core Team will attempt to contact the module owners/contributors directly.

[!TIP]

  • To prevent further actions from taking effect, the "Status: Response Overdue 🚩" label must be removed once this issue has been responded to.
  • To avoid this rule being (re)triggered, the "Needs: Triage :mag:" label must be removed as part of the triage process (when the issue is first responded to)!

AlexanderSehr commented 1 month ago

Hey @cecheta, good catch. I think I've seen this happen in a recent APIM deployment. This should be addressed, but it will be challenging. For one, we need to reproduce the issue while debugging. Then, we must hope that ARM actually returns a proper error that we can interpret, because more often than not information is written to the log but not actually returned by the cmdlet. If it turns out that it does not return anything useful, we may need to resort to more drastic means and add logic that picks up after the deployment cmdlet and always pings the deployment itself with some waiting logic (effectively polling the deployment data every x seconds until it's done).

Would you happen to have noticed a service where this occurs somewhat consistently?
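
A rough sketch of the waiting logic described above, written in Python for illustration only. It assumes the deployment state can be queried by name independently of the deployment cmdlet, and that "Succeeded", "Failed", and "Canceled" are the terminal states. `wait_for_deployment` and `get_state` are hypothetical names, not existing functions in the CI scripts.

```python
import time

# Assumed terminal provisioning states; any other value is treated as "still in progress".
TERMINAL_STATES = {"Succeeded", "Failed", "Canceled"}


def wait_for_deployment(get_state, interval_seconds=10, timeout_seconds=3600,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() every interval_seconds until a terminal state is seen.

    get_state is an assumed callable that returns the current state of the
    existing deployment. Nothing is redeployed while waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"Deployment still in state {state!r} after {timeout_seconds} seconds")
        sleep(interval_seconds)
```

With something along these lines, the retry decision could be based on the state the deployment actually ended in and not on whether a single status request went through.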

cecheta commented 1 month ago

Unfortunately, I haven't observed this behaviour consistently for any service.