HodorNV / ALOps

[Riddle] Deployment hangs when run from Azure-hosted server #775

Open PeterConijn opened 1 month ago

PeterConijn commented 1 month ago

We have an issue in our releases, and I am not sure this is directly linked to ALOps, but I also do not know where else to go, so I am hoping the community can offer some insight.

Story

We used to run our build and deployment agents on our own servers, which worked fine. Due to ageing hardware, we decided to make the move to Azure-hosted servers. This included a server to house our agents.

As far as building goes, this works perfectly and we have no complaints. It is during deployment (Release pipeline) to SaaS that we are seeing an issue.

Deployment Background: App Structure

Our app structure is not entirely dissimilar to Microsoft's: we have a small System App with some basic functionality, licensing stuff and whatnot. On top of that we have a "Base App" (Dysel W1) with the bulk of our stuff. On top of that we have localization apps and apps with specialized functionalities.

The Issue

Ever since we moved to the Azure server, the deployment of the System app and our Dysel W1 Base App just... hangs. It performs the deployment without issue (including uninstalling and reinstalling all the dependent apps), but then it just... stops. No error, no problem; it just keeps running and doing nothing until the timeout hits (the timeout is set to 60 minutes). This only happens for those two apps. All the others are fine.

Worse: it's not consistent. Yesterday it deployed the System app to our NA Accept environment inside 5 minutes; today, it fails. I will attach the logs, but I am out of ideas. Since there is no error, the Event Log on the server also shows nothing.

Task log: tasklog_11.log

YAML: Deployment_NA_yml.txt

Telemetry on hanging deployment for Dysel NL [image]

Trace log for hanging deployment for Dysel NA [image]

PeterConijn commented 1 month ago

I checked memory and CPU on the agent host and they're both holding steady between 15% and 30%; disk activity fluctuates, but is not consistently high.

Once a deployment hangs, I can perform other deployments (of other apps) on that environment without issue, so it is really the pipeline/agent that seems to hang on deployment completion and not the deployment itself. I just don't know why...

PeterConijn commented 1 month ago

Interesting development. I decided to publish the W1 app from VSCode.

It started off as normal:

[2024-07-04 13:26:04.77] Publishing AL application using launch configuration 'DO NOT PUBLISH: Dysel NL Acceptance Sandbox'.
[2024-07-04 13:26:06.18] Acquiring token for authority https://login.microsoftonline.com/[[REDACTED]] using correlation [[REDACTED]].
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code [[REDACTED]] to authenticate.
[2024-07-04 13:27:03.53] Authenticated as user '*******@*****.**' in tenant '[[REDACTED]]'. Please note that these credentials are cached. Clear the credentials cache to authenticate as another user.
[2024-07-04 13:27:03.53] Targeting Dynamics 365 Business Central environment tenant '[[REDACTED]]'.
[2024-07-04 13:27:03.53] Sending request to https://api.businesscentral.dynamics.com/v2.0/DyselNL-Acceptance/dev/metadata?tenant=[[REDACTED]]
[2024-07-04 13:27:04.57] Publishing package to tenant '[[REDACTED]]'
[2024-07-04 13:27:04.59] Sending request to https://api.businesscentral.dynamics.com/v2.0/DyselNL-Acceptance/dev/apps?tenant=[[REDACTED]]&SchemaUpdateMode=synchronize&DependencyPublishingOption=default

The last line is the request being sent (as it normally would be). Normally, you'd then get a completion event and the browser would open.

But that did not happen. The update itself completed, but the callback never came back. Either it is not being sent by the sandbox, or the callback listener hits a timeout. I just don't know what to do about that.
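
For reference, one way to check from the outside whether the publish actually completed on the service, independent of the missing callback, is the standard Business Central automation API. A minimal PowerShell sketch, assuming an AAD access token is already at hand; the 'Dysel*' filter is illustrative and the environment name is taken from the log above:

# Confirm server-side state of the extensions, even though the dev-endpoint callback never arrived.
$token       = '<AAD access token for https://api.businesscentral.dynamics.com>'
$environment = 'DyselNL-Acceptance'

$base    = "https://api.businesscentral.dynamics.com/v2.0/$environment/api/microsoft/automation/v2.0"
$headers = @{ Authorization = "Bearer $token" }

# Any company works for automation API calls
$companyId = (Invoke-RestMethod -Headers $headers -Uri "$base/companies").value[0].id

# Installed extensions and their versions
(Invoke-RestMethod -Headers $headers -Uri "$base/companies($companyId)/extensions").value |
    Where-Object displayName -like 'Dysel*' |
    Select-Object displayName, versionMajor, versionMinor, isInstalled

# Recent deployment operations and whether any are still in progress
(Invoke-RestMethod -Headers $headers -Uri "$base/companies($companyId)/extensionDeploymentStatus").value |
    Select-Object name, appVersion, operationType, status, startedOn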

PeterConijn commented 1 month ago

This is turning into a bit of a monologue, but I have also posted this on Microsoft Yammer here.

waldo1001 commented 1 month ago

Well - not much I can say.

I'm going to assume you are working with an AppSource-app, which you want to test in an online sandbox, and that's why you use the Dev-endpoint?

We have seen some new behavior in other cases (with the normal publish from PowerShell) where it doesn't report back any issues either, although that case is different: there is indeed an error, only the PowerShell command doesn't throw it.

I don't have any suggestions to test, quite honestly .. 🤔.

PeterConijn commented 1 month ago

Thanks for responding, @waldo1001. I figured there was little you could do, but I thought I'd post it in case someone in the community might have encountered this before. If I do figure out the cause and - more importantly - the solution, I will be sure to update this.

MortenRa commented 1 month ago

This seems to be related to the same issue, which was never solved: https://github.com/HodorNV/ALOps/issues/660. We still experience the same issue; we set the timeout to 15 minutes and let the next step continue.

I tried another approach of uninstalling the apps before publishing, but the admin center API cannot see apps in dev scope: https://github.com/HodorNV/ALOps/issues/634
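
For reference, the timeout-and-continue stopgap described above maps onto ordinary Azure Pipelines step settings. A minimal YAML sketch; the task name and inputs are placeholders for whatever deploy step the pipeline actually uses:

steps:
# Placeholder deploy step - substitute the actual ALOps publish/deploy task and its inputs
- task: ALOpsExtensionAPI@1
  displayName: 'Deploy app to SaaS sandbox'
  timeoutInMinutes: 15     # stop waiting after 15 minutes instead of the default 60
  continueOnError: true    # let the following steps run even if this one times out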

PeterConijn commented 1 month ago

@MortenRa In our case it seems to be exclusive to apps that have dependent apps. The stages for apps that are "on top" of that hierarchy, with no apps depending on them, deploy just fine and dandy.

Is that something you have encountered as well?

MortenRa commented 1 month ago

@PeterConijn Yes, this is related to dependency apps, especially the lowest level (the library app), which needs to uninstall/reinstall its dependent apps after publishing and which, in our case, is also the biggest app we have.

PeterConijn commented 1 month ago

Weird question, maybe, @MortenRa, but where do you host your agents? We had no issue with this while we ran deployments from our own servers, but started seeing it when we moved deployment to the agents on our Azure-hosted servers.

MortenRa commented 1 month ago

@PeterConijn That could be relevant; we also host our agents on Azure servers.

waldo1001 commented 1 month ago

We don't have any agents on Azure. I would assume some kind of port/routing problem in that case? 🤔

If it turns out this is the problem, know that we can help you out with an on-prem agent service: https://www.alops.be/build-agent/

ChrisKappe commented 1 month ago

This looks quite similar to the issues we have been having for some weeks: Azure VMs run the build agents, which deploy to the ALOps External Deployer. We still deploy to 18.6 on-premises in a private cloud. We do a batch publish with the ALOps Extension API, which now times out after the defined 1 hour. If I deploy locally with PowerShell, each instance is done in approx. 10 minutes.

My app structure is similar, with some "heavy" dependent apps and some not-so-heavily dependent apps.

What I can see in my logs is that ALOps Extension API 1.464.6120 worked and, for example, 1.465.6167 times out in our case. The other input I can give is that I saw another difference in the ALOps Extension API.

1.464.6120 used Get-BCArtifactUrl with Latest/OnPrem/W1, and 1.465.6167 uses Weekly/Sandbox/W1. But that should not make such a difference.
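
For anyone who wants to compare the two artifact selections mentioned above, the corresponding BcContainerHelper calls look roughly like this (a sketch; the mapping to the ALOps versions is as reported in this comment):

# Requires the BcContainerHelper PowerShell module
Import-Module BcContainerHelper

# Selection reportedly used by ALOps Extension API 1.464.6120: latest OnPrem W1 artifact
$oldUrl = Get-BCArtifactUrl -type OnPrem -country w1 -select Latest

# Selection reportedly used by 1.465.6167: weekly Sandbox W1 artifact
$newUrl = Get-BCArtifactUrl -type Sandbox -country w1 -select Weekly

$oldUrl
$newUrl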

PeterConijn commented 1 month ago

Honestly, I still suspect a timeout, since deploying top-level apps (with no dependent apps) consistently works and our deployments of the "system" and "base" apps only occasionally work. I'm having our sysadmin look into it and will report back any findings.

PeterConijn commented 1 month ago

I am attaching the build agent logs of a hanging run.

Start-Worker_20240710-074555-utc.log End-Worker_20240710-074937-utc.log

PeterConijn commented 1 month ago

This weekend, I used the release of 24.3 to perform a little experiment. I scheduled the update of the SaaS sandboxes for this weekend, knowing it would uninstall and unpublish all dev-deployed apps. This meant that our System and W1 apps no longer had dependent apps installed.

Lo and behold, the System app deployed without a hitch, and so did the W1 app on our NA SaaS environment. The NL environment did display the same issue, but we have noticed that it is consistently slower than our NA one.

This seems to support the theory that the deployment request to https://api.businesscentral.dynamics.com/[....] times out. Does anyone know of a way to increase that timeout so we can definitively test this theory?
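
One way to test the theory outside the pipeline (and outside any agent-side timeout) might be to publish the same .app through the dev endpoint from PowerShell and simply time how long the service takes. A rough sketch using BcContainerHelper; whether your module version supports publishing to an online environment via -bcAuthContext/-environment should be verified, and the app file path is a placeholder:

Import-Module BcContainerHelper

# Interactive device login to the tenant (same device-code flow as in the VS Code log above)
$authContext = New-BcAuthContext -includeDeviceLogin

# Placeholders: environment name as in the earlier log, app file path is illustrative
$publishParams = @{
    bcAuthContext = $authContext
    environment   = 'DyselNL-Acceptance'
    appFile       = 'C:\Build\Dysel_W1.app'
}

# Time how long the dev-endpoint publish (including sync and dependent-app handling) really takes
$elapsed = Measure-Command { Publish-BcContainerApp @publishParams }
"Publish took $([int]$elapsed.TotalMinutes) minutes"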

PeterConijn commented 3 weeks ago

I ran another test that entailed manually uninstalling and unpublishing a few apps before running the System App and W1 deployments. As expected, the deployments went off without a hitch, so my request-timeout theory still holds.

So, as a stopgap measure, @waldo1001: is there a way to change the release pipeline to first uninstall and unpublish certain apps from a SaaS sandbox? Then we can add this step before deploying the System and W1 (the other apps will be updated in the next steps anyway).
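
In the meantime, the uninstall part can at least be scripted outside ALOps against the standard automation API. A hedged sketch; note that dev-scope apps may not be visible or manageable this way, per the earlier comments, and the app name and environment are hypothetical:

# Uninstall one dependent app from a SaaS sandbox before redeploying the base apps.
$token       = '<AAD access token for https://api.businesscentral.dynamics.com>'
$environment = 'DyselNL-Acceptance'      # placeholder environment name
$appName     = 'Dysel Localization NL'   # hypothetical app to uninstall first

$base    = "https://api.businesscentral.dynamics.com/v2.0/$environment/api/microsoft/automation/v2.0"
$headers = @{ Authorization = "Bearer $token" }

$companyId = (Invoke-RestMethod -Headers $headers -Uri "$base/companies").value[0].id

# Find the extension record and call its bound uninstall action
$ext = (Invoke-RestMethod -Headers $headers -Uri "$base/companies($companyId)/extensions").value |
       Where-Object displayName -eq $appName

Invoke-RestMethod -Method Post -Headers $headers -ContentType 'application/json' -Body '{}' `
    -Uri "$base/companies($companyId)/extensions($($ext.id))/Microsoft.NAV.uninstall"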

PeterConijn commented 1 week ago

We have worked around this by disabling/restructuring some install code, so that the reinstall of dependent apps in the sync step goes faster and stays within the timeout threshold. This seems to be doing the trick for now.

It is, however, not a solution.

waldo1001 commented 5 days ago

This issue is getting complicated, as multiple cases seem to be mixed together.

So, as a stopgap measure, @waldo1001: is there a way to change the release pipeline to first uninstall and unpublish certain apps from a SaaS sandbox? Then we can add this step before deploying the System and W1 (the other apps will be updated in the next steps anyway).

We cannot uninstall dev-scope deployments, as @MortenRa already mentioned; Microsoft doesn't allow that, apparently. But the "AdminCenter" step does allow an app uninstall for a normal app. I'm just not sure that would help anyone here.

On timeouts, there's nothing we can do, I'm afraid. ALOps calls the Microsoft APIs; if they time out, they time out. All we can do is either fail or not fail the pipeline in that case.