
Azure Data Factory Integration Runtime in Windows Container Sample

Cleanup previous connected node #11

Closed: dsfrederic closed this issue 9 months ago

dsfrederic commented 1 year ago

Hi,

When I restart the Docker container, I get the following error:

Registration of new node is forbidden when Remote Access is disabled on another node. To enable it, you can login the machine where the other node is installed and run 'dmgcmd.exe -EnableRemoteAccess "<port>" ["<thumbprint>"]'.

The other node it's referring to is a previous registration of the same node. After deleting the old node (which has the same name, etc.) registration succeeds again.

Is there a workaround for this? It seems that a cleanup before registering again should do the trick.

Context: I'm running this Windows container in an AKS Edge Essentials environment. Although this is a new and experimental setup, the issue doesn't seem to be related to it.

Update: I'm not able to run the -EnableRemoteAccess command because the old node is already gone.
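For reference, deleting the stale node from outside the container can be scripted with the Az.DataFactory PowerShell module; a minimal sketch, with all resource names as placeholders:

```powershell
# Sketch: remove the stale node from the self-hosted IR so a fresh
# registration can succeed. All names below are placeholders.
Remove-AzDataFactoryV2IntegrationRuntimeNode `
  -ResourceGroupName "my-rg" `
  -DataFactoryName "my-adf" `
  -IntegrationRuntimeName "my-shir" `
  -NodeName "my-node" `
  -Force
```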

dsfrederic commented 1 year ago

@mdrakiburrahman or @johndowns can you provide us with any help?

mdrakiburrahman commented 1 year ago

@dsfrederic Sorry mate, I haven't touched this tech in a year (since the PR I opened 2 years ago was never reviewed, I'm not sure anyone actually monitors this repo at all).

Your best bet might be to raise an Azure Support Ticket and badger the CSS Engineer to get you in contact with the Data Factory Product Team.

dsfrederic commented 1 year ago

@mdrakiburrahman that's what I thought. It's a shame though, because this would make deployment a lot more versatile.

I appreciate you responding to my message!

mdrakiburrahman commented 1 year ago

@dsfrederic - no problem!

Btw, the particular problem you're facing can be solved with Kubernetes primitives. For example, when the container receives a SIGTERM in K8s, you can intercept it, and as long as the Pod has a termination grace period, your cleanup code will run - see this simple example:

https://stackoverflow.com/a/24574672/8954538

I'm not sure if this is possible with PowerShell though: https://github.com/PowerShell/PowerShell/issues/1040
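That said, a Pod preStop hook sidesteps the signal-trapping problem entirely, since the kubelet runs the hook before it sends SIGTERM. A minimal sketch (the image name and the C:\cleanup.ps1 script are placeholders, not part of this repo):

```powershell
# Minimal sketch: a Pod with a termination grace period and a preStop hook.
# The kubelet runs the hook *before* sending SIGTERM, so nothing inside the
# container has to trap signals. Image name and C:\cleanup.ps1 are placeholders.
@"
apiVersion: v1
kind: Pod
metadata:
  name: adf-shir
spec:
  terminationGracePeriodSeconds: 120   # time budget for the hook plus shutdown
  containers:
  - name: shir
    image: myregistry.azurecr.io/adf-shir:latest
    lifecycle:
      preStop:
        exec:
          command: ["powershell.exe", "-File", "C:\\cleanup.ps1"]
"@ | kubectl apply -f -
```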

So my point is, the content in this git repo isn't really ready for Kubernetes/production in its current state; it's a simple PowerShell wrapper around 'dmgcmd.exe' that's good for demos, etc.

The missing piece is a Kubernetes Operator (written in a proper language, like Python, C#, or Go) that can handle production edge cases like the one you're facing:

https://kubernetes.io/docs/concepts/extend-kubernetes/operator

jikuja commented 1 year ago

@dsfrederic Do PRs https://github.com/Azure/Azure-Data-Factory-Integration-Runtime-in-Windows-Container/pull/12 and https://github.com/Azure/Azure-Data-Factory-Integration-Runtime-in-Windows-Container/pull/13 fix your issue?

@byran77 and @xumou-ms, is that offline-node auto-deletion documented in the ADF docs or in the ADF tooling itself?

byran77 commented 1 year ago

@dsfrederic Hi, it seems that high availability is disabled and the old node has leaked. Please set the environment variables ENABLE_HA=true and ENABLE_AE=true when starting the Docker container with the latest image. With these flags, Remote Access is enabled and a new registration will remove expired nodes automatically. You can also manually clear offline nodes that have not yet expired. Hope it helps. A minimal sketch is below.
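The image name and the AUTH_KEY value are placeholders, and AUTH_KEY/NODE_NAME are assumed to match the variables documented for this image:

```powershell
# Sketch: start the container with HA and auto-expiry enabled.
# The image name and AUTH_KEY value are placeholders; AUTH_KEY and NODE_NAME
# are assumed to match the variables documented for this image.
docker run -d `
  -e AUTH_KEY="<ir-authentication-key>" `
  -e NODE_NAME="shir-node" `
  -e ENABLE_HA=true `
  -e ENABLE_AE=true `
  myregistry.azurecr.io/adf-shir:latest
```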

@jikuja We are working on doc update now, thanks!

radulaurentiu02 commented 1 year ago

Hello guys, I think I have a similar problem. I have deployed the runtime into a Windows container (in Azure App Service) and enabled ENABLE_HA=TRUE and ENABLE_AE=TRUE, but when the App Service restarts, a new node is created while the old node is not removed automatically from the self-hosted integration runtime. Is there anything I am missing? Thank you in advance.

jikuja commented 1 year ago

Works for me with ACI: https://github.com/jikuja/azure-data-factory-runtime-app-service/tree/aci, so the underlying mechanism should work.

radulaurentiu02 commented 1 year ago

At least for me, if I restart the App Service, a new node spins up in the integration runtime, but the old node becomes unavailable and ENABLE_AE does not clean up the unavailable (old) node. Maybe I am missing something?

sergiupoliec commented 1 year ago

I'm having the same issue as @radulaurentiu02

xumou-ms commented 1 year ago

Hi @radulaurentiu02 and @sergiupoliec, did you specify AE_TIME? By default, the value of AE_TIME is 600, which means the old node will be removed after 10 minutes (600 seconds).
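As a sketch (the image name is a placeholder), the window can be set explicitly alongside the other flags:

```powershell
# Sketch: set the auto-expiry window explicitly (in seconds).
# 600 is both the default and the minimum; the image name is a placeholder.
docker run -d `
  -e ENABLE_HA=true `
  -e ENABLE_AE=true `
  -e AE_TIME=600 `
  myregistry.azurecr.io/adf-shir:latest
```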

byran77 commented 1 year ago

@radulaurentiu02 @sergiupoliec Hi, currently old nodes are removed automatically only when their offline duration has exceeded AE_TIME (default and minimum: 10 minutes) and a new node is registered. In App Service, this works when the app has been stopped for at least AE_TIME and then starts again; manual cleanup is still needed when restarting the app directly. We are working on a better experience in App Service now. Thanks!

Zengqwei commented 1 year ago

Hi @radulaurentiu02 and @sergiupoliec ,

  1. Only nodes that have been offline for AE_TIME are treated as expired and will be removed automatically.
  2. We only remove expired nodes when Azure App Service starts a new container (registers a new node). If nodes reach expiration but no new container has started, they will still show as offline on the portal. This is a known display issue but doesn't impact any capability; these expired nodes will be removed automatically when the next container starts.
  3. As a next step, we will fix this display issue for a better user experience.

Thanks for your feedback! Let us know if you have any questions.

jikuja commented 1 year ago

@Zengqwei Does it remove only nodes with identical names, or any expired node?

byran77 commented 1 year ago

@jikuja It will remove all expired nodes. In fact, identical names don't matter here, because the IR appends a distinct suffix to the name when the node registers successfully.

byran77 commented 1 year ago

Hi @radulaurentiu02 @sergiupoliec @jikuja Offline nodes are now removed automatically after the expiration timeout. Please let me know if there are any other issues. Thanks!