Closed dsfrederic closed 9 months ago
@mdrakiburrahman or @johndowns can you provide us with any help?
@dsfrederic Sorry mate, I haven't touched this tech in a year (since the PR I opened 2 years ago wasn't looked at, I'm not sure if anyone actually monitors this repo at all).
Your best bet might be to raise an Azure Support Ticket and badger the CSS Engineer to get you in contact with the Data Factory Product Team.
@mdrakiburrahman that's what I thought. It's a shame, though, because this would make deployment a lot more versatile.
I appreciate you taking the time to respond!
@dsfrederic - no problem!
Btw, the problem you're facing in particular can be solved with Kubernetes primitives. For example, when the container gets a SIGTERM
in K8s, you can intercept it, and as long as the Pod has a termination grace period, your cleanup code will run. See this simple example:
https://stackoverflow.com/a/24574672/8954538
I'm not sure if this is possible with PowerShell though: https://github.com/PowerShell/PowerShell/issues/1040
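To illustrate the idea, here's a minimal Pod spec sketch showing where the grace period and a shutdown hook live. The Pod name, image, and `deregister.ps1` script are hypothetical placeholders, not part of this repo:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shir-node                                 # placeholder name
spec:
  # Time Kubernetes waits between sending SIGTERM and SIGKILL;
  # any cleanup/deregistration logic must finish within this window.
  terminationGracePeriodSeconds: 120
  containers:
    - name: shir
      image: example.azurecr.io/adf-shir:latest   # placeholder image
      lifecycle:
        # Alternative to trapping SIGTERM in-process: a preStop hook
        # runs before the signal is delivered to the container.
        preStop:
          exec:
            command: ["pwsh", "-File", "C:\\deregister.ps1"]   # hypothetical script
```

A preStop hook sidesteps the PowerShell signal-handling limitation linked above, since the hook runs as a separate process before SIGTERM is sent.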
So my point is, the content in this git repo isn't really ready for Kubernetes/production in its current state; it's a simple PowerShell wrapper around `dmgcmd.exe` that's fine for demos, etc.
The missing piece required is a Kubernetes Operator (written in a proper language, like Python, C# or Go) that can handle production edge cases like the one you're facing:
https://kubernetes.io/docs/concepts/extend-kubernetes/operator
@dsfrederic Does PR https://github.com/Azure/Azure-Data-Factory-Integration-Runtime-in-Windows-Container/pull/12 and https://github.com/Azure/Azure-Data-Factory-Integration-Runtime-in-Windows-Container/pull/13 fix your issue?
@byran77 and @xumou-ms is that offline-node auto-deletion documented in the ADF docs or in the ADF tooling itself?
@dsfrederic Hi, it seems that high availability is disabled and the old node is leaked. Please set the environment variables `ENABLE_HA=true` and `ENABLE_AE=true` when starting the Docker container with the latest image. With these flags, RemoteAccess is enabled, and a new registration will remove expired nodes automatically. You can also manually clear offline nodes that have not expired yet. Hope it helps.
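For reference, a sketch of the `docker run` invocation with these flags. The auth key is a placeholder, and `AUTH_KEY`/`HA_PORT` are assumed from the repo's usual environment variables, so double-check the names against the README:

```shell
# Hypothetical invocation; substitute your own IR authentication key.
docker run -d \
  -e AUTH_KEY="<your-ir-auth-key>" \
  -e ENABLE_HA=true \
  -e ENABLE_AE=true \
  -e HA_PORT=8060 \
  mcr.microsoft.com/azure-data-factory/integration-runtime:latest
```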
@jikuja We are working on doc update now, thanks!
Hello guys, I think I have a similar problem. I have deployed the runtime into a Windows container (in Azure App Service) and I have enabled `ENABLE_HA=true` and `ENABLE_AE=true`, but when the App Service restarts, a new node is created while the old node is not removed automatically from the self-hosted integration runtime. Is there anything I am missing? Thank you in advance.
Works for me with ACI: https://github.com/jikuja/azure-data-factory-runtime-app-service/tree/aci, so the underlying mechanism should work.
At least for me, if I restart the App Service, a new node spins up in the integration runtime, but the old node becomes unavailable and `ENABLE_AE` is not cleaning up the unavailable (old) node. Maybe I am missing something?
I'm having the same issue as @radulaurentiu02
Hi @radulaurentiu02 and @sergiupoliec, did you specify `AE_TIME`? By default, the value of `AE_TIME` is `600`, which means the old node will be removed after 10 minutes (600 seconds).
@radulaurentiu02 @sergiupoliec
Hi, currently the old nodes will be removed automatically only when the offline duration has exceeded `AE_TIME` (default and minimum: 10 minutes) and a new node is registered. In App Service, this works when the app has been stopped for `AE_TIME` and then starts again. Manual cleanup is still needed when restarting the app directly.
We are working on a better experience in app service now. Thanks!
Hi @radulaurentiu02 and @sergiupoliec,
Nodes that have been offline for longer than `AE_TIME` are treated as expired nodes and will be automatically removed. Thanks for your feedback! Let us know if you have any questions.
@Zengqwei Does it remove only nodes with identical name or any expired node?
@jikuja It will remove all expired nodes. In fact, identical names do not matter here, because the IR adds a different suffix to the name when the node is registered successfully.
Hi @radulaurentiu02 @sergiupoliec @jikuja Now the offline nodes will be removed automatically after expiration timeout duration. Please let me know if there is any other issue. Thanks!
Hi,
When I'm restarting the docker container I'm getting the following error
The other node it's referring to is my previous registration of the same node. After deleting the old node (which has the same name, etc.), registration succeeds again.
Is there a workaround for this? It seems that a cleanup before registering again should do the trick.
Context: I'm running this Windows container in an AKS Edge Essentials environment. Although this is a new and experimental setup, this issue doesn't seem to be related to it.
Update: I'm not able to run the -EnableRemoteAccess command because the old node is already gone.