Azure / Azure-Data-Factory-Integration-Runtime-in-Windows-Container

Azure Data Factory Integration Runtime in Windows Container Sample
MIT License

Multiple SHIR containers with cascading failures - 0x80010002 (RPC_E_CALL_CANCELED) #7

Open nickcva opened 1 year ago

nickcva commented 1 year ago

We are running several Windows SHIR containers on the same physical machines. All containers use the same network and the default Docker NAT switch. Once one container becomes unhealthy, the failure slowly cascades to the rest of the SHIR containers. We do not use a proxy, and there are no network issues between the on-premises hosts and the Azure ADF/Synapse instances.

Are there any issues with running multiple SHIR containers on the same host when each connects to a different Azure ADF/Synapse instance? We need to scale this out to hundreds of SHIR containers.

Host OS: Windows Server 2019 Standard, version 1809, build 17763.3406

The Dockerfile is the latest from this repository, with this addition:

RUN MD C:\Download

ADD https://github.com/adoptium/temurin8-binaries/releases/download/jdk8u345-b01/OpenJDK8U-jdk_x64_windows_hotspot_8u345b01.zip C:/Download

RUN MD "C:\Program Files\Eclipse Adoptium\jdk8u345-b01"

RUN tar -xf C:/Download/OpenJDK8U-jdk_x64_windows_hotspot_8u345b01.zip -C "C:\Program Files\Eclipse Adoptium"

RUN SETX PATH "%PATH%;C:\Program Files\Eclipse Adoptium\jdk8u345-b01\bin;C:\Program Files\Eclipse Adoptium\jdk8u345-b01\jre\bin\server" /m

RUN SETX JAVA_HOME "C:\Program Files\Eclipse Adoptium\jdk8u345-b01\" /m


The only Docker warning logged on the host server is: Health check for container 39fbbf4f690da051145d18f9d4df16b6666108c76dd39cf73d177179bf961f60 error: context deadline exceeded
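To see how far the health-check failures have spread on a host, one option is to ask Docker which containers it currently reports as unhealthy. The following is a minimal sketch, not part of the original report: it assumes the `docker` CLI is on the PATH and that the image defines a health check, and the function names `parse_ps_output` and `list_unhealthy` are hypothetical.

```python
import subprocess


def parse_ps_output(output):
    """Turn 'ID<TAB>NAME' lines into (id, name) tuples, skipping blank lines."""
    rows = []
    for line in output.splitlines():
        if line.strip():
            container_id, _, name = line.partition("\t")
            rows.append((container_id.strip(), name.strip()))
    return rows


def list_unhealthy(run=subprocess.run):
    """Return (id, name) for every container whose health status is 'unhealthy'.

    `run` is injectable so the parsing can be exercised without a Docker daemon.
    """
    result = run(
        ["docker", "ps", "--filter", "health=unhealthy",
         "--format", "{{.ID}}\t{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps_output(result.stdout)
```

Calling `list_unhealthy()` at intervals and recording the results would show the order in which containers degrade, which may help confirm whether the failure really cascades from one container to the others.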

This shows up on all the unhealthy containers:

`[09/22/2022 12:23:08] Registering SHIR node with the node key: redacted@ServiceEndpoint=usgovva.frontend.datamovement.azure.us@Vredacted

[09/22/2022 12:23:09] Registering SHIR node with the node name: redacted

[09/22/2022 12:23:09] Registering SHIR node with the enable high availability flag: true

[09/22/2022 12:23:09] Registering SHIR node with the tcp port: 8060

[09/22/2022 12:25:54] Start registering a new SHIR node

[09/22/2022 12:25:54] Enable High Availability

[09/22/2022 12:25:54] Remote Access Port: 8060

[09/22/2022 12:31:59] Waiting 60 seconds for connecting

Get-WmiObject : Call was canceled by the message filter. (Exception from

HRESULT: 0x80010002 (RPC_E_CALL_CANCELED))

At C:\SHIR\setup.ps1:17 char:22

[09/22/2022 12:34:02] diahost.exe is not running

Get-WmiObject : Call was canceled by the message filter. (Exception from

HRESULT: 0x80010002 (RPC_E_CALL_CANCELED))

At C:\SHIR\setup.ps1:17 char:22

[09/22/2022 12:36:06] diahost.exe is not running

Get-WmiObject : Call was canceled by the message filter. (Exception from

[09/22/2022 12:38:09] diahost.exe is not running

HRESULT: 0x80010002 (RPC_E_CALL_CANCELED))

At C:\SHIR\setup.ps1:17 char:22

Get-WmiObject : Call was canceled by the message filter. (Exception from

HRESULT: 0x80010002 (RPC_E_CALL_CANCELED))

At C:\SHIR\setup.ps1:17 char:22

[09/22/2022 12:40:11] diahost.exe is not running

Get-WmiObject : Call was canceled by the message filter. (Exception from

HRESULT: 0x80010002 (RPC_E_CALL_CANCELED))

At C:\SHIR\setup.ps1:17 char:22

[09/22/2022 12:42:12] diahost.exe is not running

Get-WmiObject : Call was canceled by the message filter. (Exception from

HRESULT: 0x80010002 (RPC_E_CALL_CANCELED))

At C:\SHIR\setup.ps1:17 char:22

[09/22/2022 12:44:12] diahost.exe is not running

+ CategoryInfo          : InvalidOperation: (:) [Get-WmiObject], COMException

+ FullyQualifiedErrorId : GetWMICOMException,Microsoft.PowerShell.Commands.GetWmiObjectCommand`
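When comparing many containers, it can help to reduce each container's log to a few numbers. The sketch below is an illustration under stated assumptions: it takes the log as plain text (for example, saved from `docker logs`), assumes the `[MM/DD/YYYY HH:MM:SS]` timestamp format shown above, and uses a hypothetical function name `summarize_shir_log`.

```python
import re
from datetime import datetime

# Matches the "[MM/DD/YYYY HH:MM:SS] message" lines seen in the log above.
STAMPED = re.compile(r"^\[(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\]\s*(.*)$")


def summarize_shir_log(text):
    """Count RPC_E_CALL_CANCELED errors and collect 'diahost.exe is not running' times."""
    rpc_cancelled = 0
    not_running = []
    for raw in text.splitlines():
        line = raw.strip()
        # The HRESULT appears on its own line in the Get-WmiObject error output.
        if "0x80010002" in line:
            rpc_cancelled += 1
        match = STAMPED.match(line)
        if match and "diahost.exe is not running" in match.group(2):
            not_running.append(
                datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
            )
    return {"rpc_cancelled": rpc_cancelled, "not_running": not_running}
```

Comparing the first `not_running` timestamp across containers gives a rough timeline of when each one stopped responding.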

nickcva commented 4 months ago

I found a workaround that allows mass deployment of SHIR containers. We are currently running about 80 SHIR containers.

Use "--isolation=hyperv" in your docker run command.

docker run -d --isolation=hyperv --restart unless-stopped --name="name" -e NODE_NAME="name" -e AUTH_KEY="key" -e ENABLE_HA=false -e HA_PORT=8060 -e ENABLE_AE=false -e AE_TIME=600 "someimage:latest"
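For a deployment of dozens of containers, the command above could be generated per node instead of typed by hand. The following is a rough sketch that only builds the argument lists and prints them. The node names and keys are placeholders, the function name `build_run_command` is hypothetical, and the flags and environment variables are copied from the command in this comment.

```python
def build_run_command(node_name, auth_key, image="someimage:latest", ha_port=8060):
    """Build the `docker run` argument list for one SHIR container.

    Mirrors the workaround command above, including --isolation=hyperv.
    """
    return [
        "docker", "run", "-d",
        "--isolation=hyperv",
        "--restart", "unless-stopped",
        f"--name={node_name}",
        "-e", f"NODE_NAME={node_name}",
        "-e", f"AUTH_KEY={auth_key}",
        "-e", "ENABLE_HA=false",
        "-e", f"HA_PORT={ha_port}",
        "-e", "ENABLE_AE=false",
        "-e", "AE_TIME=600",
        image,
    ]


if __name__ == "__main__":
    # Placeholder node names and keys; real keys come from each ADF/Synapse instance.
    nodes = {"shir-01": "key-1", "shir-02": "key-2"}
    for name, key in nodes.items():
        # Dry run: print the command. To launch it, pass the list to
        # subprocess.run(cmd, check=True) on a host with Docker installed.
        print(" ".join(build_run_command(name, key)))
```

Returning a list rather than a single string avoids shell quoting problems if a node name or key contains spaces or special characters.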