Azure Data Factory Integration Runtime in Windows Container Sample

SHIR unable to reconnect to Synapse after forced restart #20

Closed Mats-Elfving closed 9 months ago

Mats-Elfving commented 9 months ago

Problem: when the container terminates non-gracefully, it is unable to connect to Synapse on restart.

Background: I have set up the container as suggested, with the alterations below:

  1. Denodo ODBC
  2. Oracle Instant Client Basic Lite
  3. VS 2017 redistributable (required for Oracle Instant Client / ODBC)
  4. Oracle ODBC
  5. Oracle Instant Client sqlplus tool
  6. Generated a few DSNs from above
  7. Added a server certificate

It is running fine on an AKS cluster, and the deployment is controlled using Argo CD. When I (or the AKS backend) terminate/delete the pod, it is restarted by Argo CD. This ends in the error "Registration of new node is forbidden when Remote Access is disabled on another node. To enable it, you can login the machine where the other node is installed and run 'dmgcmd.exe -EnableRemoteAccess "" [""]'."

As a workaround, I can delete the node in Synapse and then restart the pod; with these steps it is able to reconnect. How can I get around this problem?

jikuja commented 9 months ago

Which environment variables do you set when starting the container?

Mats-Elfving commented 9 months ago

Hi, this is an extract from the deployment YAML:

Container environment variables:

```yaml
    env:
    - name: AUTH_KEY
      valueFrom:
        secretKeyRef:
          key: SECRET_AUTH_KEY_1
          name: app-secrets
    - name: NODE_NAME # name as registered in Synapse
      value: "t-nx-mirage-synapse-shir"
    - name: ENABLE_HA # The flag to enable high availability and scalability. It supports up to 4 nodes registered to the same IR when HA is enabled, otherwise only 1 is allowed.
      value: "false"  # default is false
    - name: HA_PORT   # The port to set up a high availability cluster.
      value: "8060"   # default is 8060
    - name: ENABLE_AE # The flag to enable offline nodes auto-expiration. If enabled, the expired nodes will be removed automatically from the IR when a new node is attempting to register.
                      # Only works when ENABLE_HA=true.
      value: "false"  # default is false
    - name: AE_TIME   # The expiration timeout duration for offline nodes in seconds. Should be no less than 600 (10 minutes).
      value: "600"    # least time = 10 minutes = 600 seconds
```

byran77 commented 9 months ago

> Hi, this is an extract from the deployment YAML: […]

Hi @Mats-Elfving, when the container crashed, the former node was not detached automatically, which caused this problem. Please set both flags, ENABLE_HA and ENABLE_AE, to true. This lets your integration runtime accept the restarted container's registration and automatically remove the former node after the duration specified by AE_TIME (default is 10 minutes).
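In other words, relative to the YAML above, only two values need to change. A minimal sketch of just the affected entries:

```yaml
    - name: ENABLE_HA
      value: "true"   # was "false"; allows up to 4 nodes on the same IR
    - name: ENABLE_AE # only takes effect when ENABLE_HA=true
      value: "true"   # was "false"; expired offline nodes are auto-removed
```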

Mats-Elfving commented 9 months ago

Reading the comment on that same flag in the YAML, I realize you are of course right. Would this (HA_PORT) also allow me to deploy two (or more) pods (with different node names)? How is HA_PORT used? Is it an inter-pod communication port, a port that must be open to traffic from outside the cluster, or one used only within the service?

byran77 commented 9 months ago

> Would this (HA_PORT) also allow me to deploy two (or more) pods (with different node names)? How is HA_PORT used? Is it an inter-pod communication port, a port that must be open to traffic from outside the cluster, or one used only within the service?

Sure, HA_PORT was originally designed for multi-node deployment. It is used only among the nodes in the same cluster; in your scenario, it is an inter-pod port.
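Since HA_PORT is only used between the SHIR pods themselves, nothing needs to be exposed outside the cluster. A minimal sketch of a headless Service that lets the pods reach each other on that port, assuming the pods carry a hypothetical app: synapse-shir label (the Service name and selector are illustrative, not from this thread):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: shir-ha                # hypothetical name
spec:
  clusterIP: None              # headless: used for pod-to-pod discovery only
  selector:
    app: synapse-shir          # assumed pod label
  ports:
  - name: ha
    port: 8060                 # matches HA_PORT
    targetPort: 8060
```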

Mats-Elfving commented 9 months ago

Thank you for the help setting this up. I ended up using the settings below. Deriving NODE_NAME from the pod name ensures a name that is both relevant and unique at any given time.

Container environment variables:

```yaml
    env:
    - name: AUTH_KEY
      valueFrom:
        secretKeyRef:
          key: SECRET_AUTH_KEY_1
          name: app-secrets
    - name: NODE_NAME # name as registered in Synapse
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: ENABLE_HA # The flag to enable high availability and scalability. It supports up to 4 nodes registered to the same IR when HA is enabled, otherwise only 1 is allowed.
      value: "true"   # default is false
    - name: HA_PORT   # The port to set up a high availability cluster.
      value: "8060"   # default is 8060
    - name: ENABLE_AE # The flag to enable offline nodes auto-expiration. If enabled, the expired nodes will be removed automatically from the IR when a new node is attempting to register.
                      # Only works when ENABLE_HA=true.
      value: "true"   # default is false
    - name: AE_TIME   # The expiration timeout duration for offline nodes in seconds. Should be no less than 600 (10 minutes).
      value: "600"    # least time = 10 minutes = 600 seconds
    - name: ORACLE_HOME
      value: "C:\\Oracle\\instantclient_21_11"
    - name: TNS_ADMIN
      value: "C:\\Oracle\\instantclient_21_11\\network\\admin"
```
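For context, a hedged sketch of how the surrounding Deployment might look when scaling this out. The replica count, labels, image placeholder, and container port are illustrative assumptions, not taken from this thread; the env list is the one shown above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: synapse-shir           # hypothetical name
spec:
  replicas: 2                  # up to 4 nodes can register when ENABLE_HA=true
  selector:
    matchLabels:
      app: synapse-shir
  template:
    metadata:
      labels:
        app: synapse-shir
    spec:
      containers:
      - name: shir
        image: <your-shir-image>  # placeholder: image built from this repo's Dockerfile
        ports:
        - containerPort: 8060     # HA_PORT, inter-pod traffic only
        env: []                   # replace with the env list shown above
```

Because NODE_NAME comes from metadata.name via fieldRef, each replica registers under its own pod name, which keeps node names unique across restarts and scale-out.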

Mats-Elfving commented 9 months ago

Closing this! :)