TNO / knowledge-engine

Improves interoperability between systems (i.e. devices, platforms, apps, databases) by exchanging data based on their semantics
https://knowledge-engine.eu
Apache License 2.0

Adding two or more SCs in separate KERs should not cause them to sometimes fail to register each other. #551

Open bnouwt opened 5 days ago

bnouwt commented 5 days ago

In the TDI-500 docker compose project, a race condition seemed to occur: three knowledge mappers started up at the same time and registered their smart connectors (in separate KERs) simultaneously. This sometimes caused them to miss each other's registration, and the situation did not fix itself after some time.

We think this is caused by a timing issue: at startup, SC A asks which other SCs are already in the network and gets no response from SC B, because SC B has not fully started yet. This should not be a problem, because every SC should notify all others of its existence by using a Post KI. However, when SC B posts this notification, SC A is not yet ready to receive it, so SC A does not register SC B that way either. As a result, SC A will never know that SC B exists.
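The suspected sequence can be illustrated with a toy model. This is a minimal sketch that assumes an SC can only answer queries and receive notifications once it is fully started. The class and function names (`SmartConnector`, `start_simultaneously`, `start_sequentially`) are hypothetical and do not come from the Knowledge Engine code base.

```python
from dataclasses import dataclass, field


@dataclass
class SmartConnector:
    """Toy model of an SC; not the Knowledge Engine implementation."""
    name: str
    ready: bool = False  # True once the SC can answer and receive messages
    known: set = field(default_factory=set)

    def ask_who_is_there(self, network):
        # Startup query: only SCs that are already ready can answer.
        for other in network:
            if other is not self and other.ready:
                self.known.add(other.name)

    def announce(self, network):
        # Startup notification: only SCs that are already ready receive it.
        for other in network:
            if other is not self and other.ready:
                other.known.add(self.name)


def start_simultaneously(network):
    # All SCs query and announce before any of them is ready.
    for sc in network:
        sc.ask_who_is_there(network)
        sc.announce(network)
    for sc in network:
        sc.ready = True


def start_sequentially(network):
    # Each SC finishes starting before the next one begins.
    for sc in network:
        sc.ask_who_is_there(network)
        sc.announce(network)
        sc.ready = True
```

In this model, simultaneous startup leaves every `known` set empty, and nothing later repairs it. Sequential startup lets each SC discover all others, either through its own query or through a later announcement.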

There have been issues with this before that were partly fixed, but apparently the problem is not fully resolved. @kadevgraaf-tno can attach a workaround to this issue until we have fixed the underlying issue of SC startup.

kadevgraaf-tno commented 5 days ago

I think I have a workaround: use Docker healthcheck and depends_on to start the knowledge mappers sequentially:

services:
  kd:
    image: ghcr.io/tno/knowledge-engine/knowledge-directory:1.2.3

  service1-km:
    build: ./service1-km
    environment:
      - SERVICE1_CLIENT_ID
      - SERVICE1_CLIENT_SECRET
      - SERVICE1_REFRESH_TOKEN
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://service1-sc:8280/rest/sc"]
      interval: 10s
      timeout: 5s
      retries: 5

  service1-sc:
    image: ghcr.io/tno/knowledge-engine/smart-connector:1.2.4
    environment:
      HOSTNAME: service1-sc
      PORT: 8081 # The (knowledge engine internal) port that is used for inter-runtime communication
      KE_RUNTIME_EXPOSED_URL: http://service1-sc:8081/
      KE_RUNTIME_PORT: 8081
      KD_URL: http://kd:8282/
    restart: always          

  service2-km:
    build: ./service2-km
    environment:
      - TS_EMAIL
      - TS_PASSWORD
      - INTERFACE_ID
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://service2-sc:8280/rest/sc"]
      interval: 10s
      timeout: 5s
      retries: 5
    depends_on:
      service1-km:
        condition: service_healthy

  service2-sc:
    image: ghcr.io/tno/knowledge-engine/smart-connector:1.2.4
    environment:
      HOSTNAME: service2-sc
      PORT: 8081 # The (knowledge engine internal) port that is used for inter-runtime communication
      KE_RUNTIME_EXPOSED_URL: http://service2-sc:8081/
      KE_RUNTIME_PORT: 8081
      KD_URL: http://kd:8282/
    restart: always
    depends_on:
      service1-km:
        condition: service_healthy      

  service3-km:
    build: ./service3-km
    environment:
      - SERVICE3_ACCESS_TOKEN
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://service3-sc:8280/rest/sc"]
      interval: 10s
      timeout: 5s
      retries: 5
    depends_on:
      service2-km:
        condition: service_healthy

  service3-sc:
    image: ghcr.io/tno/knowledge-engine/smart-connector:1.2.4
    environment:
      HOSTNAME: service3-sc
      PORT: 8081 # The (knowledge engine internal) port that is used for inter-runtime communication
      KE_RUNTIME_EXPOSED_URL: http://service3-sc:8081/
      KE_RUNTIME_PORT: 8081
      KD_URL: http://kd:8282/
    restart: always
    depends_on:
      service2-km:
        condition: service_healthy      

  service4-km:
    build: ./service4-km
    environment:
      - SERVICE4_CLIENT_ID
      - SERVICE4_SECRET
      - SERVICE4_REFRESH_TOKEN
      - SERVICE4_SUBSCRIPTION_KEY
    restart: always
    depends_on:
      service3-km:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://service4-sc:8280/rest/sc"]
      interval: 10s
      timeout: 5s
      retries: 5

  service4-sc:
    image: ghcr.io/tno/knowledge-engine/smart-connector:1.2.4
    environment:
      HOSTNAME: service4-sc
      PORT: 8081 # The (knowledge engine internal) port that is used for inter-runtime communication
      KE_RUNTIME_EXPOSED_URL: http://service4-sc:8081/
      KE_RUNTIME_PORT: 8081
      KD_URL: http://kd:8282/
    restart: always
    depends_on:
      service3-km:
        condition: service_healthy
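
The healthcheck settings above (interval 10s, timeout 5s, retries 5) amount to a bounded polling loop. For setups that do not run under Docker Compose, the same idea could be approximated as below. This is a rough sketch, not Docker's exact healthcheck semantics. The helper names `http_ok` and `wait_until_healthy` are hypothetical, and the only endpoint assumed is the `/rest/sc` URL already used in the compose file.

```python
import http.client
import time
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if GET url answers with a 2xx status (roughly `curl -f`)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts and HTTP 4xx/5xx errors.
        return False


def wait_until_healthy(probe, retries=5, interval=10.0, sleep=time.sleep):
    """Call probe() up to `retries` times, pausing `interval` seconds between attempts."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            sleep(interval)
    return False
```

A knowledge mapper could then be started only after `wait_until_healthy(lambda: http_ok("http://service1-sc:8280/rest/sc"))` returns `True`. Like the compose workaround, this only serializes startup and does not address the underlying SC registration issue.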