Open bnouwt opened 5 days ago
I think I have a workaround: use a Docker healthcheck together with `depends_on` so that the knowledge mappers start sequentially instead of all at once:
```yaml
services:
  kd:
    image: ghcr.io/tno/knowledge-engine/knowledge-directory:1.2.3

  service1-km:
    build: ./service1-km
    environment:
      - SERVICE1_CLIENT_ID
      - SERVICE1_CLIENT_SECRET
      - SERVICE1_REFRESH_TOKEN
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://service1-sc:8280/rest/sc"]
      interval: 10s
      timeout: 5s
      retries: 5

  service1-sc:
    image: ghcr.io/tno/knowledge-engine/smart-connector:1.2.4
    environment:
      HOSTNAME: service1-sc
      PORT: 8081 # The (knowledge engine internal) port that is used for inter-runtime communication
      KE_RUNTIME_EXPOSED_URL: http://service1-sc:8081/
      KE_RUNTIME_PORT: 8081
      KD_URL: http://kd:8282/
    restart: always

  service2-km:
    build: ./service2-km
    environment:
      - TS_EMAIL
      - TS_PASSWORD
      - INTERFACE_ID
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://service2-sc:8280/rest/sc"]
      interval: 10s
      timeout: 5s
      retries: 5
    depends_on:
      service1-km:
        condition: service_healthy

  service2-sc:
    image: ghcr.io/tno/knowledge-engine/smart-connector:1.2.4
    environment:
      HOSTNAME: service2-sc
      PORT: 8081 # The (knowledge engine internal) port that is used for inter-runtime communication
      KE_RUNTIME_EXPOSED_URL: http://service2-sc:8081/
      KE_RUNTIME_PORT: 8081
      KD_URL: http://kd:8282/
    restart: always
    depends_on:
      service1-km:
        condition: service_healthy

  service3-km:
    build: ./service3-km
    environment:
      - SERVICE3_ACCESS_TOKEN
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://service3-sc:8280/rest/sc"]
      interval: 10s
      timeout: 5s
      retries: 5
    depends_on:
      service2-km:
        condition: service_healthy

  service3-sc:
    image: ghcr.io/tno/knowledge-engine/smart-connector:1.2.4
    environment:
      HOSTNAME: service3-sc
      PORT: 8081 # The (knowledge engine internal) port that is used for inter-runtime communication
      KE_RUNTIME_EXPOSED_URL: http://service3-sc:8081/
      KE_RUNTIME_PORT: 8081
      KD_URL: http://kd:8282/
    restart: always
    depends_on:
      service2-km:
        condition: service_healthy

  service4-km:
    build: ./service4-km
    environment:
      - SERVICE4_CLIENT_ID
      - SERVICE4_SECRET
      - SERVICE4_REFRESH_TOKEN
      - SERVICE4_SUBSCRIPTION_KEY
    restart: always
    depends_on:
      service3-km:
        condition: service_healthy
    healthcheck:
      # probe this service's own smart connector
      test: ["CMD", "curl", "-f", "http://service4-sc:8280/rest/sc"]
      interval: 10s
      timeout: 5s
      retries: 5

  service4-sc:
    image: ghcr.io/tno/knowledge-engine/smart-connector:1.2.4
    environment:
      HOSTNAME: service4-sc
      PORT: 8081 # The (knowledge engine internal) port that is used for inter-runtime communication
      KE_RUNTIME_EXPOSED_URL: http://service4-sc:8081/
      KE_RUNTIME_PORT: 8081
      KD_URL: http://kd:8282/
    restart: always
    depends_on:
      service3-km:
        condition: service_healthy
```
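The healthchecks above only probe the smart connector's REST endpoint with `curl -f`. If you want to run the same check outside the container (for debugging, or from a mapper that has no `curl`), an equivalent standalone probe could look like the sketch below. The host names and ports are assumptions taken from the compose file; only the `/rest/sc` path comes from the healthchecks themselves.

```python
import urllib.request
import urllib.error

def sc_is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True iff the smart connector answers on /rest/sc,
    mirroring the `curl -f` healthcheck in the compose file."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/rest/sc",
                                    timeout=timeout) as resp:
            # curl -f fails on any non-2xx status, so we do the same.
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error -> unhealthy.
        return False

# Inside the compose network the SC would be reachable at e.g.
# http://service1-sc:8280; here it is just an illustrative URL.
print(sc_is_healthy("http://service1-sc:8280"))
```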
In the TDI-500 docker compose project, a race condition seemed to occur: three knowledge mappers started up at the same time and registered their smart connectors (in separate KERs) simultaneously. This sometimes caused them to miss each other's registration, and the situation did not fix itself automatically after some time.
We think this is caused by a timing issue. At startup, SC A asks which other SCs are already in the network and gets no response from SC B, because SC B is not yet fully started. By itself this should not be a problem, because every SC is supposed to notify all others of its existence via a Post KI. However, when SC B posts that notification, SC A is not yet ready to receive messages, so SC A misses that path to registration as well. The result is that SC A never learns that SC B exists.
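The suspected interleaving can be sketched as a small, deliberately simplified model. The class names and mechanics here are illustrative assumptions, not the actual smart-connector implementation; the point is only that when both discovery paths (directory query and Post-KI announcement) fire while the peer is not yet ready, the registration is lost for good.

```python
# Hypothetical, simplified model of the suspected SC startup race.
class SmartConnector:
    def __init__(self, name):
        self.name = name
        self.ready = False   # an SC that is not ready can neither answer nor receive
        self.peers = set()

    def query_directory(self, directory):
        # Startup step 1: ask which other SCs already exist.
        # SCs that have not finished starting cannot answer, so they are missed.
        self.peers |= {sc.name for sc in directory if sc is not self and sc.ready}

    def announce(self, directory):
        # Startup step 2: notify all others of our existence (the Post KI).
        # An SC that is not ready yet silently loses the message.
        for sc in directory:
            if sc is not self and sc.ready:
                sc.peers.add(self.name)

a = SmartConnector("A")
b = SmartConnector("B")
directory = [a, b]

a.query_directory(directory)  # B not ready yet -> A learns nothing
b.query_directory(directory)  # A not ready yet -> B learns nothing
b.ready = True
b.announce(directory)         # A still not ready -> announcement lost
a.ready = True
a.announce(directory)         # B is ready -> B learns about A

print(a.peers, b.peers)       # A's peer table is permanently missing B
```

The sequential healthcheck workaround above avoids exactly this interleaving: each SC's query/announce pair only runs once the previous one is fully up.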
There have been issues with this before that were partly fixed, but apparently the fix is not complete. @kadevgraaf-tno can attach a workaround to this issue until we have fixed the underlying SC startup problem.