bitsofinfo / hazelcast-docker-swarm-discovery-spi

Docker Swarm based discovery strategy SPI for Hazelcast enabled applications
Apache License 2.0

dnsrr discovery method does not work when "healthcheck" used #47

Open mnoky opened 5 years ago

mnoky commented 5 years ago

Great project, I'm excited to get this working for a service I have deployed in a docker cluster! Currently testing 1.0-RC14 and I've hit the following snag:

The dnsrr discovery method does not work when a docker "healthcheck" is used. The reason: during startup, the service name cannot be resolved, because the name is not published until the healthcheck succeeds and the service is up and running. Thus it is a bit of a chicken-and-egg problem. The following exception is thrown at startup and the service cannot start (only relevant lines shown):

Caused by: org.springframework.beans.BeanInstantiationException: Failed to instantiate [com.hazelcast.core.HazelcastInstance]: Factory method 'hazelcastInstance' threw exception; nested exception is com.hazelcast.config.ConfigurationException: Cannot create a new instance of MemberAddressProvider 'class org.bitsofinfo.hazelcast.spi.docker.swarm.dnsrr.DockerDNSRRMemberAddressProvider'
...
Caused by: com.hazelcast.config.ConfigurationException: Cannot create a new instance of MemberAddressProvider 'class org.bitsofinfo.hazelcast.spi.docker.swarm.dnsrr.DockerDNSRRMemberAddressProvider'
...
    at com.hazelcast.instance.DefaultNodeContext.newMemberAddressProviderInstance(DefaultNodeContext.java:94)
    ... 63 more
    Caused by: java.net.UnknownHostException: my_service: Name or service not known
...
    at org.bitsofinfo.hazelcast.spi.docker.swarm.dnsrr.DockerDNSRRMemberAddressProvider.resolveServiceName(DockerDNSRRMemberAddressProvider.java:130)

When I disable the healthcheck for my service, the dns resolution works right away and there are no problems.

Is it possible to delay the dns lookup in DockerDNSRRMemberAddressProvider? Or does it need to be available right away?

mnoky commented 5 years ago

Looks like others have encountered this problem as well:

https://github.com/moby/moby/issues/35451

bitsofinfo commented 5 years ago

I don't think there is a way to do this out of the box. I think @Cardds would have to add an option for some kind of artificial sleep for such a thing, but I'm not sure even that would be reliable. @Cardds ?
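For illustration, one way to approximate such a delay outside the SPI is an entrypoint wrapper that polls DNS for the service name before launching the JVM. This is only a sketch, not a feature of the SPI; the function name, retry count, and sleep interval are all hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: poll DNS until a name resolves, or give up.
# The attempt limit and 2s interval are illustrative values only.
wait_for_dns() {
    name="$1"
    attempts="${2:-30}"
    while [ "$attempts" -gt 0 ]; do
        if getent hosts "$name" >/dev/null 2>&1; then
            return 0                      # name is resolvable; safe to start
        fi
        attempts=$((attempts - 1))
        [ "$attempts" -gt 0 ] && sleep 2  # wait before the next lookup
    done
    return 1                              # gave up; name never appeared
}

# An entrypoint could then do something like:
#   wait_for_dns "my_service" && exec java -jar app.jar
```

Even with a wrapper like this, the chicken-and-egg problem remains if the healthcheck never passes, which is presumably why a built-in sleep might not be reliable either.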

mnoky commented 5 years ago

I haven't yet tried the DockerSwarmDiscoveryStrategy + SwarmMemberAddressProvider solution. I'm guessing the use of healthcheck will also be problematic here... Do you know offhand if this would be the case?

bitsofinfo commented 5 years ago

That method uses the actual swarm APIs to discover peers, so it's not reliant on the auto-generated swarm peer-level host/DNS names the way the DockerDNSRRMemberAddressProvider method is. So it should work.

bitsofinfo commented 5 years ago

btw @mnoky, I highly doubt that moby issue will ever be resolved. They've seemingly put swarm into minimal maintenance mode at this point.

vinsgithub commented 2 years ago

Hi @bitsofinfo @mnoky, I've found a workaround for this in my scenario (it doesn't necessarily cover all cases) and I hope it can help someone.

A small note about swarm: it is not dead and is still maintained, and in part it has even been evolved by Mirantis, because lots of companies are still using it. Many things have changed since 2019, sure, but swarm is still out there for those who don't need Kubernetes and cloud services in general.

To overcome the initialization problem in my Spring Boot (JHipster) microservice using your awesome hazelcast-docker-swarm solution, I've set this in my docker-compose:

```yaml
healthcheck:
  test: (echo 'exit' | curl -v telnet://localhost:8082 2>&1 | grep -c refused > /dev/null) || (curl -sS http://localhost:8082/management/health | grep -c UP > /dev/null)
  interval: 5s
  timeout: 30s
  retries: 4
  start_period: 3s # must be less than JHIPSTER_SLEEP
```

The rationale behind this is that the application needs to resolve the docker service name during startup, but swarm does not publish the name until the healthcheck itself passes. So we first let the healthcheck pass while the local service port (8082) is still refusing connections (i.e. the application is starting); as soon as the local port responds, the healthcheck tests the real application health endpoint. It's not ideal, but it's a good compromise.
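The compound `test:` one-liner can be unpacked into an equivalent, more readable form (a restatement only; the port and endpoint path are taken from the compose snippet, the function name is hypothetical, and this assumes a curl build with telnet protocol support):

```shell
#!/bin/sh
# Unpacked equivalent of the one-line healthcheck test above.
# Phase 1: if the port refuses connections, the app is still booting,
#          so report healthy to let swarm publish the DNS name.
# Phase 2: once the port answers, require the real health endpoint.
two_phase_check() {
    port="$1"
    if echo 'exit' | curl -v "telnet://localhost:${port}" 2>&1 | grep -q refused; then
        return 0    # connection refused => still starting => pass for now
    fi
    curl -sS "http://localhost:${port}/management/health" | grep -q UP
}
```

One consequence of this scheme is that the service is briefly reported healthy before the application is actually up, so anything relying on the health status during that window should tolerate it.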