bitsofinfo / hazelcast-docker-swarm-discovery-spi

Docker Swarm based discovery strategy SPI for Hazelcast enabled applications
Apache License 2.0

NullPointerException DelegatingAddressPicker.java:69 [hazelcast-3.11.1.jar:3.11.1] #25

Closed robinroos closed 5 years ago

robinroos commented 5 years ago

Hello,

Feb 13 10:16:37 CSINTFINSDCKS01 data-service[960]: 2019-02-13 10:16:37.844 INFO [ips-data-service,,,] 1 --- [ main] s.d.s.d.DockerDNSRRMemberAddressProvider : Resolved domain name 'data-service' to address(es): [data-service/10.0.5.112]
Feb 13 10:16:37 CSINTFINSDCKS01 data-service[960]: 2019-02-13 10:16:37.859 ERROR [ips-data-service,,,] 1 --- [ main] com.hazelcast.instance.AddressPicker : [LOCAL] [test-ips-data-service-session] [3.11.1] null
Feb 13 10:16:37 CSINTFINSDCKS01 data-service[960]:
Feb 13 10:16:37 CSINTFINSDCKS01 data-service[960]: java.lang.NullPointerException: null
Feb 13 10:16:37 CSINTFINSDCKS01 data-service[960]: #011at com.hazelcast.instance.DelegatingAddressPicker.validatePublicAddress(DelegatingAddressPicker.java:69) ~[hazelcast-3.11.1.jar:3.11.1]

I just deployed DNSRR for the first time in our Docker Swarm. Well, Jenkins deployed it for me....

I have seen the above snippet in the log, which shows the embedded Hazelcast cache failing to bootstrap. The log shows that serviceName=data-service has been "resolved" to a set of just one address, 10.0.5.112. Binding then fails, probably because 10.0.5.112 is not amongst the set of networkInterfaceAddresses identified by DockerDNSRRMemberAddressProvider.

I know this represents very little information to go on, but any ideas would be greatly appreciated.

Of note, I am not passing serviceName, servicePort or peerServicesCsv as variables but have hard coded them as literals in hazelcast.xml:

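`<property name="serviceName">data-service</property>
<property name="servicePort">5701</property>
<property name="peerServicesCsv">data-service:5701</property>`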

I will attach the full hazelcast.xml, which is very similar to the minimalist DNSRR example suggested by this project.

Thanks, Robin.

robinroos commented 5 years ago

hazelcast.xml:

<?xml version="1.0" encoding="UTF-8"?>

<hazelcast xmlns="http://www.hazelcast.com/schema/config"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.hazelcast.com/schema/config
           http://www.hazelcast.com/schema/config/hazelcast-config-3.11.xsd">

    <group>
        <name>${cas.env}-ips-data-service-session</name>
        <password>ignored</password>
    </group>
    <management-center enabled="true">http://springboot-hazelcast:8080/hazelcast-mancenter</management-center>
    <network>
        <!--
        Auto-increment is turned off for the port; docker containers will
        always be available at the available in-network ports.
        -->
        <port auto-increment="false">5701</port>
        <member-address-provider enabled="true">
            <class-name>org.bitsofinfo.hazelcast.spi.docker.swarm.dnsrr.DockerDNSRRMemberAddressProvider</class-name>
            <properties>
                <!-- Name of the docker service that this instance is running in -->
                <property name="serviceName">data-service</property>

                <!-- Internal port that hazelcast is listening on -->
                <property name="servicePort">5701</property>
            </properties>
        </member-address-provider>
        <outbound-ports>
            <!--
            Allowed port range when connecting to other nodes.
            0 or * means use system provided port.
            -->
            <ports>0</ports>
        </outbound-ports>
        <join>
            <multicast enabled="false"/>
            <tcp-ip enabled="false"/>
            <aws enabled="false"/>
            <gcp enabled="false"/>
            <azure enabled="false"/>
            <kubernetes enabled="false"/>
            <eureka enabled="false"/>
            <discovery-strategies>
                <discovery-strategy
                        enabled="true"
                        class="org.bitsofinfo.hazelcast.spi.docker.swarm.dnsrr.discovery.DockerDNSRRDiscoveryStrategy"
                >
                    <properties>
                        <!--
                            Comma separated list of docker services and associated ports
                            to be considered peers of this service.
                            Note, this must include itself (the definition of
                            serviceName and servicePort) if the service is to
                            cluster with other instances of this service.
                        -->
                        <property name="peerServicesCsv">data-service:5701</property>
                    </properties>
                </discovery-strategy>
            </discovery-strategies>
        </join>
    </network>
</hazelcast>
robinroos commented 5 years ago

data-service definition:

data-service:
    image: "…/ips-data-service:9"
    deploy:
      mode: replicated
      replicas: 2
      restart_policy:
        condition: on-failure
        delay: 30s
        max_attempts: 5
        window: 300s
      update_config:
        parallelism: 2
        delay: 15s
        order: stop-first
        monitor: 20s
        failure_action: rollback
      rollback_config:
        parallelism: 2
        delay: 15s
        order: stop-first
        monitor: 20s
        failure_action: pause
      placement:
        constraints:
          - node.labels.secure==true
    networks:
      - ips_network
      - spring_svc_network
    environment:
      - SPRING_PROFILES_ACTIVE=test
    logging:
      driver: syslog
      options:
        tag: data-service
robinroos commented 5 years ago

It seems that the following condition, in DockerDNSRRMemberAddressProvider (104-106), is not being met:

                        if(
                            potentialInetAddresses.contains(address)
                        ) 

leaving the constructed instance as still having bindAddress = null.

bitsofinfo commented 5 years ago

Posting the full stack trace would be helpful.

@Cardds can you assist please?

robinroos commented 5 years ago

I see the logs only through Kibana, which I first used today. Attached is one of today's logs, which includes the full stack trace. The Spring logo is well represented, which gives me faith that the individual log lines are correctly sequenced, so hopefully the stack trace is intact.

DNSRR_NPE_Log_20190213_122930.txt

Cardds commented 5 years ago

Seems like a prime opportunity to introduce a logging statement. See #26.

I can't say off the top of my head exactly why it would pick up an IP address for the service that wasn't in the network interfaces, at least not without some other configuration going on.

Below is Java source that can be used to print out IPs picked up from the network interfaces, which can be used for comparison if the above PR isn't used.

import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Enumeration;

public class PrintNetworkDetails {

    static final String NAME = "data-service";

    public static void main(String[] args) {
        try {
            InetAddress address;
            Enumeration<InetAddress> networkInterfaceAddresses;
            Enumeration<NetworkInterface> networkInterfaces = NetworkInterface.getNetworkInterfaces();

            while(networkInterfaces.hasMoreElements()) {
                networkInterfaceAddresses =
                    networkInterfaces.nextElement().getInetAddresses();

                while(networkInterfaceAddresses.hasMoreElements()) {
                    address = networkInterfaceAddresses.nextElement();
                    System.out.println(address.toString());
                }
            }
        } catch (SocketException e) {
            e.printStackTrace();
        }
    }

}
robinroos commented 5 years ago

Here is the Diagnostic log. I put the printing of address information into the top of the SpringBootApplication class so that it appears in the log just before the "Spring" logo.

I'm happy to add further diagnostics as required. DNSRR_NPE_Diagnostic_Log_20190214_095156.txt

robinroos commented 5 years ago

Relevant portions of the Diagnostic log:

Feb 14 09:51:47 CSINTFINSDCKS01 data-service[952]: start printNetworkDetails()
Feb 14 09:51:47 CSINTFINSDCKS01 data-service[952]: /172.18.0.15
Feb 14 09:51:47 CSINTFINSDCKS01 data-service[952]: /10.0.4.30
Feb 14 09:51:47 CSINTFINSDCKS01 data-service[952]: /10.0.5.24
Feb 14 09:51:47 CSINTFINSDCKS01 data-service[952]: /127.0.0.1
Feb 14 09:51:47 CSINTFINSDCKS01 data-service[952]: end printNetworkDetails()

Feb 14 09:52:13 CSINTFINSDCKS01 data-service[952]: 2019-02-14 09:52:13.681 INFO [ips-data-service,,,] 1 --- [ main] s.d.s.d.DockerDNSRRMemberAddressProvider : Resolved domain name 'data-service' to address(es): [data-service/10.0.5.112]

Feb 14 09:52:13 CSINTFINSDCKS01 data-service[952]: 2019-02-14 09:52:13.697 ERROR [ips-data-service,,,] 1 --- [      main] com.hazelcast.instance.AddressPicker   : [LOCAL] [test-ips-data-service-session] [3.11.1] null
Feb 14 09:52:13 CSINTFINSDCKS01 data-service[952]: 
Feb 14 09:52:13 CSINTFINSDCKS01 data-service[952]: java.lang.NullPointerException: null
Feb 14 09:52:13 CSINTFINSDCKS01 data-service[952]: #011at com.hazelcast.instance.DelegatingAddressPicker.validatePublicAddress(DelegatingAddressPicker.java:69) ~[hazelcast-3.11.1.jar:3.11.1]
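
The mismatch visible in these log lines can be reproduced offline. What follows is a minimal sketch, not the project's code: the class `BindAddressCheck` and its methods are hypothetical names, the addresses are copied from the diagnostic log above, and it assumes the provider keeps the first resolved address that is also bound to a local network interface.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of hazelcast-docker-swarm-discovery-spi.
public class BindAddressCheck {

    // Returns the first resolved address that is also a local interface address, if any.
    public static Optional<InetAddress> firstLocalMatch(List<InetAddress> resolved,
                                                        List<InetAddress> local) {
        for (InetAddress candidate : resolved) {
            if (local.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    // Parses IP literals; InetAddress.getByName performs no DNS lookup for literals.
    public static List<InetAddress> parse(String... literals) {
        List<InetAddress> out = new ArrayList<>();
        for (String literal : literals) {
            try {
                out.add(InetAddress.getByName(literal));
            } catch (UnknownHostException e) {
                throw new IllegalArgumentException("Not an IP literal: " + literal, e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Interface addresses printed by printNetworkDetails() in the log above.
        List<InetAddress> local = parse("172.18.0.15", "10.0.4.30", "10.0.5.24", "127.0.0.1");
        // Address that 'data-service' resolved to in the same log.
        List<InetAddress> resolved = parse("10.0.5.112");
        System.out.println("match found: " + firstLocalMatch(resolved, local).isPresent());
    }
}
```

With the logged values no match is found, which is consistent with bindAddress remaining null.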
robinroos commented 5 years ago

I am logging via the static method entry-point PrintNetworkDetails.printNetworkDetails():

package uk.gov.insolvency.dataservices;

import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Enumeration;

public class PrintNetworkDetails {

    public static void main(String[] args) {
        printNetworkDetails();
    }

    public static void printNetworkDetails() {
        try {
            System.out.println("start printNetworkDetails()");
            InetAddress address;
            Enumeration<InetAddress> networkInterfaceAddresses;
            Enumeration<NetworkInterface> networkInterfaces = NetworkInterface.getNetworkInterfaces();

            while(networkInterfaces.hasMoreElements()) {
                networkInterfaceAddresses =
                        networkInterfaces.nextElement().getInetAddresses();

                while(networkInterfaceAddresses.hasMoreElements()) {
                    address = networkInterfaceAddresses.nextElement();
                    System.out.println(address.toString());
                }
            }
            System.out.println("end printNetworkDetails()");

        } catch (SocketException e) {
            e.printStackTrace();
        }
    }

}
robinroos commented 5 years ago

Sorry, accidentally pressed Close and Comment!

robinroos commented 5 years ago

What would be most useful to me at this stage would be a fixed version at jcenter() which prevented the NPE from happening (and logged around that). With such a patch I could leave the DNSRR discovery "enabled" with the services scaled to only 1 instance each whilst we continued to investigate the real problem.

Until then I will have to regress the deployment of DNSRR discovery.

bitsofinfo commented 5 years ago

@Cardds is the NPE that @robinroos is seeing fixable by your contribution, or is it occurring upstream from us?

robinroos commented 5 years ago

Would it be viable to default the bind address to 127.0.0.1 if there is no actual match?

robinroos commented 5 years ago

Clearly not. The bind address must be a public (non-local) address, as enforced by DelegatingAddressPicker.validatePublicAddress.

Perhaps the first entry in the list of “resolved” addresses for the service name should be the one that is used if there is no match, with logging around that. With such a patch in place I would be able to continue investigating precisely why that address is not returned from the network interface.

robinroos commented 5 years ago

Eureka! In explaining this to another person I think that I now understand the problem at least.

Service name "data-service" was Scaled at TWO (instances in the DockerSwarm) but was resolved to only ONE ip address. I believe that the resolve has identified the Load Balancer's ip address, behind which sit all (both) of the "data-service" instances. As such, the resolved ip would never be amongst the ip addresses "of the node on which this data-service instance resides".

Would that make sense?

If so, perhaps that gets us all closer to a solution....

bitsofinfo commented 5 years ago

https://bintray.com/bitsofinfo/maven/hazelcast-docker-swarm-discovery-spi/1.0-RC10 is available with the PR logging from @Cardds

Yes sounds like it might be picking up the swarm ingress VIP for the service?

Cardds commented 5 years ago

@robinroos To confirm, the endpoint-mode for the service on the associated docker network(s) is dnsrr? The DNSRR solution only works for services that use the dnsrr endpoint mode. This mode means the docker routing mesh is not used, so each docker container is assigned a unique IP address that the docker DNS resolves the service name to in a round-robin fashion.

e.g. resolving data-service with two instances would return two IP addresses from the DNS, instead of one.

The DNSRR method queries the DNS for the service name and so acquires a deterministic list of the addresses of its peers. The downside is that a dedicated load balancer (software or hardware) may be needed for the service.

SwarmMemberAddressProvider should be used if the docker routing mesh is enabled.
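
One way to check which mode is in effect is to count how many addresses the service name resolves to from inside a container. This is a minimal sketch under stated assumptions: it must run inside a task attached to the service's network, the default name `data-service` and replica count of 2 are taken from this thread, and `EndpointModeCheck` and `looksLikeVip` are hypothetical names. The comparison is only a heuristic, since the count can also be low while tasks are still starting.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic, not part of hazelcast-docker-swarm-discovery-spi.
public class EndpointModeCheck {

    // Heuristic: fewer resolved addresses than replicas hints at a single virtual IP.
    public static boolean looksLikeVip(int resolvedCount, int replicas) {
        return replicas > 1 && resolvedCount < replicas;
    }

    public static void main(String[] args) {
        // Defaults are assumptions taken from the service definition in this thread.
        String serviceName = args.length > 0 ? args[0] : "data-service";
        int replicas = args.length > 1 ? Integer.parseInt(args[1]) : 2;
        try {
            InetAddress[] resolved = InetAddress.getAllByName(serviceName);
            for (InetAddress address : resolved) {
                System.out.println(serviceName + " -> " + address.getHostAddress());
            }
            if (looksLikeVip(resolved.length, replicas)) {
                System.out.println("Fewer addresses than replicas: the name may resolve to a virtual IP.");
            }
        } catch (UnknownHostException e) {
            System.out.println("Could not resolve " + serviceName);
        }
    }
}
```

For the case reported here (two replicas, one resolved address) the heuristic would flag a likely virtual IP.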

robinroos commented 5 years ago

In this instance I do not believe it is the Ingress VIP, since we are talking about balancing requests which originate internally (an embedded Hz cluster), not externally.

“So when a Service is created, it get a virtual IP address right away, on the Service’s network. As we said before in the Service Discovery part, when a Service is requested the resulting DNS query is forwarded to the Docker Engine, which in turn returns the IP of the service, a virtual IP. Traffic sent to that virtual IP is load balanced to all of the healthy containers of that service on the network. All the load balancing is done by Docker, since only one entry-point is given to the client (one IP).” (https://blog.octo.com/en/how-does-it-work-docker-part-3-load-balancing-service-discovery-and-security/)

So, how do we ascertain that this is indeed the case - perhaps my sharing our network definitions would help?

And is DNSRR the correct cluster node discovery mode to use in this context (Docker Engine itself acting as load balancer over the clustered apps)?

robinroos commented 5 years ago

@Cardds, we both posted at the same time.

I was previously informed that DNSRR “was supported” across our Docker Swarm network(s), and so opted to deploy the DNSRR discovery mode.

I am now better informed, and I think I can ask more precise questions of our infrastructure people.

I will get back to you tomorrow.

Thanks, Robin.

robinroos commented 5 years ago

It might be worth further enhancing DockerDNSRRMemberAddressProvider such that, if none of the resolved addresses is amongst the non-local IPs of the network interface, a message is logged suggesting that DNSRR might not be the appropriate cluster node discovery mode for the extant network topology.
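
A rough sketch of what such a message could look like follows. It is an illustration only, not the project's implementation: `NoMatchWarning` and `describe` are hypothetical names, addresses are handled as plain strings, and the wording of the message is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the suggested diagnostic.
public class NoMatchWarning {

    // Returns a warning when no resolved address is a local one, otherwise an empty string.
    public static String describe(String serviceName, List<String> resolved, List<String> local) {
        for (String address : resolved) {
            if (local.contains(address)) {
                return "";
            }
        }
        return "None of the addresses resolved for '" + serviceName + "' " + resolved
                + " is bound to a local network interface " + local
                + "; DNSRR discovery may not suit this network topology.";
    }

    public static void main(String[] args) {
        // Values copied from the diagnostic log earlier in this thread.
        List<String> resolved = Arrays.asList("10.0.5.112");
        List<String> local = Arrays.asList("172.18.0.15", "10.0.4.30", "10.0.5.24");
        System.out.println(describe("data-service", resolved, local));
    }
}
```

In the provider itself the message would presumably go through the logger at warning level rather than to standard output.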

robinroos commented 5 years ago

Thank you so much for your assistance regarding my issue. I will tomorrow reconfigure based on SwarmMemberAddressProvider.

Kind regards, Robin.

bitsofinfo commented 5 years ago

Cool, please star the project if it's of use to you. Good luck.