ClickHouse / clickhouse-java

ClickHouse Java Clients & JDBC Driver
https://clickhouse.com
Apache License 2.0
1.45k stars 534 forks source link

Health check not working properly #1384

Open wang-qijia opened 1 year ago

wang-qijia commented 1 year ago

Describe the bug

The memory limitation of the ClickHouse service on a single node affects insertion, and after restarting ClickHouse, the application service does not recover and keeps looking for an available node. After the health check, it should switch from the faulty node to the healthy node.

Steps to reproduce

  1. The application service runs continuously to insert data into ClickHouse.
  2. Restart the ClickHouse node.
  3. Observe whether the application service can insert data into ClickHouse.

Expected behaviour

After the clickhouse restarts, the application service should return to normal after observing for a period of time

Code example

        Properties properties = new Properties(config.getOptions());
        properties.setProperty(ClickHouseDefaults.USER.getKey(), config.getUser());
        properties.setProperty(ClickHouseDefaults.PASSWORD.getKey(), config.getPassword());

        properties.setProperty(ClickHouseClientOption.AUTO_DISCOVERY.getKey(),
            String.valueOf(config.getAutoDiscovery()));
        if (config.getAutoDiscovery()) {
            properties
                .setProperty(ClickHouseClientOption.NODE_DISCOVERY_INTERVAL.getKey(), "30000");
        }
        properties.setProperty(ClickHouseClientOption.LOAD_BALANCING_POLICY.getKey(), "roundRobin");
        properties.setProperty(ClickHouseClientOption.HEALTH_CHECK_INTERVAL.getKey(), "30000");
        properties.setProperty(ClickHouseClientOption.FAILOVER.getKey(), "3");
        properties.setProperty(ClickHouseClientOption.CHECK_ALL_NODES.getKey(), "1");
        properties.setProperty(ClickHouseClientOption.MAX_QUEUED_BUFFERS.getKey(), "0");
        properties.setProperty(ClickHouseClientOption.DECOMPRESS.getKey(), "1");
        try {
            return new ClickHouseDataSource(
                config.getEndpoint(), properties);
        } catch (SQLException e) {
            throw new RuntimeException("Failed to create Clickhouse data source", e);
        }
    }

Error log

Caused by: java.sql.SQLException: Failed to get single-node at com.clickhouse.jdbc.SqlExceptionUtils.clientError(SqlExceptionUtils.java:81) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.jdbc.internal.ClickHouseConnectionImpl.(ClickHouseConnectionImpl.java:258) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.jdbc.ClickHouseDataSource.getConnection(ClickHouseDataSource.java:46) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.jdbc.ClickHouseDataSource.getConnection(ClickHouseDataSource.java:16) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at org.springframework.jdbc.datasource.DataSourceUtils.fetchConnection(DataSourceUtils.java:158) ~[spring-jdbc-5.3.4.jar:5.3.4] at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:116) ~[spring-jdbc-5.3.4.jar:5.3.4] at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:79) ~[spring-jdbc-5.3.4.jar:5.3.4] ... 9 more Caused by: java.lang.IllegalArgumentException: ClickHouseNodes [checking=false, index=0, lock=r1w0, nodes=0, faulty=1, policy=RoundRobinPolicy, tags=[]]@305219537 does not contain suitable node for ClickHouseNodeSelector [protocols=[], tags=[]] at com.clickhouse.client.ClickHouseLoadBalancingPolicy$RoundRobinPolicy.get(ClickHouseLoadBalancingPolicy.java:94) ~[clickhouse-cli-client-0.3.2-patch11-shaded.jar:clickhouse-cli-client 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.client.ClickHouseNodes.apply(ClickHouseNodes.java:514) ~[clickhouse-cli-client-0.3.2-patch11-shaded.jar:clickhouse-cli-client 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.jdbc.internal.ClickHouseConnectionImpl.(ClickHouseConnectionImpl.java:256) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.jdbc.ClickHouseDataSource.getConnection(ClickHouseDataSource.java:46) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at com.clickhouse.jdbc.ClickHouseDataSource.getConnection(ClickHouseDataSource.java:16) ~[clickhouse-jdbc-0.3.2-patch11.jar:clickhouse-jdbc 0.3.2-patch11 (revision: 27f8951)] at org.springframework.jdbc.datasource.DataSourceUtils.fetchConnection(DataSourceUtils.java:158) ~[spring-jdbc-5.3.4.jar:5.3.4] at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:116) ~[spring-jdbc-5.3.4.jar:5.3.4] at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:79) ~[spring-jdbc-5.3.4.jar:5.3.4]

reason

The inconsistency between the nodes checked for health and the ones receiving actual requests is likely caused by the following code.

1. 
 public void check() {
        // exclusive access
        if (!checking.compareAndSet(false, true)) {
            return;
        }

        Set<ClickHouseNode> list = new LinkedHashSet<>();
        long currentTime = System.currentTimeMillis();
        boolean checkAll = template.config.getBoolOption(ClickHouseClientOption.CHECK_ALL_NODES);
        int healthyNodeStartIndex = -1;
        lock.readLock().lock();
        // TODO:
        // 1) minimize the list;
        // 2) detect flaky node and check it again later in a less frequent way
        try {
            pickNodes(faultyNodes, selector, list, groupSize, currentTime);
            if (checkAll) {
                healthyNodeStartIndex = list.size();
                pickNodes(nodes, selector, list, groupSize, currentTime);
            }
        } finally {
            checking.set(false);
            lock.readLock().unlock();
        }

        boolean hasFaultyNode = false;
        int count = 0;
        try {
            for (ClickHouseNode node : list) {
                ClickHouseNode n = node.probe();
                // probe is faster than ping but it cannot tell if the server works or not
                boolean isAlive = false;
                try (ClickHouseClient client = ClickHouseClient.builder().agent(false).config(n.config)
                        .nodeSelector(ClickHouseNodeSelector.of(n.getProtocol())).build()) {
                    isAlive = client.ping(n, n.config.getConnectionTimeout());
                } catch (Exception e) {
                    // ignore
                }
                if (!n.equals(node)) {
                    update(n, Status.MANAGED);
                    update(node, Status.STANDALONE);
                }

                if (isAlive) {
                    if (healthyNodeStartIndex < 0 || count < healthyNodeStartIndex) {
                        update(n, Status.HEALTHY);
                    }
                } else {
                    hasFaultyNode = true;
                    if (healthyNodeStartIndex >= count) {
                        update(n, Status.FAULTY);
                    }
                }
                count++;
            }
        } catch (Exception e) {
            log.warn("Unexpected error occurred when checking node status", e);
        } finally {
            if (checkAll || hasFaultyNode) {
                scheduleHealthCheck();
            }
        }
    }
2. 
    default boolean ping(ClickHouseNode server, int timeout) {
        if (server != null) {
            if (timeout < 1) {
                timeout = server.config.getConnectionTimeout();
            }
            if (server.getProtocol() == ClickHouseProtocol.ANY) {
                server = ClickHouseNode.probe(server.getHost(), server.getPort(), timeout);
            }
            try (ClickHouseResponse resp = read(server) // create request
                    .option(ClickHouseClientOption.ASYNC, false) // use current thread
                    .option(ClickHouseClientOption.CONNECTION_TIMEOUT, timeout)
                    .option(ClickHouseClientOption.SOCKET_TIMEOUT, timeout)
                    .option(ClickHouseClientOption.BUFFER_SIZE, 8) // actually 4 bytes should be enough
                    .option(ClickHouseClientOption.MAX_QUEUED_BUFFERS, 1) // enough with only one buffer
                    .format(ClickHouseFormat.TabSeparated)
                    .query("SELECT 1 FORMAT TabSeparated").execute()
                    .get(timeout, TimeUnit.MILLISECONDS)) {
                return resp != null;
            } catch (Exception e) {
                // ignore
            }
        }

        return false;
    }
3. 
    public final ClickHouseNode getServer() {
        ClickHouseNode node = serverRef.get();
        if (node == null) {
            node = server.apply(getClient().getConfig().getNodeSelector());
            if (!serverRef.compareAndSet(null, node)) {
                node = serverRef.get();
            }
        }
        return node;
    }

Configuration

Environment

ClickHouse server

### Tasks
zhicwu commented 1 year ago

Thanks for the report @wang-qijia. It is indeed looks like a bug, I'll see if I can reproduce the issue at my end.

wang-qijia commented 1 year ago

@zhicwu Hi, is there any progress on this matter?