Improve the error message/logs for failed pull queries (All nodes are dead)

agavra commented 3 years ago

I was looking at some logs for a cluster that has no lag and only a single host (which was alive at the time), and I saw the following exception:

Unable to execute pull query. All nodes are dead or exceed max allowed lag.

The error message doesn't help me understand what happened, which partition failed, or whether there's some other issue. We should print more debugging information.

vvcephei commented 3 years ago

It looks like this line is printed in HARouting when at least one of the partitions has no host to route to:

    final boolean anyPartitionsEmpty = locations.stream()
        .anyMatch(location -> location.getNodes().isEmpty());
    if (anyPartitionsEmpty) {
      LOG.debug("Unable to execute pull query: {}. All nodes are dead or exceed max allowed lag.",
                statement.getStatementText());
      throw new MaterializationException("Unable to execute pull query. "
          + "All nodes are dead or exceed max allowed lag.");
    }

This could be a bug in HARouting or a problem with the Streams metadata discovery logic, or it could indicate that one or more of the partitions is not available.

We should be able to refine the message to call out which of "all nodes are dead" or "exceeded max lag" was the cause. Since it looks like other causes might be hidden here as well, we should also try to accurately describe any other conditions that might have caused the error.
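As a rough sketch of what a more specific message could look like (not the actual HARouting code): the snippet below assumes we can still see, per partition, every candidate host along with the reason it was filtered out. `PullQueryRoutingDiagnostics`, `Candidate`, `PartitionCandidates`, and `Rejection` are all hypothetical names made up for illustration; none of them exist in ksqlDB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

/** Hypothetical sketch; these types are illustrative and not part of ksqlDB. */
final class PullQueryRoutingDiagnostics {

  /** Why a candidate host was filtered out for a partition (assumed reasons). */
  enum Rejection { DEAD, LAG_EXCEEDED }

  /** One candidate host for a partition. */
  static final class Candidate {
    final String host;
    final Rejection rejection; // null means the host is usable

    Candidate(final String host, final Rejection rejection) {
      this.host = host;
      this.rejection = rejection;
    }
  }

  /** All candidate hosts that metadata discovery returned for one partition. */
  static final class PartitionCandidates {
    final int partition;
    final List<Candidate> candidates;

    PartitionCandidates(final int partition, final List<Candidate> candidates) {
      this.partition = partition;
      this.candidates = candidates;
    }
  }

  /**
   * Returns an empty Optional if every partition has at least one usable host.
   * Otherwise returns a message naming each failing partition and the reason
   * each of its hosts was rejected.
   */
  static Optional<String> describeFailure(final List<PartitionCandidates> partitions) {
    final List<String> problems = new ArrayList<>();
    for (final PartitionCandidates p : partitions) {
      final boolean anyUsable = p.candidates.stream().anyMatch(c -> c.rejection == null);
      if (anyUsable) {
        continue;
      }
      if (p.candidates.isEmpty()) {
        // Distinguishes "metadata returned nothing" from "hosts were filtered out".
        problems.add("partition " + p.partition + ": no hosts found in metadata");
      } else {
        final String hosts = p.candidates.stream()
            .map(c -> c.host + " (" + c.rejection + ")")
            .collect(Collectors.joining(", "));
        problems.add("partition " + p.partition + ": " + hosts);
      }
    }
    if (problems.isEmpty()) {
      return Optional.empty();
    }
    return Optional.of("Unable to execute pull query. No usable host for "
        + problems.size() + " of " + partitions.size() + " partition(s): "
        + String.join("; ", problems));
  }
}
```

With something along these lines, the single-host case from the original report would show whether the host was missing from the metadata entirely or was present but rejected as dead or lagging, which are the cases the current message lumps together.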

agavra commented 3 years ago

Bumping this to P0 as it isn't that uncommon. It happened to me in a production CCloud cluster while I was hacking together some things - it required a node bounce to resolve. We should fix this ASAP.

confluentinc / ksql #7772