Closed agavra closed 3 years ago
It looks like this line is printed in HARouting if it doesn't find a host for any of the partitions:
final boolean anyPartitionsEmpty = locations.stream()
    .anyMatch(location -> location.getNodes().isEmpty());
if (anyPartitionsEmpty) {
  LOG.debug("Unable to execute pull query: {}. All nodes are dead or exceed max allowed lag.",
      statement.getStatementText());
  throw new MaterializationException("Unable to execute pull query. "
      + "All nodes are dead or exceed max allowed lag.");
}
This could be a bug in HARouting or a problem with the Streams metadata discovery logic, or it could indicate that one or more of the partitions are not available.
We should refine the message to call out which of "all nodes are dead" or "exceeded max lag" was the cause. Since it looks like other causes might be hidden behind this message, we should also try to accurately describe any other conditions that could have caused the error.
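As a rough sketch of what a refined message could look like, the routing code could record why each host was filtered out for each partition and build the error text from that. Everything below is hypothetical and only illustrates the idea: `PullQueryDiagnostics`, `PartitionStatus`, `RejectedHost`, and `Exclusion` are made-up names and are not existing ksqlDB classes, and how HARouting actually tracks filtered hosts is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of ksqlDB: builds a per-partition error message.
final class PullQueryDiagnostics {

  // Assumed reasons a host can be filtered out during routing.
  enum Exclusion { DEAD, LAG_EXCEEDED }

  // One host that was considered for a partition and rejected.
  static final class RejectedHost {
    final String host;
    final Exclusion reason;

    RejectedHost(final String host, final Exclusion reason) {
      this.host = host;
      this.reason = reason;
    }
  }

  // Per-partition view: hosts that can serve the query and hosts that were rejected.
  static final class PartitionStatus {
    final int partition;
    final List<String> usableHosts;
    final List<RejectedHost> rejected;

    PartitionStatus(final int partition, final List<String> usableHosts,
        final List<RejectedHost> rejected) {
      this.partition = partition;
      this.usableHosts = usableHosts;
      this.rejected = rejected;
    }
  }

  private static String label(final Exclusion reason) {
    return reason == Exclusion.DEAD ? "dead" : "exceeds max allowed lag";
  }

  // Returns an empty string if every partition has at least one usable host.
  static String describeUnavailable(final List<PartitionStatus> statuses) {
    final List<String> problems = new ArrayList<>();
    for (final PartitionStatus status : statuses) {
      if (!status.usableHosts.isEmpty()) {
        continue;
      }
      if (status.rejected.isEmpty()) {
        // Neither "dead" nor "lagging": metadata returned no hosts at all.
        problems.add("partition " + status.partition + ": no hosts found in metadata");
      } else {
        final String hosts = status.rejected.stream()
            .map(r -> r.host + " (" + label(r.reason) + ")")
            .collect(Collectors.joining(", "));
        problems.add("partition " + status.partition + ": " + hosts);
      }
    }
    return problems.isEmpty()
        ? ""
        : "Unable to execute pull query. " + String.join("; ", problems);
  }
}
```

The "no hosts found in metadata" branch is there so that a partition with no candidates at all is reported separately instead of being folded into the dead/lagging wording.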
Bumping this to P0, as it isn't that uncommon. It happened to me in a production CCloud cluster while I was hacking together some things, and it required a node bounce to resolve. We should fix this ASAP.
I was looking at some logs and for a cluster that has no lag and only a single host (which was alive at the time) I was seeing the following exception:
The error message doesn't help me understand what happened, which partition failed, or whether there's some other issue. We should print more debugging information.