Open colebaileygit opened 3 years ago
Seems to go away if this is disabled: KSQL_KSQL_HEARTBEAT_ENABLE
I have a similar problem when I run ksqlDB in a cluster. Pull queries from a table like the one below terminate without returning any results, but if I remove WITH(key_format='JSON') everything works fine.
CREATE TABLE SUM_A WITH(key_format='JSON') AS
  SELECT
    id,
    SUM(x)
  FROM A
  GROUP BY x
  EMIT CHANGES;
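For context, a pull query against such a table is keyed by the GROUP BY column, so the failing lookup would look roughly like this (the key value 42 is illustrative, not from the original report):

```sql
-- Hypothetical pull query against SUM_A; the table is keyed by x
-- (the GROUP BY column), and 42 is an illustrative key value.
SELECT * FROM SUM_A WHERE x = 42;
```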
@colebaileygit did you also observe this on a more recent version of ksqlDB, or only on 0.19.0? I'm still trying to reproduce it.
@patrickstuedi I have since disabled the KSQL_KSQL_HEARTBEAT_ENABLE setting, which made the issue disappear. From memory, I was able to reproduce it locally with a docker-compose setup using 2 ksqlDB nodes, but I only tested with 0.19.0. It seems to be the specific combination of KSQL_KSQL_HEARTBEAT_ENABLE: true with no standby replicas set up. If standby replicas >= number of nodes, then the error also does not occur.
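In compose terms, the failing combination described above would be an environment fragment along these lines (a sketch, not the reporter's exact file):

```yaml
environment:
  # Failing combination: heartbeats enabled while no standby
  # replicas are configured (the replicas property is simply
  # absent, so it falls back to the default of 0).
  KSQL_KSQL_HEARTBEAT_ENABLE: 'true'
  # KSQL_KSQL_STREAMS_NUM_STANDBY_REPLICAS: <unset>
```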
I'm confused about the setup. You say you have no standbys set up, but you also say the expected behavior is that node A forwards to node B, which would only happen if node B is a standby. And the exception indicates that the partition request is forwarded, so a standby must be configured. Can you describe your setup, maybe sharing the configs you use for the two nodes?
@patrickstuedi I've prepared a simple repo to reproduce the issue, hopefully this sheds some light on the issue: https://github.com/colebaileygit/ksqldb-demos/tree/master/multi-node-pull-query
Please note that the expected behavior can be produced by changing to KSQL_KSQL_HEARTBEAT_ENABLE: 'false' in the docker-compose file.
As a follow-up, I've also tested a 3-node cluster with KSQL_KSQL_HEARTBEAT_ENABLE: 'true' and KSQL_KSQL_STREAMS_NUM_STANDBY_REPLICAS: 1, and it also seems to have trouble when the query is routed to non-active nodes.
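Per that comment, each node in the 3-node test would presumably carry an environment fragment along these lines (inferred from the comment, not a verbatim config):

```yaml
environment:
  KSQL_KSQL_HEARTBEAT_ENABLE: 'true'
  KSQL_KSQL_STREAMS_NUM_STANDBY_REPLICAS: 1
```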
Can you share/attach the configs you used for each of the servers?
@patrickstuedi it looks like it's here: https://github.com/colebaileygit/ksqldb-demos/blob/master/multi-node-pull-query/docker-compose.yml
Ah right, @colebaileygit sorry I missed your earlier comment with the link. Thanks @agavra for pointing it out.
Sorry for the delayed response on this. I ran your setup and could reproduce the error. Then I added a few extra config properties, after which querying both the active and the standby succeeded. Here is the modified config I used:
ksql-1:
  image: confluentinc/ksqldb-server:0.19.0
  ports:
    - "8088:8088"
  depends_on:
    - kafka
  volumes:
    - ./sql:/home/appuser/sql
  environment:
    KSQL_BOOTSTRAP_SERVERS: kafka:9092
    KSQL_KSQL_SERVICE_ID: ksql-local_
    KSQL_KSQL_SCHEMA_REGISTRY_URL: http://schema-registry:8081
    KSQL_KSQL_PERSISTENCE_DEFAULT_FORMAT_KEY: AVRO
    KSQL_KSQL_PERSISTENCE_DEFAULT_FORMAT_VALUE: AVRO
    KSQL_CONFLUENT_SUPPORT_METRICS_ENABLE: 'false'
    KSQL_KSQL_HEARTBEAT_ENABLE: 'true'
    KSQL_KSQL_QUERY_PULL_ENABLE_STANDBY_READS: "true"
    KSQL_KSQL_LAG_REPORTING_ENABLE: "true"
    KSQL_KSQL_STREAMS_NUM_STANDBY_REPLICAS: 1

## For testing clustering
ksql-2:
  image: confluentinc/ksqldb-server:0.19.0
  ports:
    - "8089:8089"
  depends_on:
    - kafka
  volumes:
    - ./sql:/home/appuser/sql
  environment:
    KSQL_BOOTSTRAP_SERVERS: kafka:9092
    KSQL_KSQL_SERVICE_ID: ksql-local_
    KSQL_KSQL_SCHEMA_REGISTRY_URL: http://schema-registry:8081
    KSQL_KSQL_PERSISTENCE_DEFAULT_FORMAT_KEY: AVRO
    KSQL_KSQL_PERSISTENCE_DEFAULT_FORMAT_VALUE: AVRO
    KSQL_CONFLUENT_SUPPORT_METRICS_ENABLE: 'false'
    KSQL_KSQL_HEARTBEAT_ENABLE: 'true'
    KSQL_KSQL_QUERY_PULL_ENABLE_STANDBY_READS: "true"
    KSQL_KSQL_LAG_REPORTING_ENABLE: "true"
    KSQL_KSQL_STREAMS_NUM_STANDBY_REPLICAS: 1
The main changes relative to the original compose file are the added KSQL_KSQL_QUERY_PULL_ENABLE_STANDBY_READS, KSQL_KSQL_LAG_REPORTING_ENABLE, and KSQL_KSQL_STREAMS_NUM_STANDBY_REPLICAS properties.
My understanding of the error is that without those properties set, the second node comes up as a standby (because it's configured with the same service id), but because the replicas and lag-reporting properties are not set, the standby is not properly integrated. The error message above indicates that the query is forwarded but then cannot be properly de-serialized. It might be that the standby gets the query but expected a heartbeat, or the other way round; that's just a guess, and we'll need to figure it out. Clearly the error message is confusing and not particularly helpful.
I took another look at this issue. I was able to reproduce it on 0.19.0 and also 0.22.0, but the problem seems to have disappeared when using current master.
Describe the bug
Simple pull query fails when executing on node A but works on node B. (No standby replicas active.)

To Reproduce

Expected behavior
Both nodes should return the same result (node A calls node B and forwards the result to the user).

Actual behaviour
Node A returns the error "Exhausted standby hosts to try.", but the logs reveal the real error (see context below).

Additional context
JSON API response from node B:

Full error logs: