AxonFramework / AxonFramework

Framework for Evolutionary Message-Driven Microservices on the JVM
https://axoniq.io/
Apache License 2.0
3.34k stars 794 forks source link

No reconnect of AxonQueryBus after the AxonServer restarts #1411

Closed akraskovski closed 4 years ago

akraskovski commented 4 years ago

Hi guys! I'm duplicating the issue message from google groups since I'm not the only one who caught this problem :)

AxonServer 4.3.1 (running in k8s stateful set. manifest was used from the official doc) Spring Boot app with axon-spring-boot-starter 4.3.1

The situation by steps(for running instance of AxonServer):

  1. Spring Boot application starts with the following logs:
12:44:24 [main]  INFO  o.a.a.c.AxonServerConnectionManager                                  - Requesting connection details from localhost:8124
12:44:25 [main]  INFO  o.a.a.c.AxonServerConnectionManager                                  - Reusing existing channel
12:44:25 [main]  INFO  o.a.a.c.AxonServerConnectionManager                                  - Re-subscribing commands and queries
12:44:25 [main]  INFO  o.a.a.c.query.AxonServerQueryBus                                         - Creating new query stream subscriber
12:44:25 [main]  INFO  o.a.a.c.command.AxonServerCommandBus                           - Creating new command stream subscriber
  1. In Axon dashboard I'm able to see available list of commands and queries. Then AxonServer restarts

  2. Spring Boot app becomes reconnect to AxonServer, and when connection is established prints the following logs:

12:53:01 [AxonServerConnectionManager-0] INFO  o.a.a.c.AxonServerConnectionManager                                  - Re-subscribing commands and queries
12:53:01 [AxonServerConnectionManager-0] INFO  o.a.a.c.command.AxonServerCommandBus                                 - Resubscribing Command handlers with AxonServer
12:53:01 [AxonServerConnectionManager-0] INFO  o.a.a.c.command.AxonServerCommandBus                                 - Creating new command stream subscriber

Here we didn't see a message like "Creating new query stream subscriber", so after successful reconnection, there are no queries in the Axon dashboard. Since I'm no able to push queries via queryGateway because no @QueryHandler is registered after the restart.

P.S with command subscribers, everything is good. It shows commands in the dashboard and handlers still working.

P.S The solution which helped to me(changed directly in AxonFramework source code) is:

  1. AxonServerCommandBus.CommandProcessor.getSubscriberObserver#onError(line 792) - has line "subscriberStreamObserver = null;", which means, that new observer will be created on next connection.
  2. AxonServerQueryBus.QueryProcessor.getSubscriberObserver#onError(line 721) - doesn't have "setting to null" the observer. So, it never re-creates the observer.

Hope you'll help to find the answer!

Thanks!

smcvb commented 4 years ago

Thanks for bringing this up @akraskovski. We've noticed several hiccups with regards to reconnecting the AxonServerQueryBus, like #1397 and #1386. Especially #1397 points towards what you are suggesting as well.

My guess is that your issue will thus be resolved with that fix in place. If you have the time, making a local snapshot build of framework would allow you to verify reconnects work with this in place. If you do not have the time to build a snapshot yourself, I'd suggest to wait for 4.3.2 to be released, somewhere this week.

Hope this clarifies your options @akraskovski!

akraskovski commented 4 years ago

Thanks, @smcvb for the answer!

We'll wait for this fix in 4.3.2

smcvb commented 4 years ago

@akraskovski, I am writing because I've just released 4.3.2. Would you mind giving it a spin and seeing whether the same issue still occurs?

akraskovski commented 4 years ago

@smcvb Hi! Just checked it using tagged 4.3.2 revision and it seems that everything is working!

Logs from Spring application after AxonServer restart now are the following: ` 19:24:01 [AxonServerConnectionManager-0] INFO o.a.a.c.AxonServerConnectionManager - Reusing existing channel

19:24:01 [AxonServerConnectionManager-0] INFO o.a.a.c.AxonServerConnectionManager - Re-subscribing commands and queries

19:24:01 [AxonServerConnectionManager-0] INFO o.a.a.c.command.AxonServerCommandBus - Resubscribing Command handlers with AxonServer

19:24:01 [AxonServerConnectionManager-0] INFO o.a.a.c.command.AxonServerCommandBus - Creating new command stream subscriber

19:24:01 [AxonServerConnectionManager-0] INFO o.a.a.c.query.AxonServerQueryBus - Creating new query stream subscriber `

And in the dashboard screen, I become able to see registered query handlers from this app

It's really good news, that solution was found so quickly. Thank you! Will wait for uploaded 4.3.2 spring-boot-starter release in maven central :)

smcvb commented 4 years ago

That's great news @akraskovski! I will close this issue as resolved then.

Thanks again for filing this concern with us and hope to see/chat with you in the future!

dimazelinskyi commented 3 years ago

Hello @smcvb,

I faced with the same issue but with latest version of Axon

AxonServer 4.4.9 (running in k8s) Spring Boot app with axon-spring-boot-starter 4.4.6

Spring Boot app logs:

17:35:52.642  WARN  i.a.a.c.impl.AxonServerManagedChannel    : Connecting to AxonServer node [localhost:8124] failed: UNAVAILABLE: io exception
17:35:52.642  INFO  i.a.a.c.impl.AxonServerManagedChannel    : Failed to get connection to AxonServer. Scheduling a reconnect in 2000ms
17:35:54.647  INFO  i.a.a.c.impl.AxonServerManagedChannel    : Requesting connection details from localhost:8124
17:35:55.878  INFO  i.a.a.c.impl.AxonServerManagedChannel    : Successfully connected to localhost:8124
17:35:56.250  INFO  i.a.a.connector.impl.ControlChannelImpl  : Connected instruction stream for context 'default'. Sending client identification
17:35:56.250  INFO  i.a.a.c.command.impl.CommandChannelImpl  : CommandChannel for context 'default' connected, 12 command handlers registered
17:36:09.782  INFO   o.a.e.TrackingEventProcessor             : Fetched token: IndexTrackingToken{globalIndex=8534} for segment: Segment[0/0]
17:36:09.782  INFO   o.a.e.TrackingEventProcessor             : Fetched token: IndexTrackingToken{globalIndex=8534} for segment: Segment[0/0]
17:36:09.787  INFO   o.a.e.TrackingEventProcessor             : Fetched token: IndexTrackingToken{globalIndex=8534} for segment: Segment[0/0]

There are no logs like this (about Query Channel) for example:

16:42:38.153  INFO i.a.a.c.query.impl.QueryChannelImpl      : QueryChannel for context 'default' connected, 6 registrations resubscribed

And also on Axon dashboard there are no Query handlers.

abuijze commented 3 years ago

Hi @dimazelinskyi,

do you have any occurrences of QueryChannel for context 'default' connected in the beginning of the logs, where the application started? It's probably literally QueryChannel for context 'default' connected, 0 registrations resubscribed.

The reconnect logic for both the Command and Query channels are identical. The only situation I currently see where one connects and the other doesn't is when there are no registered Query Handlers.

abuijze commented 3 years ago

Or do you see anything like Error on QueryChannel for context being logged? That should happen around the time where the connection is logged.

dimazelinskyi commented 3 years ago

Hi @abuijze,

Thanks for your answer.

It's probably literally QueryChannel for context 'default' connected, 0 registrations resubscribed. The only situation I currently see where one connects and the other doesn't is when there are no registered Query Handlers.

I think it was because query handlers were correctly working and returning expected values. (Unfortunately for now I do not have whole logs from starting app)

Error on QueryChannel for context

I do not have such message in logs.

This is what happens from reconnecting point:

Fetched token: IndexTrackingToken{globalIndex=8534} for segment: Segment[0/0]
Released claim
Released claim
Releasing claim on token and preparing for retry in 60s
Released claim
Requesting connection details from axon:8124
Connecting to AxonServer node [axon:8124] failed: UNAVAILABLE: io exception
Failed to get connection to AxonServer. Scheduling a reconnect in 2000ms
Requesting connection details from axon:8124
Connecting to AxonServer node [axon:8124] failed: UNAVAILABLE: io exception
Failed to get connection to AxonServer. Scheduling a reconnect in 2000ms
Requesting connection details from axon:8124
....

Successfully connected to axon:8124
Connected instruction stream for context 'default'. Sending client identification
CommandChannel for context 'default' connected, 12 command handlers registered
Fetched token: IndexTrackingToken{globalIndex=8534} for segment: Segment[0/0]

It looks like it totally skip reconnecting QueryHandlers (no query handlers on dashboard)

As temporary solution I downgraded to version axonserver:4.4.8

@abuijze if you have any suggestions how to trace such cases or any other steps that can help with this it would be really good.

Thanks!

abuijze commented 3 years ago

I don't think the issue is related to the version of AxonServer you're using. Especially the difference between 4.4.8 and 4.4.9 should not be able to influence the connectivity.

Instead, I think the issue is in the connector. However, looking at the code, I currently don't see a path where it could fail to attempt a reconnect on the QueryChannel. If you have any logs that you can share which are related to losing connectivity, that would be very helpful.

abuijze commented 3 years ago

Do you have any other axon related dependencies in your project, other than the Axon SpringBoot Starter?

dimazelinskyi commented 3 years ago

I don't think the issue is related to the version of AxonServer you're using. Especially the difference between 4.4.8 and 4.4.9 should not be able to influence the connectivity.

Instead, I think the issue is in the connector. However, looking at the code, I currently don't see a path where it could fail to attempt a reconnect on the QueryChannel. If you have any logs that you can share which are related to losing connectivity, that would be very helpful.

All logs related to reconnecting issue that I have I added above in the comments.

Do you have any other axon related dependencies in your project, other than the Axon SpringBoot Starter?

Except from Spring-Boot-Starter we have

+--- org.axonframework:axon-test:4.4.6
|    \--- org.axonframework:axon-eventsourcing:4.4.6 (*)

This is our Spring-Boot-Starter

+--- org.axonframework:axon-spring-boot-starter:4.4.6
|    +--- org.axonframework:axon-spring-boot-autoconfigure:4.4.6
|    |    +--- org.axonframework:axon-spring:4.4.6
|    |    |    +--- org.axonframework:axon-configuration:4.4.6
|    |    |    |    +--- org.axonframework:axon-eventsourcing:4.4.6
|    |    |    |    |    +--- org.axonframework:axon-modelling:4.4.6
|    |    |    |    |    |    +--- org.axonframework:axon-messaging:4.4.6
|    |    |    |    |    |    |    +--- org.slf4j:slf4j-api:1.7.28 -> 1.7.30
|    |    |    |    |    |    |    +--- com.thoughtworks.xstream:xstream:1.4.14
|    |    |    |    |    |    |    \--- xpp3:xpp3:1.1.4c
|    |    |    |    |    |    \--- org.slf4j:slf4j-api:1.7.28 -> 1.7.30
|    |    |    |    |    \--- org.axonframework:axon-messaging:4.4.6 (*)
|    |    |    |    +--- org.axonframework:axon-disruptor:4.4.6
|    |    |    |    |    +--- org.axonframework:axon-eventsourcing:4.4.6 (*)
|    |    |    |    |    \--- com.lmax:disruptor:3.4.2
|    |    |    |    +--- org.axonframework:axon-modelling:4.4.6 (*)
|    |    |    |    \--- org.axonframework:axon-messaging:4.4.6 (*)
|    |    |    +--- org.slf4j:slf4j-api:1.7.28 -> 1.7.30
|    |    |    +--- org.axonframework:axon-eventsourcing:4.4.6 (*)
|    |    |    +--- org.axonframework:axon-modelling:4.4.6 (*)
|    |    |    \--- org.axonframework:axon-messaging:4.4.6 (*)
|    |    \--- org.axonframework:axon-server-connector:4.4.6
|    |         \--- io.axoniq:axonserver-connector-java:4.4.6
|    |              +--- org.slf4j:slf4j-api:1.7.30
|    |              +--- io.grpc:grpc-protobuf:1.33.0
|    |              |    +--- io.grpc:grpc-api:1.33.0
|    |              |    |    +--- io.grpc:grpc-context:1.33.0
|    |              |    |    +--- com.google.code.findbugs:jsr305:3.0.2
|    |              |    |    +--- com.google.guava:guava:29.0-android
|    |              |    |    |    +--- com.google.guava:failureaccess:1.0.1
|    |              |    |    |    +--- com.google.guava:listenablefuture:9999.0-empty-to-avoid-conflict-with-guava
|    |              |    |    |    +--- com.google.code.findbugs:jsr305:3.0.2
|    |              |    |    |    +--- org.checkerframework:checker-compat-qual:2.5.5
|    |              |    |    |    +--- com.google.errorprone:error_prone_annotations:2.3.4
|    |              |    |    |    \--- com.google.j2objc:j2objc-annotations:1.3
|    |              |    |    +--- com.google.errorprone:error_prone_annotations:2.3.4
|    |              |    |    \--- org.codehaus.mojo:animal-sniffer-annotations:1.18
|    |              |    +--- com.google.code.findbugs:jsr305:3.0.2
|    |              |    +--- com.google.protobuf:protobuf-java:3.12.0
|    |              |    +--- com.google.api.grpc:proto-google-common-protos:1.17.0
|    |              |    +--- io.grpc:grpc-protobuf-lite:1.33.0
|    |              |    |    +--- io.grpc:grpc-api:1.33.0 (*)
|    |              |    |    +--- com.google.code.findbugs:jsr305:3.0.2
|    |              |    |    +--- com.google.guava:guava:29.0-android (*)
|    |              |    |    +--- com.google.errorprone:error_prone_annotations:2.3.4
|    |              |    |    \--- org.codehaus.mojo:animal-sniffer-annotations:1.18
|    |              |    +--- com.google.guava:guava:29.0-android (*)
|    |              |    +--- com.google.errorprone:error_prone_annotations:2.3.4
|    |              |    \--- org.codehaus.mojo:animal-sniffer-annotations:1.18
|    |              +--- io.grpc:grpc-stub:1.33.0
|    |              |    +--- io.grpc:grpc-api:1.33.0 (*)
|    |              |    +--- com.google.guava:guava:29.0-android (*)
|    |              |    +--- com.google.errorprone:error_prone_annotations:2.3.4
|    |              |    +--- org.codehaus.mojo:animal-sniffer-annotations:1.18
|    |              |    \--- com.google.code.findbugs:jsr305:3.0.2
|    |              +--- io.grpc:grpc-netty:1.33.0
|    |              |    +--- io.grpc:grpc-core:1.33.0
|    |              |    |    +--- io.grpc:grpc-api:1.33.0 (*)
|    |              |    |    +--- com.google.code.gson:gson:2.8.6
|    |              |    |    +--- com.google.android:annotations:4.1.1.4
|    |              |    |    +--- org.codehaus.mojo:animal-sniffer-annotations:1.18
|    |              |    |    +--- com.google.errorprone:error_prone_annotations:2.3.4
|    |              |    |    +--- com.google.guava:guava:29.0-android (*)
|    |              |    |    +--- io.perfmark:perfmark-api:0.19.0
|    |              |    |    |    \--- com.google.code.findbugs:jsr305:3.0.2
|    |              |    |    \--- com.google.code.findbugs:jsr305:3.0.2
|    |              |    +--- io.netty:netty-codec-http2:4.1.51.Final -> 4.1.50.Final (*)
|    |              |    +--- io.netty:netty-handler-proxy:4.1.51.Final -> 4.1.50.Final (*)
|    |              |    +--- com.google.guava:guava:29.0-android (*)
|    |              |    +--- com.google.errorprone:error_prone_annotations:2.3.4
|    |              |    +--- io.perfmark:perfmark-api:0.19.0 (*)
|    |              |    +--- org.codehaus.mojo:animal-sniffer-annotations:1.18
|    |              |    \--- com.google.code.findbugs:jsr305:3.0.2
|    |              \--- io.netty:netty-tcnative-boringssl-static:2.0.31.Final
|    +--- org.axonframework:axon-spring:4.4.6 (*)
|    +--- org.axonframework:axon-configuration:4.4.6 (*)
|    +--- org.axonframework:axon-eventsourcing:4.4.6 (*)
|    +--- org.axonframework:axon-modelling:4.4.6 (*)
|    \--- org.axonframework:axon-messaging:4.4.6 (*)

It looks like dependencies are clean

smcvb commented 3 years ago

We've yesterday released a new version of the axonserver-connector-java, which is a bit more thorough in performing reconnects if the channel states to be unavailable. Might be worth giving this a try to see whether it resolves your problem @dimazelinskyi.

You can tap into this, by updating your Axon Framework version to 4.4.7. This will automatically also pull in axonserver-connector-java 4.4.7.

dimazelinskyi commented 3 years ago

We've updated with latest version that you suggested. Will see. Thanks!