aws / aws-iot-device-sdk-java-v2

Next generation AWS IoT Client SDK for Java using the AWS Common Runtime
Apache License 2.0
113 stars 76 forks source link

disconnection with error code [5154] and reason code [SESSION_TAKEN_OVER] #580

Closed kriehzoo closed 4 months ago

kriehzoo commented 5 months ago

Describe the bug

I have a subsribing client written in Java and Spring Boot. When I start the client I keep seeing SESSION_TAKEN_OVER. Already opened a case with AWS support but I was referred to this repo to open an issue here. I know that SESSION_TAKEN_OVER will happen when I use the same client ID for different clients but I only start one client so this cannot be the issue. My suspicion is that resources may be leaking from a previous run? Do I need some cleanup when shutting down my client? I came across CrtResource.waitForNoResources() so is this required? How can I detect if resources are leaking? I use my IoT thing name as client ID.

Building my client looks like this:

@Autowired
  public IoTCoreMqtt5ConnectionHandler(
      @Value("${iot.enabled:false}") Boolean enabled,
      @Value("${iot.thing.name}") @NotEmpty String thingName,
      @Value("${iot.endpoint.host}") String host,
      @Value("${iot.endpoint.port:8883}") @NotNull Long port,
      @Value("${iot.auth.certPath}") @NotEmpty String certPath,
      @Value("${iot.auth.privateKeyPath}") @NotEmpty String privateKeyPath,
      @Value("${iot.auth.caFilePath}") @NotEmpty String caFilePath,
      @Value("${iot.subscribeTimeoutSeconds:10}") Long subscribeTimeoutSeconds,
      @Value("${iot.keepAliveIntervalSeconds:60}") Long keepAliveIntervalSeconds,
      ApplicationEventPublisher applicationEventPublisher, ObjectMapper objectMapper) {
    this.enabled = enabled;
    this.thingName = thingName;
    this.eventTopic = "connector/" + thingName + "/event";
    this.host = host;
    this.port = port;
    this.certPath = certPath;
    this.privateKeyPath = privateKeyPath;
    this.caFilePath = caFilePath;
    this.subscribeTimeoutSeconds = subscribeTimeoutSeconds;
    this.keepAliveIntervalSeconds = keepAliveIntervalSeconds;
    this.applicationEventPublisher = applicationEventPublisher;
    this.objectMapper = objectMapper;
  }

for stop:

/**
   * Disconnect from the MQTT broker and wait for all CRT resources to be cleaned up.
   * A stopped client can be restarted.
   *
   * @throws ExecutionException   if this future completed exceptionally
   * @throws InterruptedException if the current thread was interrupted while waiting
   */
  private void stop() throws ExecutionException, InterruptedException {
    log.debug("stop");
    client.stop(new DisconnectPacket.DisconnectPacketBuilder()
        .withReasonCode(DisconnectPacket.DisconnectReasonCode.NORMAL_DISCONNECTION)
        .withReasonString("client is shutting down")
        .build());
  }

creating the client:

/**
   * Creates a client w/o starting it.
   */
  private void createClient() {
    if (client != null) {
      throw new CreateMqtt5ClientFailedException("Client is already created");
    }

    ConnectPacket.ConnectPacketBuilder connectPacketBuilder = new ConnectPacket.ConnectPacketBuilder()
        .withClientId(thingName)
        .withKeepAliveIntervalSeconds(keepAliveIntervalSeconds);

    SocketOptions socketOptions = new SocketOptions();
    socketOptions.keepAlive = true;

    // NEEDS REVIEW maximum of 7 days and even that may require a change to the account limits using the limits process
    // currently, although setting 7 days, the negotiated session expiry interval is 3600 seconds (1 hour) which
    // would be too short for our use case
    connectPacketBuilder.withSessionExpiryIntervalSeconds(60 * 60 * 24 * 7L); // 7 days in seconds
    try (AwsIotMqtt5ClientBuilder builder =
             AwsIotMqtt5ClientBuilder.newDirectMqttBuilderWithMtlsFromPath(host, certPath, privateKeyPath)
                 .withConnectProperties(connectPacketBuilder)
                 .withCertificateAuthorityFromPath(null, caFilePath)
                 .withLifeCycleEvents(lifecycleEvents)
                 .withPublishEvents(publishEvents)
                 // MQTT broker uses keep alive times 1.5 before it drops an idle connection
                 .withMinReconnectDelayMs((long) (keepAliveIntervalSeconds * 1000 * 1.5))
                 //.withSocketOptions(socketOptions)
                 // persistent session (must have same client ID, and a session expiry > 0)
                 // REVIEW REJOIN_POST_SUCCESS should be enough, but then an existing session was not re-joined
                 .withSessionBehavior(Mqtt5ClientOptions.ClientSessionBehavior.REJOIN_ALWAYS)
                 .withPort(port);
    ) {
      client = builder.build();
    }

    if (client == null) {
      throw new CreateMqtt5ClientFailedException("Failed to create the MQTT-5 client");
    }
  }

starting it:

/**
   * Connect to the MQTT broker using MQTT-5.
   * A client can be stopped and restarted.
   * If the client has not been created, it will be created first.
   *
   * @throws ExecutionException   if this future completed exceptionally
   * @throws InterruptedException if the current thread was interrupted while waiting
   */
  private void startClient() throws ExecutionException, InterruptedException {
    if (client == null) {
      createClient();
    } else {
      if (client.getIsConnected()) {
        throw new StartMqtt5ClientFailedException("Client is already connected");
      }
    }
    log.info("Start");
    client.start();
  }

subscribing:

private void subscribeToConnectorEvents() throws ExecutionException, InterruptedException, TimeoutException {
    QOS qos = QOS.AT_LEAST_ONCE;
    log.debug("subscribing to topic [{}] with QOS [{}]", eventTopic, qos);
    SubAckPacket subAckPacket = client.subscribe(new SubscribePacket.SubscribePacketBuilder()
        .withSubscription(eventTopic,
            qos,
            true,
            false,
            SubscribePacket.RetainHandlingType.SEND_ON_SUBSCRIBE_IF_NEW) // only for retained messages
        .build()).get(subscribeTimeoutSeconds, TimeUnit.SECONDS);
    log.info("subscribed to topic [{}] with reason codes {}: {}",
        eventTopic, subAckPacket.getReasonCodes(), subAckPacket.getReasonString());

    if (!subAckPacket.getReasonCodes().contains(SubAckPacket.SubAckReasonCode.GRANTED_QOS_1)) {
      throw new SubscriptionDoesNotMatchRequirementsException(String.format(
          "QOS [%s] was not granted although QOS [%s] was requested",
          SubAckPacket.SubAckReasonCode.GRANTED_QOS_1, qos));
    }

    if (hasSubscriptionError(subAckPacket)) {
      throw new SubscriptionFailedException(String.format(
          "Subscription to topic [%s] failed with reason codes [%s]: %s",
          eventTopic, subAckPacket.getReasonCodes(), subAckPacket.getReasonString()));
    }
  }

shutdown:

public void onShutdown() {
    if (!enabled) {
      return;
    }
    stopAndCloseClient();
  }

  private void stopAndCloseClient() {
    try {
      stopClient();
    } finally {
      closeClient();
    }
  }

  private void closeClient() {
    log.debug("close");
    // Important: When finished with an MQTT5 client, you must call close() on it or any associated
    // native resource may leak
    client.close();
    //CrtResource.waitForNoResources(); // NEEDS REVIEW
    //log.info("resources cleaned up");
    client = null;
  }

  private void stopClient() {
    try {
      stop();
    } catch (ExecutionException e) {
      throw new DisconnectOnShutdownFailedException(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // re-interrupt the thread, no need to fail the shutdown
    }
  }

  public void onContextClosedEvent(ContextClosedEvent event) {
    log.debug("onApplicationEvent: {}", event);
    // Important: When finished with an MQTT5 client, you must call close() on it or any associated
    // native resource may leak
    onShutdown();
  }

startup:

public void onContextRefreshedEvent(ContextRefreshedEvent event) {
    log.debug("onApplicationEvent: {}", event);
    onStartup();
  }

  private void onStartup() {
    if (!enabled) {
      log.warn("IoT Core is disabled");
      return;
    }

    try {
      startClient();
    } catch (InterruptedException ex) {
      Thread.currentThread().interrupt(); // re-interrupt the thread
      throw new StartMqtt5ClientFailedException(ex); // and fail the startup

    } catch (CrtRuntimeException | ExecutionException ex) {
      throw new StartMqtt5ClientFailedException(ex);
    }

    // REVIEW wait for the initial connection to be established before subscribing
    // Once connected, the MQTT5 client can subscribe to one or more topics
    // make sure not to do this in the connection success callback because it will
    // block the MQTT event loop, and you will run into a timeout
    try {
      subscribeToConnectorEvents();
    } catch (ExecutionException | InterruptedException | TimeoutException e) {
      if (e instanceof InterruptedException) {
        Thread.currentThread().interrupt(); // re-interrupt the thread
      }
      throw new ConnectAndSubscribeTopicsFailedException("Failed to subscribe", e);
    }
  }

So I think these are all relevant code snippets.

Expected Behavior

to not see SESSION_TAKEN_OVER at all

Current Behavior

15:45:33.113 [main] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - onApplicationEvent: org.springframework.context.event.ContextRefreshedEvent[source=org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext@305f031, started on Tue May 07 15:45:31 CEST 2024] 15:45:33.500 [main] INFO c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - Start 15:45:33.500 [main] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - subscribing to topic [connector/connector-608/event] with QOS [AT_LEAST_ONCE] 15:45:33.500 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - attempting connect 15:45:33.889 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties [] 15:45:33.960 [main] INFO c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - subscribed to topic [connector/connector-608/event] with reason codes [GRANTED_QOS_1]: null 15:45:33.965 [main] INFO c.m.c.s.ConnectorClientApplication - Started ConnectorClientApplication in 2.44 seconds (process running for 2.796) 15:45:34.167 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154] 15:45:34.167 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null 15:47:01.925 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - attempting connect 15:47:02.426 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties [] 15:47:03.127 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154] 15:47:03.127 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null 15:47:29.742 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - attempting connect 15:47:30.053 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties [] 15:47:31.157 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154] 15:47:31.157 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null 15:49:31.158 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - attempting connect 15:49:31.471 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties [] 15:49:32.455 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154] 15:49:32.455 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null 15:50:55.170 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - attempting connect 15:50:55.534 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties [] 15:50:56.696 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154] 15:50:56.696 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null 15:51:08.092 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - attempting connect 15:51:08.392 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties [] 15:51:08.939 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154]

Reproduction Steps

Relevant source files were provided to AWS support. This issue will also be linked there.

Possible Solution

I suspect a resource leak.

Additional Information/Context

No response

SDK version used

1.20.1

Environment details (OS name and version, etc.)

Mac 14.3.1

kriehzoo commented 5 months ago

I enabled CRT tracing and native debugging by adding these VM options: -Daws.crt.log.level=Trace -Daws.crt.log.destination=Stdout -Daws.crt.debugnative=true Also I added a call to CrtResource.logNativeResources(); on startup and shutdown to see if something is leaking. Plus a call to CrtResource.waitForNoResources(); after closing the client.

When just running my client and waiting for a couple of minutes, I keep getting the SESSION_TAKEN_OVER:

[TRACE] [2024-05-07T15:45:31Z] [00000001727eb000] [event-loop] - id=0x144f2ed60: waiting for a maximum o17:46:21.263 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154]
17:46:21.263 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null
f 100s 0ns

Next time:

[TRACE] [2024-05-07T15:46:30Z] [00000001725bb000] [ev17:46:30.448 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties []
17:46:31.716 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154]
17:46:31.716 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null
ent-loop] - id=0x144f2be90: wake up with 0 events to process.

This goes on for a while until I stop my app.

Some suspicous things that I see from the logs:

[ERROR] [2024-05-07T15:59:11Z] [00000001725bb000] [mqtt5-client] - MQTT5 client termination function in JNI called, but could not get env

And after waiting for the 1 minute timeout:

[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - waitForNoResources - timeOut
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - Dumping native object set:
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - [Id 0, Class SocketOptions, Refs 1](2024-05-07T15:38:51.039533Z) - <null>
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - [Id 1, Class TlsContextOptions, Refs 1](2024-05-07T15:38:51.042230Z) - <null>
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - [Id 3, Class TlsContext, Refs 1](2024-05-07T15:38:51.042744Z) - <null>
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - [Id 5, Class ClientBootstrap, Refs 1](2024-05-07T15:38:51.072969Z) - <null>
   Forward references by Id: 7 6 
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - [Id 6, Class EventLoopGroup, Refs 2](2024-05-07T15:38:51.072987Z) - <null>
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - [Id 7, Class HostResolver, Refs 1](2024-05-07T15:38:51.073470Z) - <null>
   Forward references by Id: 6 
[WARN] [2024-05-07T16:00:13Z] [000000016c657000] [JavaCrtGeneral] - Not all native threads were successfully joined during gentle shutdown.  Memory may be leaked.

Process finished with exit code 130 (interrupted by signal 2:SIGINT)

Hope this helps, I can also provide full CRT logs if that would help.

kriehzoo commented 5 months ago

Another suspicious line from the same logs:

[ERROR] [2024-05-07T15:59:11Z] [00000001725bb000] [mqtt5-client] - id=0x144bd3db0: client channel shutdown invoked from unexpected state 6(PENDING_RECONNECT)

sbSteveK commented 5 months ago

The SESSION_TAKEN_OVER error code is sent in a disconnect packet from the server when the server gets a connect packet containing the same client id as a client that is currently connected to it. The server will accept the connection from the new client and disconnect the previously connected client.

You probably have two clients running concurrently using the same client id and they are taking turns disconnecting each other. To test whether this is the issue, try changing your createClient() function to use a UUID to generate a client id instead of using thing name at .withClientId(thingName). If this resolves the SESSION_TAKEN_OVER issue, you can then look into your code to figure out how and why it's creating more than one client.

You can also setup and check Cloudwatch logs to see the incoming CONNECT packets to your MQTT endpoint. If you see more than one CONNECT packet with two different UUID client ids, that will also confirm that something is creating multiple clients and starting them up.

kriehzoo commented 4 months ago

Hi Kim, thank you for your suggestions. I already checked for multiple clients using the same client ID and also reviewed my code for this. However, I am pretty sure that I am not using the same client id. Still I will try what you suggested although this will probably give me issues with IoT permissions as the permission policy is based on the thing name. I will post the outcome as soon as possible. And I will double check the CloudWatch logs. In the meantime, what about my previous comments on suspicious log output from the SDK?

kriehzoo commented 4 months ago

Hello Kim,

W/o any changes to the thing name after starting my client, here are the events as shown in IoTCoreV2 logs on CloudWatch.

I get 3 events: Connect, Subscribe, Disconnect in the first log stream:

{ "timestamp": "2024-05-14 14:52:53.812", "logLevel": "INFO", "traceId": "a9f601e6-295a-ebc3-d168-459817ead22c", "accountId": "851725189817", "status": "Success", "eventType": "Connect", "protocol": "MQTT", "clientId": "connector-608", "principalId": "fb03ca6b5bf8ea850fa216eb96007b9ab7f2c4182ba23ad945f7df94b07fa836", "sourceIp": "80.120.7.190", "sourcePort": 47021 }

{ "timestamp": "2024-05-14 14:52:53.873", "logLevel": "INFO", "traceId": "e7870603-26d5-c35f-2450-16ea327a34e7", "accountId": "851725189817", "status": "Success", "eventType": "Subscribe", "protocol": "MQTT", "topicName": "connector/connector-608/event", "subscriptions": [ { "topicName": "connector/connector-608/event", "reasonCode": 1 } ], "clientId": "connector-608", "principalId": "fb03ca6b5bf8ea850fa216eb96007b9ab7f2c4182ba23ad945f7df94b07fa836", "sourceIp": "80.120.7.190", "sourcePort": 47021 }

{ "timestamp": "2024-05-14 14:52:54.916", "logLevel": "ERROR", "traceId": "7b921e95-2c25-e6a4-3ad8-e32471faa7d3", "accountId": "851725189817", "status": "Failure", "eventType": "Disconnect", "protocol": "MQTT", "clientId": "connector-608", "principalId": "fb03ca6b5bf8ea850fa216eb96007b9ab7f2c4182ba23ad945f7df94b07fa836", "sourceIp": "80.120.7.190", "sourcePort": 47021, "reason": "DUPLICATE_CLIENT_ID", "details": "A new connection was established with the same client ID", "disconnectReason": "DUPLICATE_CLIENTID" }

then the next log stream shows a connect event:

{ "timestamp": "2024-05-14 14:54:11.414", "logLevel": "INFO", "traceId": "f6676a00-5383-fa83-aa57-8a101cbe619b", "accountId": "851725189817", "status": "Success", "eventType": "Connect", "protocol": "MQTT", "clientId": "connector-608", "principalId": "fb03ca6b5bf8ea850fa216eb96007b9ab7f2c4182ba23ad945f7df94b07fa836", "sourceIp": "80.120.7.190", "sourcePort": 5666 }

and another log stream contains a disconnect:

{ "timestamp": "2024-05-14 14:56:12.377", "logLevel": "ERROR", "traceId": "a340d241-125f-cbbe-0734-89bd69d46ed4", "accountId": "851725189817", "status": "Failure", "eventType": "Disconnect", "protocol": "MQTT", "clientId": "connector-608", "principalId": "fb03ca6b5bf8ea850fa216eb96007b9ab7f2c4182ba23ad945f7df94b07fa836", "sourceIp": "80.120.7.190", "sourcePort": 40656, "reason": "DUPLICATE_CLIENT_ID", "details": "A new connection was established with the same client ID", "disconnectReason": "DUPLICATE_CLIENTID" }

kriehzoo commented 4 months ago

When I stop my client i get client logs:

17:02:55.585 [AwsEventLoop 12] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null 17:03:32.011 [SpringApplicationShutdownHook] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - onApplicationEvent: org.springframework.context.event.ContextClosedEvent[source=org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext@305f031, started on Tue May 14 16:52:51 CEST 2024] 17:03:32.011 [SpringApplicationShutdownHook] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - stop 17:03:32.013 [SpringApplicationShutdownHook] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - close 17:03:32.013 [SpringApplicationShutdownHook] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - closed 17:03:32.013 [SpringApplicationShutdownHook] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - waitForNoResources 17:04:32.202 [SpringApplicationShutdownHook] WARN c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - Failed to wait for all resources being cleaned up java.lang.RuntimeException: Timeout waiting for resource count to drop to zero at software.amazon.awssdk.crt.CrtResource.waitForNoResources(CrtResource.java:448) at com.medina.connector.standalone.iot.mqtt5.IoTCoreMqtt5ConnectionHandler.closeClient(IoTCoreMqtt5ConnectionHandler.java:417) at com.medina.connector.standalone.iot.mqtt5.IoTCoreMqtt5ConnectionHandler.stopAndCloseClient(IoTCoreMqtt5ConnectionHandler.java:404) at com.medina.connector.standalone.iot.mqtt5.IoTCoreMqtt5ConnectionHandler.onShutdown(IoTCoreMqtt5ConnectionHandler.java:397) at com.medina.connector.standalone.iot.mqtt5.IoTCoreMqtt5ConnectionHandler.onContextClosedEvent(IoTCoreMqtt5ConnectionHandler.java:443) at com.medina.connector.standalone.iot.mqtt5.IoTCoreMqtt5ConnectionHandler.onApplicationEvent(IoTCoreMqtt5ConnectionHandler.java:389) at org.springframework.context.event.SimpleApplicationEventMulticaster.doInvokeListener(SimpleApplicationEventMulticaster.java:185) at org.springframework.context.event.SimpleApplicationEventMulticaster.invokeListener(SimpleApplicationEventMulticaster.java:178) at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:156) at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:451) at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:384) at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1130) at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.doClose(ServletWebServerApplicationContext.java:174) at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:1093) at org.springframework.boot.SpringApplicationShutdownHook.closeAndWait(SpringApplicationShutdownHook.java:145) at java.base/java.lang.Iterable.forEach(Iterable.java:75) at org.springframework.boot.SpringApplicationShutdownHook.run(SpringApplicationShutdownHook.java:114) at java.base/java.lang.Thread.run(Thread.java:1583) 17:04:32.215 [SpringApplicationShutdownHook] INFO o.s.i.e.SourcePollingChannelAdapter - stopped bean 'fileWatcherFlow.org.springframework.integration.config.SourcePollingChannelAdapterFactoryBean#0'; defined in: 'class path resource [com/medina/connector/standalone/file/FileWatcherConfig.class]'; from source: 'bean method fileWatcherFlow' 17:04:32.215 [SpringApplicationShutdownHook] INFO o.s.i.channel.DirectChannel - Channel 'application.fileWatcherFlow.channel#0' has 0 subscriber(s). 17:04:32.215 [SpringApplicationShutdownHook] INFO o.s.i.endpoint.EventDrivenConsumer - stopped bean 'fileWatcherFlow.org.springframework.integration.config.ConsumerEndpointFactoryBean#0'; defined in: 'class path resource [com/medina/connector/standalone/file/FileWatcherConfig.class]'; from source: 'bean method fileWatcherFlow' 17:04:32.220 [SpringApplicationShutdownHook] INFO o.s.i.endpoint.EventDrivenConsumer - Removing {logging-channel-adapter:_org.springframework.integration.errorLogger} as a subscriber to the 'errorChannel' channel 17:04:32.220 [SpringApplicationShutdownHook] INFO o.s.i.c.PublishSubscribeChannel - Channel 'application.errorChannel' has 0 subscriber(s). 17:04:32.220 [SpringApplicationShutdownHook] INFO o.s.i.endpoint.EventDrivenConsumer - stopped bean '_org.springframework.integration.errorLogger'

Process finished with exit code 130 (interrupted by signal 2:SIGINT)

the event that matches the shutdown in CloudWatch:

{ "timestamp": "2024-05-14 15:02:55.634", "logLevel": "ERROR", "traceId": "1b441fcb-5258-47a5-283c-1cea49068f0d", "accountId": "851725189817", "status": "Failure", "eventType": "Disconnect", "protocol": "MQTT", "clientId": "connector-608", "principalId": "fb03ca6b5bf8ea850fa216eb96007b9ab7f2c4182ba23ad945f7df94b07fa836", "sourceIp": "80.120.7.190", "sourcePort": 54916, "reason": "DUPLICATE_CLIENT_ID", "details": "A new connection was established with the same client ID", "disconnectReason": "DUPLICATE_CLIENTID" }

and that is the last IoT Event i get there with the next minutes, so it seems this was indeed the only client running

kriehzoo commented 4 months ago

as expected, connect fails when using some random UUID:

connection failure with error code: [5150]: assignedClientIdentifier [null], reasonCode [NOT_AUTHORIZED], reasonString [CONNACK:Client is not authenticated/authorized to send the message:0af2347e-d534-cb34-502c-af1bac3ebdb4]

kriehzoo commented 4 months ago

I also now added thread synchronization using a CountDownLatch so I make sure the subsribe is only done after the successful connect. The issue persists, however.

alfred2g commented 4 months ago

Hello @kriehzoo , Try to expand the certificates permissions to allow any client ID or to create a separate cert/key for testing purposes if you cannot modify the one you are currently using and try the uuid generated client id again. Also be sure to specify that the client ID should be generated and used at time of client creation, and not stored anywhere to prevent the identical generated client id from being used again.

bretambrose commented 4 months ago

I feel like we are not going in the right direction in diagnosing this.

What we need are SDK logs at the Debug level. You can find documentation for doing so here: https://awslabs.github.io/aws-crt-java/software/amazon/awssdk/crt/Log.html

I will also echo what's been said previously: this behavior only occurs if multiple clients are fighting over the same client id.

Beyond logs, a complete working sample that repros the issue would also be helpful.

kriehzoo commented 4 months ago

Hi @bretambrose with SDK logs at Debug level you mean CRT Logging? I enabled that before to get trace output and provided the relevant part. I also provided my full client code to AWS support in the related support case. Could you pls get in contact with them because I suppose I am not allowed by my company to post the full Java class here.

kriehzoo commented 4 months ago

@alfred2g I will try that and let you know

kriehzoo commented 4 months ago

@alfred2g I added a second IoT policy that allows any client using the same certificate and I changed to use a UUID as client ID. Now I do not see any connection related errors since more than 5 minutes which should be enough to verify it is working. That being said, I then changed it back to use the thingName and it is also working. That is unexpected.

bretambrose commented 4 months ago

Hi @bretambrose with SDK logs at Debug level you mean CRT Logging? I enabled that before to get trace output and provided the relevant part. I also provided my full client code to AWS support in the related support case. Could you pls get in contact with them because I suppose I am not allowed by my company to post the full Java class here.

Please include full debug logs. Snippets aren't of much use.

kriehzoo commented 4 months ago

@alfred2g I also removed the second IoT policy again and still I do not get the SESSION_TAKEN_OVER so far. I will keep an eye on it.

kriehzoo commented 4 months ago

What about the waitForNoResources(), am I supposed to call it when closing my client? As mentioned, it always hangs there for some time, then is interrupted with an exception. Would it be safe to remove it?

kriehzoo commented 4 months ago

@bretambrose next time it happens I will send you full debug logs. Thanks for your support, also to @alfred2g

bretambrose commented 4 months ago

So this only happens with low frequency?

kriehzoo commented 4 months ago

@bretambrose it happened every time when I started the client and now it seems to be gone and to be honest, I am not sure, why

kriehzoo commented 4 months ago

If you could give me a quick indication on waitForNoResources() I could probably resolve this issue.

bretambrose commented 4 months ago

@bretambrose it happened every time when I started the client and now it seems to be gone and to be honest, I am not sure, why

Seems very likely that a second process -- either on your machine or on a colleague's machine -- was using the same client id and account.

You don't need to bother with waitForNoResources. That's a public-internal API we used to verify native resources aren't leaking within tests/samples.

kriehzoo commented 4 months ago

@bretambrose thank you for the quick response, so I will remove the call to waitForNoResources() After some testing I will update this issue.

kriehzoo commented 4 months ago

@bretambrose issue cannot be reproduced so I will close it

github-actions[bot] commented 4 months ago

This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.