AWS IoT Java SDK fails to publish after a period of inactivity and blocks retries for ~3s

nakulny commented 3 years ago

Hey AWS SDK Team,

We at Latch believe to have noticed this bug where seemingly after a period of inactivity (cause unconfirmed), the AWS IoT Java SDK (com.amazonaws:aws-iot-device-sdk-java:1.3.9) will throw an exception when we try to publish a message to an MQTT channel. This exception is only thrown intermittently and otherwise the publish attempts succeed as expected.

Here's the function we call (mqttClient is an instance of AWSIotMqttClient): mqttClient.publish(topic, QOS, jwt, IOT_TIMEOUT_MS); Here's the exception:

com.amazonaws.services.iot.client.AWSIotException: null
    at com.amazonaws.services.iot.client.core.AwsIotCompletion.get(AwsIotCompletion.java:213)
    at com.amazonaws.services.iot.client.core.AbstractAwsIotClient.publish(AbstractAwsIotClient.java:237)
    at com.amazonaws.services.iot.client.AWSIotMqttClient.publish(AWSIotMqttClient.java:793)
    ...

Therefore, for the time being we've wrapped this code block in a retry on exception strategy as follows:

        int attempts = 0;
        boolean success = false;
        do {
            try {
                mqttClient.publish(topic, QOS, jwt, IOT_TIMEOUT_MS);
                LOGGER.info("Published MQTT Message for Intercom {}", message.getIntercomId());
                success = true;
            } catch (AWSIotException | AWSIotTimeoutException e) {
                LOGGER.warn("Attempt #{} Failed due to {}", attempts, e);
                attempts++;
            }
        } while (!success && attempts < MAX_RETRY + 1);

We use a QoS of 1.

However the issue is that in the event we get this intermittent exception (the one specified above), all retries fail for 3s. For example if we set an IOT_TIMEOUT_MS value of 500ms. We see 5 retry attempts fail due to com.amazonaws.services.iot.client.AWSIotTimeoutException, and the 6th retry succeed. Here are some logs of this happening for the block above: extract-2021-06-17_10-06-88 (1).csv

If I set an IOT_TIMEOUT_MS value of 5000ms (5s), then I see the first retry attempt succeed after 3s. It seems something in the AWS IoT Java SDK needs 3s to recover when that exception is thrown.

We see this issue across multiple environments.

This 3s delay is happening to roughly 15% of our traces for a particular service, and the delay in this functionality is starting to frustrate our customers. We originally filed this bug via the AWS Console's Support Center in order to utilize our Premium Support plan with AWS, but we were directed to file the issue here.

Can you please clarify whether this is a bug in the AWS IoT Java SDK, and how we can solve this issue?

Thanks, Nakul Software Engineer Latch

jmklix commented 3 years ago

Have you noticed a threshold of inactivity when this starts happening?

nakulny commented 3 years ago

When testing in our dev environment, it always seemed more likely that if I waited over an hour, the first trace had a higher probability of reproducing this behavior. It is not consistent though.

For example with our Production application, I can see a host have a successful trace without the publish delay, but then only 7 minutes later experience this issue causing the 3s delay on the new trace. So this behavior doesn't suggest the issue is related to periods of inactivity.

It is definitely possible the period of inactivity is not the actual trigger for the bug, but rather there is some race condition lying underneath. We do see this delayed case relatively often though and can reproduce the issue when running a cycle test that serially executes this code path.

nakulny commented 3 years ago

Okay I had a hunch that the AWSIotMqttClient is disconnecting under the hood so I modified the code block to be as follows:

        int attempts = 0;
        boolean success = false;
        do {
            AWSIotConnectionStatus awsIotConnectionStatus = mqttClient.getConnectionStatus();
            try {
                mqttClient.publish(topic, QOS, jwt, IOT_TIMEOUT_MS);
                LOGGER.info(
                        "Published MQTT Message for Intercom {}, AWS IoT Client Connection Status {}",
                        message.getIntercomId(),
                        awsIotConnectionStatus);
                success = true;
            } catch (AWSIotException | AWSIotTimeoutException e) {
                LOGGER.warn(
                        "AWS IoT Client Connection Status {}, Attempt #{} Failed due to {}",
                        awsIotConnectionStatus,
                        attempts,
                        e);
                attempts++;
            }
        } while (!success && attempts < MAX_RETRY + 1);

I've attached the updated logs around an error trace. extract-2021-07-13_14-07-01.csv

So it seems that on Attempt #0 the AWSIotMqttClient believes it is CONNECTED, but something goes wrong represented by this exception: "com.amazonaws.services.iot.client.AWSIotException: null."

Then the AWSIotMqttClient takes one attempt for its connection status to update to DISCONNECTED. During this period all publish attempts time out ("com.amazonaws.services.iot.client.AWSIotTimeoutException: Request timed out when processing request"). Then eventually after the ~3 seconds the AWSIotMqttClient reports RECONNECTING status and reconnects for Attempt #7 to succeed.

Given the extra information, please let us know if this is a valid bug! Also let us know if there are some configurations we should try tweaking to avoid this delay due to disconnection. Thanks!

TingDaoK commented 3 years ago

I am not so sure about your ask. But due to my understanding, the behavior happens because if the timeout is set longer than the connection get reconnected, the publish will eventually succeed (That's the functionality from underlying paho Client). Our Iot SDK also has the timeout (IOT_TIMEOUT_MS you set) that will fail the request if the timeout happens before the connection reconnects.

Also, it sounds a reasonable latency for me to have ~ 3 secs for client to notice the connection lost. We can have follow up if your request is to have the client to detect the connection status changed faster.

nakulny commented 3 years ago

Anything we can do to either:

(A) Prevent these disconnections from happening as often as they do OR (B) Speed up the reconnection process

will be greatly appreciated. I am open to changing Keep Alive intervals and whatever other AWSIotMqttClient configurations we feel may help. Please follow up my request to have the client detect the connection status changes faster too if we think that will speed up the reconnection.

For reference 12% of our traces involving this flow for Jul 14, 12:00 am – Jul 15, 11:59 pm hit this minimum 3s delay issue, which at the moment is creating a poor User experience for our Customers.

I can obviously change the IOT_TIMEOUT_MS to be longer than 3s to prevent the timeout errors, but that does not solve the underlying delay which is frustrating our Customers.

TingDaoK commented 3 years ago

Do you connect with AWS IoT Core? Or any other IoT broker?

TingDaoK commented 3 years ago

I am not sure what's the root cause is. I think it's worth to try set Keep Alive intervals to something like 1 sec and see how the reconnect process works.

As we don't accept any feature request for v1s and that's not a bug to me, I'll recommend migrating to v2 (Note: Although, for v2, we depend on the OS to detect the socket is closed, and as long as we find out the connection is lost, we will try to reconnect asap, it may still be some latency. It's basically the best as SDK we can do)

For "Prevent these disconnections from happening as often as they do", I think it's a server related problem. As a client, we never disconnect until user requests.

nakulny commented 3 years ago

We use AWS IoT Core.

Since you are able to reproduce the delays due to reconnecting on V1, are you able to advise whether you see the same delays on the V2 SDK? Also are you seeing a similar amount of traces (12%) delayed due to reconnection when testing the V1?

In the meantime, I can try altering the Keep Alive value and Base Retry Delay value.

nakulny commented 3 years ago

Also any recommendations on debugging the "server related problem" with disconnections would be welcome.

bretambrose commented 3 years ago

https://docs.aws.amazon.com/iot/latest/developerguide/cwl-format.html may be helpful.

nakulny commented 3 years ago

Hey all,

We've found that reducing the keep alive to 1s and base retry value to 500ms seem to have eliminated those formerly frequent disconnects so far. Here's the latency chart for our related trace: Screen Shot 2021-07-30 at 12 29 56 PM

We can see the p95 and p90 latency greatly stabilize once I deployed the change.

I might recommend your team considers dropping the default values, but I will leave that to y'all. Thanks for the assistance!

github-actions[bot] commented 3 years ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

aws / aws-iot-device-sdk-java

AWS IoT Java SDK fails to publish after a period of inactivity and blocks retries for ~3s #170

⚠️COMMENT VISIBILITY WARNING⚠️