awslabs / aws-iot-device-client

The AWS IoT Device Client provides device-side functionality for AWS IoT Services.
Apache License 2.0
131 stars 68 forks source link

Greengrass SecureTunneling component crashes on start with `software.amazon.awssdk.crt.mqtt.MqttException` #346

Closed Staff-d closed 1 year ago

Staff-d commented 2 years ago

Describe the bug

The Greengrass aws.greengrass.SecureTunneling component crashes on start with software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.. if no internet connection is available and is marked as broken afterwards. To Reproduce

Steps to reproduce the behavior:

  1. Install Greengrass on a device of your choice
  2. Create a deployment containing the SecureTunneling component. Deploy the deployment to your device.
  3. Connect to your device via a secure tunnel to make sure that the component works
  4. Disconnect or otherwise disrupt the internet connection to your device. Reboot your device.
  5. Wait for a few minutes while the device boots and greengrass starts.
  6. Reconnect to the internet.

Expected behavior

I expect that the secure tunneling component becomes available some time after reconnecting a greengrass device to the internet.

Actual behavior

The SecureTunneling component is marked as broken, as it does not handle the software.amazon.awssdk.crt.mqtt.MqttException exception, which is raised when no connection to IoT Core can be made. Connection to the device is not possible until a reboot occurs or the gg service is restarted via systemctl.

Logs

I have attached two logs for you: The output of the secure tunneling component and the contents of the greengrass.log. As they are quite long I'd like to highlight a few key events for you. My comments are marked with a #:

# The device is started without a connection to the internet
2022-10-21T07:21:02.059Z [INFO] (main) com.aws.greengrass.util.platforms.Platform: Getting platform instance com.aws.greengrass.util.platforms.unix.linux.LinuxPlatform.. {}
[...]
# The SecureTunneling component enters the running state
2022-10-21T07:21:16.609Z [INFO] (aws.greengrass.SecureTunneling-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=aws.greengrass.SecureTunneling, currentState=STARTING, newState=RUNNING}
[...]
# The SecureTunneling component tries to subscribe to the tunnel topic and an exception is raised
2022-10-21T07:21:43.803Z [ERROR] (Thread-3) com.aws.greengrass.mqttclient.MqttClient: Error subscribing. {topic=$aws/things/GreengrassTestVM/tunnels/notify}
java.util.concurrent.CompletionException: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:870)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
    at software.amazon.awssdk.crt.AsyncCallback$1.onFailure(AsyncCallback.java:32)
    at software.amazon.awssdk.crt.mqtt.MqttClientConnection.onConnectionComplete(MqttClientConnection.java:140)
Caused by: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
    ... 1 more
[... 3 addtional subscription attempts omitted ...]
# The component enters the broken state and is stuck there
2022-10-21T07:22:09.899Z [INFO] (aws.greengrass.SecureTunneling-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=aws.greengrass.SecureTunneling, currentState=RUNNING, newState=BROKEN}

I've attached both the log files of the SecureTunneling component and the greengrass system. Both logs where created at the same device, when a device was booted without an internet connection available. Later, this connection was restored. Sometime later the service was restarted via systemctl restart greengrass after the SecureTunneling component did not recover. After that the connection via secure tunneling worked.

github-securetunneling.log github-greengrass.log

Environment (please complete the following information):

Additional context

I am running greengrass and the secure tunneling component on a mobile device which is connected to the internet via a LTE connection. This connection will always take some time to established and may be interrupted during operation, at least for a short while. The secure tunneling component must thus be able to handle periods with out internet connection, after which is must be able to resume sevice. F.e. the nucleues or the cloudwatch component seems to be able to reconnect after any amount of time, which is what I would expect from the SecureTunneling component.

HarshGandhi-AWS commented 1 year ago

Hey @Staff-d , thank you for reaching out to us with this issue. Please give us some time to review this issue and we will get back to you once we have more understanding about it.

Thanks, Harsh Gandhi

HarshGandhi-AWS commented 1 year ago

Hello @Staff-d can you please update to newer version of Greengrass nucleus? It looks like us are using an older version(2.7.0).

Staff-d commented 1 year ago

Hello @HarshGandhi-AWS ,

To test if the issue still exists I created a new VM based on ubuntu 22.04 LTS, installed the newest greengrass version

java -jar ./GreengrassInstaller/lib/Greengrass.jar --version
AWS Greengrass v2.9.2

and created a deployment with the newest version of the secure tunneling component which I deployed to the VM:

{
    "targetArn": "arn:aws:iot:eu-central-1:056026206238:thing/GreengrassTestVM",
    "deploymentName": "tunnelingtestdeployment",
    "components": {
        "aws.greengrass.SecureTunneling": {
            "componentVersion": "1.0.11"
        },
        "aws.greengrass.Cli": {
            "componentVersion": "2.9.2",
            "configurationUpdate": null
        }
    },
    "deploymentPolicies": {
        "failureHandlingPolicy": "ROLLBACK",
        "componentUpdatePolicy": {
            "action": "NOTIFY_COMPONENTS",
            "timeoutInSeconds": 60
        }
    }
}

I disconnected the VM from the internet and performed a reboot (as an alternative you can also restart the greengrass service using systemctl). The component shows the same behavior as with version 2.7.0: It tries to connect to IoT core three times, enters the broken state when it fails to do so and does not recover even if the connection is restored.

2022-12-23T14:37:20.042Z [INFO] (Copier) aws.greengrass.SecureTunneling: stdout. [ERROR] 2022-12-23 15:37:20.014 [main] SecureTunnelingExecutor - Exception while running secure tunneling.. {scriptName=services.aws.greengrass.SecureTunneling.lifecycle.run.script, serviceName=aws.greengrass.SecureTunneling, currentState=RUNNING}
2022-12-23T14:37:20.043Z [INFO] (Copier) aws.greengrass.SecureTunneling: stdout. java.util.concurrent.ExecutionException: software.amazon.awssdk.aws.greengrass.model.ServiceError: Subscribe to topic $aws/things/GreengrassTestVM/tunnels/notify failed with error java.util.concurrent.ExecutionException: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.. {scriptName=services.aws.greengrass.SecureTunneling.lifecycle.run.script, serviceName=aws.greengrass.SecureTunneling, currentState=RUNNING}

sudo /greengrass/v2/bin/greengrass-cli component details -n aws.greengrass.SecureTunneling:

Dec 23, 2022 3:40:46 PM software.amazon.awssdk.eventstreamrpc.EventStreamRPCConnection$1 onConnectionSetup
INFO: Socket connection /greengrass/v2/ipc.socket:8033 to server result [AWS_ERROR_SUCCESS]
Dec 23, 2022 3:40:46 PM software.amazon.awssdk.eventstreamrpc.EventStreamRPCConnection$1 onProtocolMessage
INFO: Connection established with event stream RPC server
Component Name: aws.greengrass.SecureTunneling
    Version: 1.0.11
    State: BROKEN
    Configuration: {"accessControl":{"aws.greengrass.ipc.mqttproxy":{"aws.greengrass.SecureTunneling:mqttproxy:1":{"operations":["aws.greengrass#SubscribeToIoTCore"],"policyDescription":"Access to tunnel notification pubsub topic","resources":["$aws/things/+/tunnels/notify"]}}},"OS_DIST_INFO":"auto"}

The greengrass core software DOES recover and connects to the IoT core as soon as the internet connection comes back up (around the 15:38:38 CET/ 14:38:38 UTC mark):

2022-12-23T14:38:37.265Z [ERROR] (pool-2-thread-12) com.aws.greengrass.mqttclient.MqttClient: Error when publishing from spooler. {}
java.util.concurrent.ExecutionException: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
    at com.aws.greengrass.mqttclient.MqttClient.runSpooler(MqttClient.java:644)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: software.amazon.awssdk.crt.mqtt.MqttException: A query to dns failed to resolve.
    at software.amazon.awssdk.crt.mqtt.MqttClientConnection.onConnectionComplete(MqttClientConnection.java:140)

2022-12-23T14:38:47.934Z [INFO] (pool-2-thread-12) com.aws.greengrass.mqttclient.AwsIotMqttClient: Connecting to AWS IoT Core. {clientId=GreengrassTestVM}
2022-12-23T14:38:59.045Z [INFO] (Thread-6) com.aws.greengrass.mqttclient.AwsIotMqttClient: Connecting to AWS IoT Core. {clientId=GreengrassTestVM}
2022-12-23T14:38:59.046Z [INFO] (Thread-6) com.aws.greengrass.mqttclient.AwsIotMqttClient: Connection purposefully interrupted. {clientId=GreengrassTestVM}
2022-12-23T14:38:59.249Z [INFO] (Thread-6) com.aws.greengrass.mqttclient.AwsIotMqttClient: Successfully connected to AWS IoT Core. {clientId=GreengrassTestVM, sessionPresent=false}

Here are the greengrass and securetunneling log files for this experiment:

greengrass-log-v2.9.2.txt securetunneling-log-v2.9.2.txt

I could also provide the qcow image of the virtual machine, should you need it.

Best regards and happy new year :slightly_smiling_face: Sebastian

shangabl commented 1 year ago

Hi Sebastian,

We have deployed the latest 1.0.12 release of the Green Grass Secure Tunneling Component and that should resolve this issue!

Thanks for your help Shane