aws-greengrass / aws-greengrass-shadow-manager

A GreengrassV2 Component that provides offline device shadow documents and optional synchronization to the IoT device shadow service.
Apache License 2.0
9 stars 5 forks source link

Shadow Manager Exception when writing to DB #186

Closed pwolfe1 closed 1 year ago

pwolfe1 commented 1 year ago

Describe the bug I have had the latest ShadowManager and Nucleus deployed on a device I have been developing on for a few days. Today, all my components that relied on reading shadows failed due to the ShadowManager component entering the broken state. I was able to get a stack trace from Greengrass's logs that show that the ShadowManager was broken because of an exception thrown here. See the logs pasted below for more details. Because shadows are a core part of our system this render much of the device's functionality broken and is considered a critical failure.

To Reproduce I was not able to reproduce this issue although it persisted until I downgraded my Nucleus back to 2.9.5 and Shadow Manager to 2.2.4. After deploying those versions the shadows became synced between the cloud and device and read/write-able as expected.

Rebooting the device, power cycling the device, systemctl restart-ing greengrass, reinstalling the ShadowManager component via the LocalDebugConsole, re-deploying the same version of ShadowManager, and redeploying the components without ShadowManager then re-adding it in a second deployment all yielded the same results (logs shown below).

Note: This broken state was first incurred during the 6/13 us-east-1 outage. My device is on us-east-1.

Environment

Greengrass logs captured after hitting "Reinstall" on the local debug console. The exception is thrown 3 times and then the component enters BROKEN.

2023-06-13T20:35:28.417Z [INFO] (WebSocketWorker-62) com.aws.greengrass.localdebugconsole.SimpleHttpServer: Requested reinstall from aws.greengrass.ShadowManager. {serviceName=aws.greengrass.LocalDebugConsole, currentState=RUNNING}
2023-06-13T20:35:28.418Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=BROKEN, newState=NEW}
2023-06-13T20:35:28.460Z [INFO] (pool-2-thread-23) org.flywaydb.core.internal.database.DatabaseFactory: Database: jdbc:h2:/greengrass/v2/work/aws.greengrass.ShadowManager/shadow (H2 1.4). {}
2023-06-13T20:35:28.522Z [INFO] (pool-2-thread-23) org.flywaydb.core.internal.command.DbValidate: Successfully validated 1 migration (execution time 00:00.049s). {}
2023-06-13T20:35:28.563Z [INFO] (pool-2-thread-23) org.flywaydb.core.internal.command.DbMigrate: Current version of schema "PUBLIC": 1. {}
2023-06-13T20:35:28.572Z [INFO] (pool-2-thread-23) org.flywaydb.core.internal.command.DbMigrate: Schema "PUBLIC" is up to date. No migration necessary.. {}
2023-06-13T20:35:28.645Z [INFO] (pool-2-thread-23) com.aws.greengrass.shadowmanager.ShadowManager: config. {syncStrategy=realTime, serviceName=aws.greengrass.ShadowManager, currentState=NEW}
2023-06-13T20:35:28.645Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=NEW, newState=INSTALLED}
2023-06-13T20:35:28.659Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=INSTALLED, newState=STARTING}
2023-06-13T20:35:28.663Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=STARTING, newState=RUNNING}
2023-06-13T20:35:28.700Z [ERROR] (pool-2-thread-24) com.aws.greengrass.shadowmanager.ShadowManager: service-errored. {serviceName=aws.greengrass.ShadowManager, currentState=RUNNING}
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@65eae692[Not completed, task = com.aws.greengrass.shadowmanager.ShadowManagerDAOImpl$$Lambda$469/0x000000010044c840@484c05b7] rejected from java.util.concurrent.ThreadPoolExecutor@112efa7a[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
    at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
    at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
    at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140)
    at com.aws.greengrass.shadowmanager.ShadowManagerDAOImpl.executeWriteOperation(ShadowManagerDAOImpl.java:373)
    at com.aws.greengrass.shadowmanager.ShadowManagerDAOImpl.insertSyncInfoIfNotExists(ShadowManagerDAOImpl.java:343)
    at com.aws.greengrass.shadowmanager.ShadowManager.insertSyncInfoIfNotPresent(ShadowManager.java:437)
    at com.aws.greengrass.shadowmanager.ShadowManager.initializeSyncInfo(ShadowManager.java:432)
    at com.aws.greengrass.shadowmanager.ShadowManager.startSyncingShadows(ShadowManager.java:537)
    at com.aws.greengrass.shadowmanager.ShadowManager.setupSync(ShadowManager.java:396)
    at com.aws.greengrass.shadowmanager.ShadowManager.startup(ShadowManager.java:480)
    at com.aws.greengrass.lifecyclemanager.Lifecycle.lambda$handleStateTransitionStartingToRunningAsync$9(Lifecycle.java:591)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

2023-06-13T20:35:28.702Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=RUNNING, newState=ERRORED}
2023-06-13T20:35:28.735Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.status.FleetStatusService: fss-status-update-published. Status update published to FSS. {trigger=ERRORED_COMPONENT, serviceName=FleetStatusService, currentState=RUNNING}
2023-06-13T20:35:28.738Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=ERRORED, newState=STOPPING}
2023-06-13T20:35:28.799Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=STOPPING, newState=INSTALLED}
2023-06-13T20:35:28.803Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=INSTALLED, newState=STARTING}
2023-06-13T20:35:28.816Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=STARTING, newState=RUNNING}
2023-06-13T20:35:28.831Z [ERROR] (pool-2-thread-24) com.aws.greengrass.shadowmanager.ShadowManager: service-errored. {serviceName=aws.greengrass.ShadowManager, currentState=RUNNING}
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@68ea093a[Not completed, task = com.aws.greengrass.shadowmanager.ShadowManagerDAOImpl$$Lambda$469/0x000000010044c840@60033dd5] rejected from java.util.concurrent.ThreadPoolExecutor@112efa7a[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
    at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
    at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
    at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140)
    at com.aws.greengrass.shadowmanager.ShadowManagerDAOImpl.executeWriteOperation(ShadowManagerDAOImpl.java:373)
    at com.aws.greengrass.shadowmanager.ShadowManagerDAOImpl.insertSyncInfoIfNotExists(ShadowManagerDAOImpl.java:343)
    at com.aws.greengrass.shadowmanager.ShadowManager.insertSyncInfoIfNotPresent(ShadowManager.java:437)
    at com.aws.greengrass.shadowmanager.ShadowManager.initializeSyncInfo(ShadowManager.java:432)
    at com.aws.greengrass.shadowmanager.ShadowManager.startSyncingShadows(ShadowManager.java:537)
    at com.aws.greengrass.shadowmanager.ShadowManager.setupSync(ShadowManager.java:396)
    at com.aws.greengrass.shadowmanager.ShadowManager.startup(ShadowManager.java:480)
    at com.aws.greengrass.lifecyclemanager.Lifecycle.lambda$handleStateTransitionStartingToRunningAsync$9(Lifecycle.java:591)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

2023-06-13T20:35:28.832Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=RUNNING, newState=ERRORED}
2023-06-13T20:35:28.834Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.status.FleetStatusService: fss-status-update-published. Status update published to FSS. {trigger=ERRORED_COMPONENT, serviceName=FleetStatusService, currentState=RUNNING}
2023-06-13T20:35:28.839Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=ERRORED, newState=STOPPING}
2023-06-13T20:35:28.864Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=STOPPING, newState=INSTALLED}
2023-06-13T20:35:28.867Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=INSTALLED, newState=STARTING}
2023-06-13T20:35:28.871Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=STARTING, newState=RUNNING}
2023-06-13T20:35:28.901Z [ERROR] (pool-2-thread-24) com.aws.greengrass.shadowmanager.ShadowManager: service-errored. {serviceName=aws.greengrass.ShadowManager, currentState=RUNNING}
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@19487583[Not completed, task = com.aws.greengrass.shadowmanager.ShadowManagerDAOImpl$$Lambda$469/0x000000010044c840@4a4bc7fd] rejected from java.util.concurrent.ThreadPoolExecutor@112efa7a[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
    at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
    at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
    at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140)
    at com.aws.greengrass.shadowmanager.ShadowManagerDAOImpl.executeWriteOperation(ShadowManagerDAOImpl.java:373)
    at com.aws.greengrass.shadowmanager.ShadowManagerDAOImpl.insertSyncInfoIfNotExists(ShadowManagerDAOImpl.java:343)
    at com.aws.greengrass.shadowmanager.ShadowManager.insertSyncInfoIfNotPresent(ShadowManager.java:437)
    at com.aws.greengrass.shadowmanager.ShadowManager.initializeSyncInfo(ShadowManager.java:432)
    at com.aws.greengrass.shadowmanager.ShadowManager.startSyncingShadows(ShadowManager.java:537)
    at com.aws.greengrass.shadowmanager.ShadowManager.setupSync(ShadowManager.java:396)
    at com.aws.greengrass.shadowmanager.ShadowManager.startup(ShadowManager.java:480)
    at com.aws.greengrass.lifecyclemanager.Lifecycle.lambda$handleStateTransitionStartingToRunningAsync$9(Lifecycle.java:591)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

2023-06-13T20:35:28.902Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.shadowmanager.ShadowManager: service-set-state. {serviceName=aws.greengrass.ShadowManager, currentState=RUNNING, newState=BROKEN}
2023-06-13T20:35:28.904Z [INFO] (aws.greengrass.ShadowManager-lifecycle) com.aws.greengrass.status.FleetStatusService: fss-status-update-published. Status update published to FSS. {trigger=BROKEN_COMPONENT, serviceName=FleetStatusService, currentState=RUNNING}

I am not sure why the pool size for the dbWriteThreadPool (as defined here) is logged as 0.

MikeDombo commented 1 year ago

Please restart Greengrass, do not use the local debug console to reinstall Shadow manager. Does the issue happen?

pwolfe1 commented 1 year ago

I unfortunately don't have a reproduction but when it was in the bad state yesterday restarting Greengrass did not resolve the problem.

I see there's a PR up now that is intended to address this so feel free to close the issue.

MikeDombo commented 1 year ago

Restarting Greengrass would solve this particular issue.

If you have logs from the issue which didn't involve using local debug console to reinstall, I'd be very interested.

pwolfe1 commented 1 year ago

If you have logs from the issue which didn't involve using local debug console to reinstall, I'd be very interested.

Here are some more logs:

This is what greengrass.log contained when my components weren't able to read shadows locally:

2023-06-13T23:23:21.488Z [INFO] (Thread-11) software.amazon.awssdk.eventstreamrpc.RpcServer: New connection code [AWS_ERROR_SUCCESS] for [Id 37, Class ServerConnection, Refs 1](2023-06-13T23:23:21.487736Z) - <null>. {}
2023-06-13T23:23:21.509Z [INFO] (Thread-11) software.amazon.awssdk.eventstreamrpc.ServiceOperationMappingContinuationHandler: aws.greengrass#GreengrassCoreIPC authenticated identity: PatricksComponent. {}
2023-06-13T23:23:21.510Z [INFO] (Thread-11) software.amazon.awssdk.eventstreamrpc.ServiceOperationMappingContinuationHandler: Connection accepted for PatricksComponent. {}
2023-06-13T23:23:21.512Z [INFO] (Thread-11) software.amazon.awssdk.eventstreamrpc.ServiceOperationMappingContinuationHandler: Sending connect response for PatricksComponent. {}
2023-06-13T23:23:21.620Z [ERROR] (Thread-11) com.aws.greengrass.ipc.common.ExceptionUtil: Unhandled exception in IPC. {}
java.lang.NullPointerException
        at com.aws.greengrass.shadowmanager.ShadowManagerDAOImpl.getShadowThing(ShadowManagerDAOImpl.java:61)
        at com.aws.greengrass.shadowmanager.ipc.GetThingShadowRequestHandler.lambda$handleRequest$0(GetThingShadowRequestHandler.java:89)
        at com.aws.greengrass.ipc.common.ExceptionUtil.translateExceptions(ExceptionUtil.java:33)
        at com.aws.greengrass.shadowmanager.ipc.GetThingShadowRequestHandler.handleRequest(GetThingShadowRequestHandler.java:75)
        at com.aws.greengrass.shadowmanager.ipc.GetThingShadowIPCHandler.handleRequest(GetThingShadowIPCHandler.java:82)
        at com.aws.greengrass.shadowmanager.ipc.GetThingShadowIPCHandler.handleRequest(GetThingShadowIPCHandler.java:28)
        at software.amazon.awssdk.eventstreamrpc.OperationContinuationHandler.onContinuationMessage(OperationContinuationHandler.java:291)
        at software.amazon.awssdk.crt.eventstream.ServerConnectionContinuationHandler.onContinuationMessageShim(ServerConnectionContinuationHandler.java:53)

2023-06-13T23:23:21.643Z [ERROR] (Thread-11) com.aws.greengrass.ipc.common.ExceptionUtil: Unhandled exception in IPC. {}
java.lang.NullPointerException
        at com.aws.greengrass.shadowmanager.ShadowManagerDAOImpl.getShadowThing(ShadowManagerDAOImpl.java:61)
        at com.aws.greengrass.shadowmanager.ipc.GetThingShadowRequestHandler.lambda$handleRequest$0(GetThingShadowRequestHandler.java:89)
        at com.aws.greengrass.ipc.common.ExceptionUtil.translateExceptions(ExceptionUtil.java:33)
        at com.aws.greengrass.shadowmanager.ipc.GetThingShadowRequestHandler.handleRequest(GetThingShadowRequestHandler.java:75)
        at com.aws.greengrass.shadowmanager.ipc.GetThingShadowIPCHandler.handleRequest(GetThingShadowIPCHandler.java:82)
        at com.aws.greengrass.shadowmanager.ipc.GetThingShadowIPCHandler.handleRequest(GetThingShadowIPCHandler.java:28)
        at software.amazon.awssdk.eventstreamrpc.OperationContinuationHandler.onContinuationMessage(OperationContinuationHandler.java:291)
        at software.amazon.awssdk.crt.eventstream.ServerConnectionContinuationHandler.onContinuationMessageShim(ServerConnectionContinuationHandler.java:53)

This is what I saw when restarting, then re-deploying Nucleus, ShadowManager, LocalDebugConsole, and the CLI:

2023-06-14T04:02:20.567Z [INFO] (AwsEventLoop 2) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Successfully connected to AWS IoT Core. {clientId=3343ffe8-78d0-4d97-ae77-0fd6c1f702b5, sessionPresent=false}
2023-06-14T04:02:20.678Z [INFO] (main) com.aws.greengrass.lifecyclemanager.Kernel: effective-config-dump-complete. {file=/greengrass/v2/config/effectiveConfig.yaml}
2023-06-14T04:02:20.679Z [INFO] (main) com.aws.greengrass.lifecyclemanager.KernelLifecycle: system-start. {main=services.main:NEW}
2023-06-14T04:02:20.689Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-set-state. {serviceName=DeploymentService, currentState=NEW, newState=INSTALLED}
2023-06-14T04:02:20.690Z [INFO] (aws.greengrass.Nucleus-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=aws.greengrass.Nucleus, currentState=NEW, newState=INSTALLED}
2023-06-14T04:02:20.691Z [INFO] (aws.greengrass.LocalDebugConsole-lifecycle) com.aws.greengrass.localdebugconsole.SimpleHttpServer: service-set-state. {serviceName=aws.greengrass.LocalDebugConsole, currentState=NEW, newState=INSTALLED}
2023-06-14T04:02:20.692Z [INFO] (TelemetryAgent-lifecycle) com.aws.greengrass.telemetry.TelemetryAgent: service-set-state. {serviceName=TelemetryAgent, currentState=NEW, newState=INSTALLED}
2023-06-14T04:02:20.692Z [INFO] (FleetStatusService-lifecycle) com.aws.greengrass.status.FleetStatusService: service-set-state. {serviceName=FleetStatusService, currentState=NEW, newState=INSTALLED}
2023-06-14T04:02:20.696Z [INFO] (main-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=main, currentState=NEW, newState=INSTALLED}
2023-06-14T04:02:20.698Z [INFO] (UpdateSystemPolicyService-lifecycle) com.aws.greengrass.lifecyclemanager.UpdateSystemPolicyService: service-set-state. {serviceName=UpdateSystemPolicyService, currentState=NEW, newState=INSTALLED}
2023-06-14T04:02:20.698Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-set-state. {serviceName=DeploymentService, currentState=INSTALLED, newState=STARTING}
2023-06-14T04:02:20.719Z [INFO] (TelemetryAgent-lifecycle) com.aws.greengrass.telemetry.TelemetryAgent: service-set-state. {serviceName=TelemetryAgent, currentState=INSTALLED, newState=STARTING}
2023-06-14T04:02:20.729Z [INFO] (aws.greengrass.Nucleus-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=aws.greengrass.Nucleus, currentState=INSTALLED, newState=STARTING}
2023-06-14T04:02:20.730Z [INFO] (UpdateSystemPolicyService-lifecycle) com.aws.greengrass.lifecyclemanager.UpdateSystemPolicyService: service-set-state. {serviceName=UpdateSystemPolicyService, currentState=INSTALLED, newState=STARTING}
2023-06-14T04:02:20.731Z [INFO] (FleetStatusService-lifecycle) com.aws.greengrass.status.FleetStatusService: service-set-state. {serviceName=FleetStatusService, currentState=INSTALLED, newState=STARTING}
2023-06-14T04:02:20.733Z [INFO] (aws.greengrass.LocalDebugConsole-lifecycle) com.aws.greengrass.localdebugconsole.SimpleHttpServer: service-set-state. {serviceName=aws.greengrass.LocalDebugConsole, currentState=INSTALLED, newState=STARTING}
2023-06-14T04:02:20.739Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-set-state. {serviceName=DeploymentService, currentState=STARTING, newState=RUNNING}
2023-06-14T04:02:20.740Z [INFO] (TelemetryAgent-lifecycle) com.aws.greengrass.telemetry.TelemetryAgent: service-set-state. {serviceName=TelemetryAgent, currentState=STARTING, newState=RUNNING}
2023-06-14T04:02:20.743Z [INFO] (pool-2-thread-13) com.aws.greengrass.lifecyclemanager.GenericExternalService: generic-service-finished. Nothing done. {serviceName=aws.greengrass.Nucleus, currentState=STARTING}
2023-06-14T04:02:20.743Z [INFO] (aws.greengrass.Nucleus-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=aws.greengrass.Nucleus, currentState=STARTING, newState=FINISHED}
2023-06-14T04:02:20.745Z [INFO] (FleetStatusService-lifecycle) com.aws.greengrass.status.FleetStatusService: service-set-state. {serviceName=FleetStatusService, currentState=STARTING, newState=RUNNING}
2023-06-14T04:02:20.748Z [INFO] (UpdateSystemPolicyService-lifecycle) com.aws.greengrass.lifecyclemanager.UpdateSystemPolicyService: service-set-state. {serviceName=UpdateSystemPolicyService, currentState=STARTING, newState=RUNNING}
2023-06-14T04:02:20.752Z [INFO] (aws.greengrass.Nucleus-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=aws.greengrass.Nucleus, currentState=FINISHED, newState=STOPPING}
2023-06-14T04:02:20.759Z [INFO] (pool-2-thread-20) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutdown initiated. {serviceName=aws.greengrass.Nucleus, currentState=STOPPING}
2023-06-14T04:02:20.763Z [INFO] (pool-2-thread-20) com.aws.greengrass.lifecyclemanager.GenericExternalService: generic-service-shutdown. {serviceName=aws.greengrass.Nucleus, currentState=STOPPING}
2023-06-14T04:02:20.764Z [INFO] (aws.greengrass.Nucleus-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=aws.greengrass.Nucleus, currentState=STOPPING, newState=FINISHED}
2023-06-14T04:02:20.804Z [INFO] (main) com.aws.greengrass.status.FleetStatusService: fss-status-update-published. Status update published to FSS. {trigger=NUCLEUS_LAUNCH, serviceName=FleetStatusService, currentState=RUNNING}
2023-06-14T04:02:20.810Z [INFO] (pool-2-thread-15) org.flywaydb.core.internal.license.VersionPrinter: Flyway Community Edition 6.5.5 by Redgate. {}
2023-06-14T04:02:20.858Z [INFO] (AwsEventLoop 2) com.aws.greengrass.deployment.IotJobsHelper: No deployment job found. {ThingName=3343ffe8-78d0-4d97-ae77-0fd6c1f702b5}
2023-06-14T04:02:20.994Z [INFO] (AwsEventLoop 2) com.aws.greengrass.deployment.ShadowDeploymentListener: Deployment result already reported. Ignoring shadow update at startup. {CONFIGURATION_ARN=arn:aws:greengrass:us-east-1:379266526677:configuration:thing/3343ffe8-78d0-4d97-ae77-0fd6c1f702b5:16}
2023-06-14T04:02:21.580Z [INFO] (pool-2-thread-15) org.flywaydb.core.internal.database.DatabaseFactory: Database: jdbc:h2:/greengrass/v2/work/aws.greengrass.ShadowManager/shadow (H2 1.4). {}
2023-06-14T04:02:21.993Z [INFO] (pool-2-thread-15) org.flywaydb.core.internal.command.DbValidate: Successfully validated 1 migration (execution time 00:00.144s). {}
2023-06-14T04:02:22.085Z [INFO] (pool-2-thread-15) org.flywaydb.core.internal.command.DbMigrate: Current version of schema "PUBLIC": 1. {}
2023-06-14T04:02:22.105Z [INFO] (pool-2-thread-15) org.flywaydb.core.internal.command.DbMigrate: Schema "PUBLIC" is up to date. No migration necessary.. {}
2023-06-14T04:02:22.387Z [ERROR] (pool-2-thread-15) com.aws.greengrass.shadowmanager.ShadowManager: service-errored. {serviceName=aws.greengrass.ShadowManager, currentState=NEW}
java.lang.IllegalArgumentException: configuration in services.aws.greengrass.ShadowManager is already a leaf, cannot become a container
        at com.aws.greengrass.config.Topics.createInteriorChild(Topics.java:165)
        at com.aws.greengrass.config.Topics.createInteriorChild(Topics.java:151)
        at com.aws.greengrass.config.Topics.lookupTopics(Topics.java:233)
        at com.aws.greengrass.config.Topics.lookupTopics(Topics.java:219)
        at com.aws.greengrass.shadowmanager.ShadowManager.install(ShadowManager.java:245)
        at com.aws.greengrass.shadowmanager.ShadowManager.install(ShadowManager.java:225)
        at com.aws.greengrass.lifecyclemanager.Lifecycle.lambda$handleCurrentStateNew$5(Lifecycle.java:502)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

For this log, ShadowManager was showing up as in the RUNNING state but IPC to it to retrieve shadows failed. Reinstalling via the debug console led to the stack trace that was posted in the OP.

MikeDombo commented 1 year ago

The nullpointerexception that you saw should only happen when shadow manager isn't running yet. This will be the case if your components which are using IPC to interact with shadow manager do not have a dependency on shadow manager in their recipe, thus allowing them to startup before shadow manager is running. At the time that happened, did your component declare a dependency on shadow manager in its recipe?

The second error configuration in services.aws.greengrass.ShadowManager is already a leaf shouldn't ever happen under normal circumstances. It would lead me to believe that you possibly tried to modify the configuration of shadow manager via the local debug console in an incorrect way. Basically that error is saying that we expect "configuration" is a map, but in this case it wasn't a map but an individual value like a string or a number.

pwolfe1 commented 1 year ago

At the time that happened, did your component declare a dependency on shadow manager in its recipe?

No, so that would plausibly explain that stack trace.

It would lead me to believe that you possibly tried to modify the configuration of shadow manager via the local debug console in an incorrect way. Basically that error is saying that we expect "configuration" is a map, but in this case it wasn't a map but an individual value like a string or a number.

This is possible as I was messing around with the configuration a bunch, although the fact that it was after a deployment makes me think the configuration should have been in an okay state. Considering the stack trace is config related and I was messing around with the configuration I don't think we should put too much stock in it though.

What I will do is upgrade my deployments back up to the latest Nucleus and ShadowManager and see if I can re-trigger this state. That way we can try things directly and record precisely the logs we want / need to debug.

Thanks for the help on this Mike

pwolfe1 commented 1 year ago

After upgrading back to Nucleus 2.10.2 and the corresponding other latest versions this failure does not appear to be happening.

MikeDombo commented 1 year ago

Thanks for verifying.

Given all that information and the PR, I will now close this issue.

pwolfe1 commented 1 year ago

@MikeDombo

I actually was able to reproduce the issue by attempting to reinstall ShadowManager 2.3.2 from the local debug console. I think you are correct that this was always the source of the stack trace I initially sent, as even when starting from a "working" starting point attempting to reinstall a ShadowManager that is on v2.3.2 leads to the exception (this is in line with what #187 addresses).

I believe what happened yesterday is that I was not aware of this interaction between the local debug console reinstall and ShadowManager 2.3.2, as on earlier versions it did not happen. So I was redeploying / restarting and then verifying if it was working or not via clicking reinstall - obviously this "verification" was actually what was causing the breakage in the first place.

I think the initial breakage may have come from the us-east-1 outage based on the timing (or at least one of my components broke which led to me trying to reinstall things). You also correctly pointed out that my components need a dependency on ShadowManager which I also believe contributed to some of the errors I observed.

Given what we have discussed and what I've observed I am relatively satisfied that #187 resolves the actual bug here.

Apologies for the somewhat confusing series of logs and thank you for the patience and speedy resolution!

MikeDombo commented 1 year ago

Yes, reinstallation is expected to fail, you shouldn't do that. It is addressed by the PR.