aws-greengrass / aws-greengrass-nucleus

The Greengrass nucleus component provides functionality for device side orchestration of deployments and lifecycle management for execution of Greengrass components and applications. This includes features such as starting, stopping, and monitoring execution of components and apps, interprocess communication server for communication between components, component installation and configuration management.
Apache License 2.0
109 stars 45 forks source link

nucleus: system service loader deleted if first ever deployment to thing fails #1635

Closed Dizzzmas closed 3 months ago

Dizzzmas commented 3 months ago

Describe the bug A new thing is provisioned with fleet plugin. A deployment exists for its group, so rollout starts immediately. Deployment fails for the thing. Rollback is successful. Checking /alts/current/distro directory and it's empty. Unless a follow-up successful deployment happens, greengrass won't be able to run on reboot because of missing /alts/current/distro/bin/loader.

To Reproduce Perform a failed first deployment for a newly provisioned thing. Check if /alts/current/distro/bin/loader exists after rollback.

My deployment manifest:

{
    "targetArn": "arn:aws:iot:eu-west-2:accountId:thinggroup/MyGroup",
    "deploymentName": "Test fleet deployment",
    "components": {
        "com.example.helloWorld": {
            "componentVersion": "1.0.0",
            "runWith": {}
        },
        "aws.greengrass.LogManager": {
          "componentVersion": "2.3.7",
          "configurationUpdate": {
                "merge": "{\"logsUploaderConfiguration\":{\"systemLogsConfiguration\": 
                   {\"minimumLogLevel\":\"INFO\",\"uploadToCloudWatch\":true},\"componentLogsConfigurationMap\": 
                   {\"com.example.hello\":{\"minimumLogLevel\":\"INFO\",\"uploadToCloudwatch\":true}}}}"
          }
    },
      "aws.greengrass.Cloudwatch": {
            "componentVersion": "3.1.0",
                 "runWith": {
                     "posixUser": "root"
                 }
          },
      "aws.greengrass.ShadowManager": {
            "componentVersion": "2.3.8",
                "configurationUpdate": {
                     "merge": "{\"synchronize\":{\"coreThing\":{\"classic\":true, \"namedShadows\": [\"config\"] } }}"
                 }

    }

    },
    "deploymentPolicies": {
        "failureHandlingPolicy": "ROLLBACK",
        "componentUpdatePolicy": {
            "timeoutInSeconds": 60,
            "action": "NOTIFY_COMPONENTS"
        },
        "configurationValidationPolicy": {
            "timeoutInSeconds": 60
        }
    },
    "iotJobConfiguration": {}
}

Expected behavior After rollback greengrass is not missing essential loader for system service under /alts/current/distro/bin/loader which would make greengrass unsalvageable upon reboot.

Actual behavior After rollback greengrass is missing essential loader for system service under /alts/current/distro/bin/loader and reboot would cause the system service to no longer be able to run.

Environment

Looks related to: https://github.com/aws-greengrass/aws-greengrass-nucleus/issues/1342

Dizzzmas commented 3 months ago

Here are the logs from the thing when this happens:

2024-07-09T12:10:07.664Z [INFO] (Thread-6) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Successfully connected to AWS IoT Core. {clientId=Recorder_EMUL3, sessionPresent=false}
2024-07-09T12:10:07.763Z [INFO] (main) com.aws.greengrass.lifecyclemanager.Kernel: effective-config-dump-complete. {file=/greengrass/v2/config/effectiveConfig.yaml}
2024-07-09T12:10:07.778Z [INFO] (main) com.aws.greengrass.lifecyclemanager.KernelLifecycle: system-start. {main=services.main:NEW}
2024-07-09T12:10:07.841Z [INFO] (aws.greengrass.Nucleus-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=aws.greengrass.Nucleus, currentState=NEW, newState=INSTALLED}
2024-07-09T12:10:07.845Z [INFO] (TelemetryAgent-lifecycle) com.aws.greengrass.telemetry.TelemetryAgent: service-set-state. {serviceName=TelemetryAgent, currentState=NEW, newState=INSTALLED}
2024-07-09T12:10:07.848Z [INFO] (FleetStatusService-lifecycle) com.aws.greengrass.status.FleetStatusService: service-set-state. {serviceName=FleetStatusService, currentState=NEW, newState=INSTALLED}
2024-07-09T12:10:07.843Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-set-state. {serviceName=DeploymentService, currentState=NEW, newState=INSTALLED}
2024-07-09T12:10:07.853Z [INFO] (UpdateSystemPolicyService-lifecycle) com.aws.greengrass.lifecyclemanager.UpdateSystemPolicyService: service-set-state. {serviceName=UpdateSystemPolicyService, currentState=NEW, newState=INSTALLED}
2024-07-09T12:10:07.856Z [INFO] (main-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=main, currentState=NEW, newState=INSTALLED}
2024-07-09T12:10:07.913Z [INFO] (TelemetryAgent-lifecycle) com.aws.greengrass.telemetry.TelemetryAgent: service-set-state. {serviceName=TelemetryAgent, currentState=INSTALLED, newState=STARTING}
2024-07-09T12:10:07.914Z [INFO] (aws.greengrass.Nucleus-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=aws.greengrass.Nucleus, currentState=INSTALLED, newState=STARTING}
2024-07-09T12:10:07.914Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-set-state. {serviceName=DeploymentService, currentState=INSTALLED, newState=STARTING}
2024-07-09T12:10:07.915Z [INFO] (FleetStatusService-lifecycle) com.aws.greengrass.status.FleetStatusService: service-set-state. {serviceName=FleetStatusService, currentState=INSTALLED, newState=STARTING}
2024-07-09T12:10:07.921Z [INFO] (UpdateSystemPolicyService-lifecycle) com.aws.greengrass.lifecyclemanager.UpdateSystemPolicyService: service-set-state. {serviceName=UpdateSystemPolicyService, currentState=INSTALLED, newState=STARTING}
2024-07-09T12:10:07.943Z [INFO] (pool-3-thread-14) com.aws.greengrass.deployment.DeploymentService: Received deployment in the queue. {DeploymentId=a0dd6d9a-b9d0-4bf2-986f-11710f1e1a20, GreengrassDeploymentId=edcfec89-8dbe-4653-8273-1151b2c7f109, DeploymentType=IOT_JOBS
2024-07-09T12:10:07.956Z [INFO] (UpdateSystemPolicyService-lifecycle) com.aws.greengrass.lifecyclemanager.UpdateSystemPolicyService: service-set-state. {serviceName=UpdateSystemPolicyService, currentState=STARTING, newState=RUNNING}
2024-07-09T12:10:07.948Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-set-state. {serviceName=DeploymentService, currentState=STARTING, newState=RUNNING}
2024-07-09T12:10:07.968Z [INFO] (FleetStatusService-lifecycle) com.aws.greengrass.status.FleetStatusService: service-set-state. {serviceName=FleetStatusService, currentState=STARTING, newState=RUNNING}
2024-07-09T12:10:07.974Z [INFO] (TelemetryAgent-lifecycle) com.aws.greengrass.telemetry.TelemetryAgent: service-set-state. {serviceName=TelemetryAgent, currentState=STARTING, newState=RUNNING}
2024-07-09T12:10:07.979Z [INFO] (pool-3-thread-11) com.aws.greengrass.lifecyclemanager.GenericExternalService: generic-service-finished. Nothing done. {serviceName=aws.greengrass.Nucleus, currentState=STARTING}
2024-07-09T12:10:07.981Z [INFO] (aws.greengrass.Nucleus-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=aws.greengrass.Nucleus, currentState=STARTING, newState=FINISHED}
2024-07-09T12:10:08.025Z [INFO] (main-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=main, currentState=INSTALLED, newState=STARTING}
2024-07-09T12:10:08.030Z [INFO] (aws.greengrass.Nucleus-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=aws.greengrass.Nucleus, currentState=FINISHED, newState=STOPPING}
2024-07-09T12:10:08.061Z [INFO] (pool-3-thread-12) com.aws.greengrass.lifecyclemanager.GenericExternalService: generic-service-finished. Nothing done. {serviceName=main, currentState=STARTING}
2024-07-09T12:10:08.063Z [INFO] (main-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=main, currentState=STARTING, newState=FINISHED}
2024-07-09T12:10:08.065Z [INFO] (main-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=main, currentState=FINISHED, newState=STOPPING}
2024-07-09T12:10:08.069Z [INFO] (pool-3-thread-12) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutdown initiated. {serviceName=main, currentState=STOPPING}
2024-07-09T12:10:08.075Z [INFO] (pool-3-thread-11) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutdown initiated. {serviceName=aws.greengrass.Nucleus, currentState=STOPPING}
2024-07-09T12:10:08.082Z [INFO] (pool-3-thread-12) com.aws.greengrass.lifecyclemanager.GenericExternalService: generic-service-shutdown. {serviceName=main, currentState=STOPPING}
2024-07-09T12:10:08.084Z [INFO] (main-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=main, currentState=STOPPING, newState=FINISHED}
2024-07-09T12:10:08.086Z [INFO] (pool-3-thread-11) com.aws.greengrass.lifecyclemanager.GenericExternalService: generic-service-shutdown. {serviceName=aws.greengrass.Nucleus, currentState=STOPPING}
2024-07-09T12:10:08.087Z [INFO] (aws.greengrass.Nucleus-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=aws.greengrass.Nucleus, currentState=STOPPING, newState=FINISHED}
2024-07-09T12:10:08.260Z [INFO] (pool-3-thread-14) com.aws.greengrass.deployment.DeploymentStatusKeeper: Stored deployment status. {DeploymentId=a0dd6d9a-b9d0-4bf2-986f-11710f1e1a20, GreengrassDeploymentId=edcfec89-8dbe-4653-8273-1151b2c7f109, DeploymentStatus=IN_PROGRE
2024-07-09T12:10:08.266Z [INFO] (main) com.aws.greengrass.status.FleetStatusService: fss-status-update-published. Status update published to FSS. {trigger=NUCLEUS_LAUNCH, serviceName=FleetStatusService, currentState=RUNNING}
2024-07-09T12:10:08.296Z [INFO] (pool-3-thread-14) com.aws.greengrass.deployment.IotJobsHelper: Updating status of persisted deployment. {Status=IN_PROGRESS, StatusDetails={}, ThingName=Recorder_EMUL3, JobId=a0dd6d9a-b9d0-4bf2-986f-11710f1e1a20}
2024-07-09T12:10:08.391Z [INFO] (Thread-6) com.aws.greengrass.deployment.IotJobsHelper: Received Iot job description. {Status=IN_PROGRESS, queueAt=Tue Jul 09 12:07:33 UTC 2024, ThingName=Recorder_EMUL3, JobId=a0dd6d9a-b9d0-4bf2-986f-11710f1e1a20}
2024-07-09T12:10:08.393Z [INFO] (Thread-6) com.aws.greengrass.deployment.IotJobsHelper: Duplicate or outdated job notification. Ignoring.. {ThingName=Recorder_EMUL3, JobId=a0dd6d9a-b9d0-4bf2-986f-11710f1e1a20}
2024-07-09T12:10:08.591Z [INFO] (Thread-6) com.aws.greengrass.deployment.IotJobsHelper: Job status update was accepted. {Status=IN_PROGRESS, ThingName=Recorder_EMUL3, JobId=a0dd6d9a-b9d0-4bf2-986f-11710f1e1a20}
2024-07-09T12:10:08.679Z [INFO] (pool-3-thread-14) com.aws.greengrass.deployment.DeploymentService: Started deployment execution. {serviceName=DeploymentService, currentState=RUNNING, deployment=a0dd6d9a-b9d0-4bf2-986f-11710f1e1a20}
2024-07-09T12:10:08.857Z [INFO] (pool-3-thread-7) com.aws.greengrass.componentmanager.ComponentStore: delete-component-finish. {componentIdentifier=aws.greengrass.TokenExchangeService-v2.0.3}
2024-07-09T12:10:08.963Z [INFO] (pool-3-thread-7) com.aws.greengrass.componentmanager.ComponentStore: delete-component-finish. {componentIdentifier=aws.greengrass.ShadowManager-v2.3.8}
2024-07-09T12:10:09.027Z [INFO] (pool-3-thread-7) com.aws.greengrass.componentmanager.ComponentStore: delete-component-finish. {componentIdentifier=com.example.hello-v1.0.60}
2024-07-09T12:10:09.064Z [INFO] (pool-3-thread-7) com.aws.greengrass.componentmanager.ComponentStore: delete-component-finish. {componentIdentifier=aws.greengrass.LogManager-v2.3.7}
2024-07-09T12:10:09.163Z [INFO] (pool-3-thread-7) com.aws.greengrass.componentmanager.ComponentStore: delete-component-finish. {componentIdentifier=aws.greengrass.Cloudwatch-v3.1.0}
2024-07-09T12:10:09.325Z [INFO] (pool-3-thread-7) com.aws.greengrass.componentmanager.ComponentStore: delete-component-finish. {componentIdentifier=aws.greengrass.Nucleus-v2.12.6}
2024-07-09T12:10:10.127Z [INFO] (pool-2-thread-3) com.aws.greengrass.status.FleetStatusService: fss-status-update-published. Status update published to FSS. {trigger=COMPONENT_STATUS_CHANGE, serviceName=FleetStatusService, currentState=RUNNING}
2024-07-09T12:10:23.683Z [INFO] (pool-3-thread-14) com.aws.greengrass.deployment.DeploymentService: Current deployment finished. {DeploymentId=a0dd6d9a-b9d0-4bf2-986f-11710f1e1a20, GreengrassDeploymentId=edcfec89-8dbe-4653-8273-1151b2c7f109, serviceName=DeploymentServic
2024-07-09T12:10:23.739Z [ERROR] (pool-3-thread-14) com.aws.greengrass.deployment.DeploymentService: Deployment task failed with following errors. {DeploymentId=a0dd6d9a-b9d0-4bf2-986f-11710f1e1a20, detailed-deployment-status=FAILED_ROLLBACK_COMPLETE, deployment-error-t
com.aws.greengrass.deployment.exceptions.DeploymentException: Service com.example.hello in broken state after deployment
    at com.aws.greengrass.deployment.KernelUpdateDeploymentTask.getDeploymentStatusDetails(KernelUpdateDeploymentTask.java:142)
    at com.aws.greengrass.deployment.KernelUpdateDeploymentTask.call(KernelUpdateDeploymentTask.java:76)
    at com.aws.greengrass.deployment.KernelUpdateDeploymentTask.call(KernelUpdateDeploymentTask.java:36)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

2024-07-09T12:10:23.869Z [INFO] (pool-3-thread-14) com.aws.greengrass.deployment.DeploymentStatusKeeper: Stored deployment status. {DeploymentId=a0dd6d9a-b9d0-4bf2-986f-11710f1e1a20, GreengrassDeploymentId=edcfec89-8dbe-4653-8273-1151b2c7f109, DeploymentStatus=FAILED}
2024-07-09T12:10:23.878Z [INFO] (pool-3-thread-14) com.aws.greengrass.deployment.IotJobsHelper: Updating status of persisted deployment. {Status=FAILED, StatusDetails={deployment-error-stack=[DEPLOYMENT_FAILURE, COMPONENT_UPDATE_ERROR, COMPONENT_BROKEN], deployment-erro
2024-07-09T12:10:23.983Z [INFO] (Thread-6) com.aws.greengrass.deployment.IotJobsHelper: Job status update was accepted. {Status=FAILED, ThingName=Recorder_EMUL3, JobId=a0dd6d9a-b9d0-4bf2-986f-11710f1e1a20}
2024-07-09T12:10:24.131Z [INFO] (pool-3-thread-14) com.aws.greengrass.status.FleetStatusService: fss-status-update-published. Status update published to FSS. {trigger=THING_GROUP_DEPLOYMENT, serviceName=FleetStatusService, currentState=RUNNING}
2024-07-09T12:10:24.143Z [INFO] (pool-3-thread-14) com.aws.greengrass.deployment.DeploymentDirectoryManager: Persist link to last deployment. {link=/greengrass/v2/deployments/previous-failure}
2024-07-09T12:10:24.499Z [INFO] (Thread-6) com.aws.greengrass.deployment.IotJobsHelper: Received empty jobs in notification . {ThingName=Recorder_EMUL3}
Dizzzmas commented 3 months ago

I guess the culprit is this line in the logs which deletes nucleus:

com.aws.greengrass.componentmanager.ComponentStore: delete-component-finish. {componentIdentifier=aws.greengrass.Nucleus-v2.12.6}

I've revised the deployment a couple times and seems as soon as I include any aws-owned component like shadow manager or log manager, nucleus gets deleted on rollback.

MikeDombo commented 3 months ago

How did you install Greengrass? What are the commands you ran and what was the configuration file that you used?

This will happen if you do not put the correct version number of Nucleus in the initial configuration.

Dizzzmas commented 3 months ago

Thanks @MikeDombo! You rock! That was my exact issue. The discrepancy between the installed nucleus version 2.12.6 and the one specified in config.yaml 2.12.4 . After updating config.yaml to have matching version, deployment rollback no longer kills nucleus.

This happened because when building my OS image that includes greengrass I was downloading the latest version:

curl -s https://d2s8p88vqu9w66.cloudfront.net/releases/greengrass-nucleus-latest.zip > greengrass-nucleus-latest.zip

But then forgot to update the code that was patching it with config.yaml that was created quite some time ago and was assuming version 2.12.4

Will make sure to download fixed version of greengrass.

MikeDombo commented 3 months ago

Glad that fixed it! Definitely pinning to a specific version in your image is the safest thing to do.