aws-greengrass / aws-greengrass-nucleus

The Greengrass nucleus component provides device-side orchestration of deployments and lifecycle management for Greengrass components and applications. This includes starting, stopping, and monitoring the execution of components and apps, an interprocess communication server for communication between components, and component installation and configuration management.
Apache License 2.0

(Nucleus): component is not gracefully terminated when a new version is deployed #1667

Open timvlaer opened 7 hours ago

timvlaer commented 7 hours ago

Describe the bug

When I deploy a new version of my component, the currently running version is not shut down gracefully but is killed immediately.

To Reproduce

  1. Make a component with a signal handler
  2. Deploy the component
  3. Update the component version and observe which signals are delivered. Check the time between SIGTERM and SIGKILL.
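
For step 1, a minimal sketch of such a component could look like the following. It is an illustration only, not the actual component from this report: the names `received`, `handle_signal`, `install_handlers`, and `main` are hypothetical, and it assumes a Linux device where the component script runs under python3.

```python
import os
import signal
import sys
import time

# (signal name, monotonic timestamp) tuples, in arrival order.
received = []


def handle_signal(signum, frame):
    # Only record the signal here; logging happens in the main loop.
    received.append((signal.Signals(signum).name, time.monotonic()))


def install_handlers():
    # SIGKILL cannot be caught, so only catchable signals are registered.
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handle_signal)


def main():
    install_handlers()
    print(f"pid {os.getpid()} waiting for signals", flush=True)
    while not received:
        time.sleep(0.1)
    name, _ = received[0]
    print(f"received {name}, cleaning up", flush=True)
    sys.exit(0)
```

Calling `main()` at the bottom of the script starts the wait loop. If the process exits with code 0 and the "received SIGTERM" line shows up in the component log, the graceful path was taken; if neither appears, the process was most likely killed before the handler could run.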

Expected behavior

I expected to receive a SIGTERM first and then, after a grace period, a hard SIGKILL.

Actual behavior

The application is killed immediately. The logs say that a SIGTERM is sent (force=false) and then a SIGKILL (force=true), but I don't see the SIGTERM. Either I don't get a SIGTERM at all, or I don't have time to react to it.

Environment

Additional context

In the debug logs, a couple of things seem weird to me:

2024-11-14T11:31:28.998Z [INFO] (Serialized listener processor) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-config-change. Requesting restart for component. {configNode=services.com.bf.ddd.Ble.lifecycle.Run.Setenv.PYTHONPATH, serviceName=com.bf.ddd.Ble, currentState=STOPPING}
2024-11-14T11:31:28.999Z [INFO] (pool-2-thread-25) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutdown initiated. {serviceName=com.bf.ddd.Ble, currentState=STOPPING}
2024-11-14T11:31:29.001Z [INFO] (pool-2-thread-25) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutting down process ["bluetoothctl power on\npython3 -u  /data/greengrass/v2/packages/artifacts-unarc..."]. {serviceName=com.bf.ddd.Ble, currentState=STOPPING}
2024-11-14T11:31:29.012Z [DEBUG] (pool-2-thread-25) org.zeroturnaround.process.PidUtil: Found PID for Process[pid=4479, exitValue="not exited"]: 4479. {}
2024-11-14T11:31:29.014Z [INFO] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Killing child processes of pid 4479, force is false. {}
2024-11-14T11:31:29.017Z [DEBUG] (pool-2-thread-25) org.zeroturnaround.process.PidUtil: Found PID for Process[pid=4479, exitValue="not exited"]: 4479. {}
2024-11-14T11:31:29.126Z [INFO] (Serialized listener processor) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-config-change. Requesting reinstallation for component. {configNode=services.com.bf.ddd.Ble.version, serviceName=com.bf.ddd.Ble, currentState=STOPPING}
2024-11-14T11:31:29.143Z [DEBUG] (pool-2-thread-24) com.aws.greengrass.deployment.activator.DeploymentActivator: merge-config. Applied new service config. Waiting for services to complete update. {serviceToTrack=[services.aws.greengrass.ShadowManager, services.aws.greengrass.DiskSpooler, services.aws.greengrass.Nucleus:FINISHED, services.aws.greengrass.LogManager, services.com.bf.ddd.Ble:STOPPING], mergeTime=1731583888628}
2024-11-14T11:31:29.410Z [DEBUG] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Found children of 4479. []. {}
2024-11-14T11:31:29.413Z [DEBUG] (pool-2-thread-25) org.zeroturnaround.process.PidUtil: Found PID for Process[pid=4479, exitValue="not exited"]: 4479. {}
2024-11-14T11:31:29.414Z [INFO] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Killing child processes of pid 4479, force is true. {}
2024-11-14T11:31:29.416Z [DEBUG] (pool-2-thread-25) org.zeroturnaround.process.PidUtil: Found PID for Process[pid=4479, exitValue="not exited"]: 4479. {}
2024-11-14T11:31:29.766Z [DEBUG] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Found children of 4479. []. {}
2024-11-14T11:31:29.773Z [DEBUG] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Killing pid 4479 with signal 9 using kill -9 4479. {}
2024-11-14T11:31:29.782Z [DEBUG] (AwsEventLoop 1) software.amazon.awssdk.eventstreamrpc.OperationContinuationHandler: aws.greengrass#SubscribeToIoTCore stream continuation closed.. {}
2024-11-14T11:31:29.789Z [DEBUG] (AwsEventLoop 1) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Unsubscribing from topic. {clientId=Dock-d49cdd487732, topic=bf/Dock-d49cdd487732/measurements/openmhealth/+/accepted}
2024-11-14T11:31:29.793Z [DEBUG] (AwsEventLoop 1) software.amazon.awssdk.eventstreamrpc.OperationContinuationHandler: aws.greengrass#SubscribeToTopic stream continuation closed.. {}
2024-11-14T11:31:29.795Z [DEBUG] (AwsEventLoop 1) com.aws.greengrass.builtin.services.pubsub.PubSubIPCEventStreamAgent: Unsubscribed from topic $aws/things/Dock-d49cdd487732/shadow/name/linked-devices/update/accepted. {componentName=com.bf.ddd.Ble}
2024-11-14T11:31:29.798Z [INFO] (AwsEventLoop 1) software.amazon.awssdk.eventstreamrpc.RpcServer: Server connection closed code [socket is closed.]: [Id 37, Class ServerConnection, Refs 1](2024-11-14T11:25:49.483789Z) - <null>. {}
2024-11-14T11:31:29.842Z [WARN] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: kill exited non-zero (process not found or other error). {stdout=, pid=4479, exit-code=1, stderr=kill: (4479): No such process}
2024-11-14T11:31:29.847Z [INFO] (pool-2-thread-25) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutdown completed for process ["bluetoothctl power on\npython3 -u  /data/greengrass/v2/packages/artifacts-unarc..."]. {serviceName=com.bf.ddd.Ble, currentState=STOPPING}
2024-11-14T11:31:29.849Z [INFO] (pool-2-thread-25) com.aws.greengrass.lifecyclemanager.GenericExternalService: generic-service-shutdown. {serviceName=com.bf.ddd.Ble, currentState=STOPPING}
2024-11-14T11:31:29.853Z [INFO] (Copier) com.aws.greengrass.lifecyclemanager.GenericExternalService: Run script exited. {exitCode=137, serviceName=com.bf.ddd.Ble, currentState=STOPPING}
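
To put a number on how little time passes between the two kill attempts, the timestamps in the log above can be subtracted. The sketch below copies three of the log lines verbatim; the helper names `parse_timestamp` and `gap_ms` are hypothetical and only serve this illustration.

```python
from datetime import datetime, timedelta

# Three lines copied from the debug log above.
LOG_LINES = [
    "2024-11-14T11:31:29.014Z [INFO] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Killing child processes of pid 4479, force is false. {}",
    "2024-11-14T11:31:29.414Z [INFO] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Killing child processes of pid 4479, force is true. {}",
    "2024-11-14T11:31:29.773Z [DEBUG] (pool-2-thread-25) com.aws.greengrass.util.platforms.Platform: Killing pid 4479 with signal 9 using kill -9 4479. {}",
]


def parse_timestamp(line):
    # The timestamp is the first space-separated field of each log line.
    return datetime.strptime(line.split(" ", 1)[0], "%Y-%m-%dT%H:%M:%S.%fZ")


def gap_ms(earlier, later):
    # Whole milliseconds between two log lines.
    delta = parse_timestamp(later) - parse_timestamp(earlier)
    return delta // timedelta(milliseconds=1)


print(gap_ms(LOG_LINES[0], LOG_LINES[1]))  # force=false -> force=true
print(gap_ms(LOG_LINES[0], LOG_LINES[2]))  # force=false -> kill -9
```

By this measure the forced kill is requested 400 ms after the non-forced one, and `kill -9` is issued 759 ms after the first attempt.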

timvlaer commented 7 hours ago

When I restart Greengrass via systemctl (systemctl restart greengrass), shutdown works as expected: my component gets a SIGTERM and terminates properly (exit code 0).

(I get exit code 137 when the component is killed.)
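
Exit code 137 is consistent with a SIGKILL: by common shell convention, a process terminated by signal N is reported with status 128 + N, and SIGKILL is signal 9 on Linux. A small sketch for decoding such codes, assuming that convention applies to the run script here (the function name `describe_exit_code` is hypothetical):

```python
import signal


def describe_exit_code(code):
    # Shell convention: status above 128 means "terminated by signal (code - 128)".
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited with status {code}"


print(describe_exit_code(137))
print(describe_exit_code(0))
```

Under the same convention, a process that dies from an unhandled SIGTERM would report 143 rather than 137.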

timvlaer commented 7 hours ago

Right now, I cannot quickly bump the aws.greengrass.Nucleus component to the latest 2.13.0 because it's incompatible with aws.greengrass.ShadowManager 2.3.6 which I also use. I'd like to avoid bumping all my dependencies without good reason, so let me know if you think it's worth the effort.