aws-greengrass / aws-greengrass-nucleus

The Greengrass nucleus component provides device-side orchestration of deployments and lifecycle management for Greengrass components and applications. This includes starting, stopping, and monitoring components and apps, an interprocess communication server for communication between components, and component installation and configuration management.
Apache License 2.0

Resource leak when using IPC subscription #1650

Closed: erikfinnman closed this issue 2 months ago

erikfinnman commented 2 months ago

Describe the bug

We have detected what appears to be a resource leak in the Greengrass nucleus related to IPC subscriptions.

The Python code below, run inside a Greengrass component, repeatedly creates an IPC client, sets up a topic subscription, and then closes both the subscription and the client. The underlying resources do not appear to be freed:

To Reproduce

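    import logging
    import time
    import uuid
    from typing import Union

    from awsiot.greengrasscoreipc.clientv2 import GreengrassCoreIPCClientV2
    from awsiot.greengrasscoreipc.model import SubscriptionResponseMessage

    # Assumed logger setup; the original report does not show how `log` is created.
    log = logging.getLogger(__name__)

    log.info("Mem-test of IPC client")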
    count = 0
    while True:
        ipc_client = GreengrassCoreIPCClientV2()
        request_id = uuid.uuid1()
        response_topic = f"dummy_method-response-{request_id}"
        def response_listener(message: SubscriptionResponseMessage) -> None:
            log.info("Response listener")
        def error_listener(_: Exception) -> Union[None, bool]:
            log.info("Error listener")
            return True

        _, operation = ipc_client.subscribe_to_topic(
            topic=response_topic,
            on_stream_event=response_listener,
            on_stream_error=error_listener,
        )
        operation.close()
        ipc_client.close()
        count += 1
        if count > 10000:
            log.info("Created %s clients", count)
            count = 0
            time.sleep(1)

Expected behavior

Closed resources are freed, JVM does not fail with java.lang.OutOfMemoryError: Java heap space.

Actual behavior

With the Greengrass heap limited to about 100 MB, memory is exhausted after about 15-20 minutes of running the above snippet. We can see this by enabling the Native Memory Tracking feature in the JVM.

The above code snippet was the most compact way we were able to replicate the problem we have been seeing on our production devices (but there the memory leak takes several weeks to manifest since we obviously don’t create clients as frequently as in the code snippet above).

Analyzing memory dumps of the JVM identifies the com.aws.greengrass.builtin.services.pubsub.PubSubIPCEventStreamAgent as the object retaining almost all memory. Digging into the references of this class reveals hundreds of thousands of objects of type java.util.concurrent.ConcurrentHashMap$Node which in turn have references to com.aws.greengrass.builtin.services.pubsub.SubscriptionTrie. It looks like this class contains the topic name of the generated subscription.
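To illustrate why a unique topic name per subscription would make such a map grow, here is a toy model in Python. It is not the Nucleus implementation: `ToySubscriptionRegistry` and `churn` are hypothetical names, and the sketch only assumes that a broker keeps a topic-keyed map of subscribers and that entries disappear only when an unsubscribe actually reaches it.

```python
from collections import defaultdict
from typing import Callable, Dict, Set

Callback = Callable[[str], None]


class ToySubscriptionRegistry:
    """Toy stand-in for a broker-side topic -> subscribers map (not Nucleus code)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, Set[Callback]] = defaultdict(set)

    def subscribe(self, topic: str, callback: Callback) -> None:
        self._subscribers[topic].add(callback)

    def unsubscribe(self, topic: str, callback: Callback) -> None:
        subscribers = self._subscribers.get(topic)
        if subscribers is None:
            return
        subscribers.discard(callback)
        # Drop the topic entry once nobody listens, otherwise the key is retained.
        if not subscribers:
            del self._subscribers[topic]

    def topic_count(self) -> int:
        return len(self._subscribers)


def churn(registry: ToySubscriptionRegistry, cycles: int, unsubscribe: bool) -> int:
    """Subscribe to `cycles` unique topics; optionally unsubscribe each one again."""
    for i in range(cycles):
        topic = f"dummy_method-response-{i}"
        callback: Callback = lambda message: None
        registry.subscribe(topic, callback)
        if unsubscribe:
            registry.unsubscribe(topic, callback)
    return registry.topic_count()
```

In this model, the number of retained topics equals the number of subscriptions whenever the unsubscribe step never runs, which would be consistent with the node counts seen in the heap dump if the close never reaches the broker.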

Studying the IPC documentation I can’t see anything obviously wrong with our code snippet - both the Greengrass IPC client and the subscription operations are closed - shouldn’t this free up all resources?

Environment

JDK: openjdk version "11.0.24" 2024-07-16
OpenJDK Runtime Environment (build 11.0.24+8-post-Debian-2deb11u1)
OpenJDK 64-Bit Server VM (build 11.0.24+8-post-Debian-2deb11u1, mixed mode)

Python: Python 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110] on linux

SDK: awsiotsdk==1.19.0

Greengrass Nucleus: 2.11.2

OS: Linux 5.15.61-v8+ #1579 SMP PREEMPT 2022 aarch64 GNU/Linux
Linux 5.15.0-112-generic #122-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux

Additional context

We also tried updating awsiotsdk to the latest version, but it made no difference.

MikeDombo commented 2 months ago

Hello, please follow the IPC best practices and create only 1 client per component. Do not create more than one client for any reason. https://docs.aws.amazon.com/greengrass/v2/developerguide/interprocess-communication.html#ipc-best-practices
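One way a component might enforce the one-client rule is a small shared accessor, sketched below. `SharedIPCClient` is a hypothetical helper, not part of the SDK. The default factory assumes the `GreengrassCoreIPCClientV2` class from the report, imported lazily so the module still loads where the SDK is not installed.

```python
import threading
from typing import Any, Callable, Optional


def _default_factory() -> Any:
    # Lazy import: only needed when a real client is actually created.
    from awsiot.greengrasscoreipc.clientv2 import GreengrassCoreIPCClientV2

    return GreengrassCoreIPCClientV2()


class SharedIPCClient:
    """Hypothetical helper that hands out one IPC client per process."""

    _instance: Optional[Any] = None
    _lock = threading.Lock()

    @classmethod
    def get(cls, factory: Callable[[], Any] = _default_factory) -> Any:
        # The lock keeps two threads from creating a client at the same time.
        with cls._lock:
            if cls._instance is None:
                cls._instance = factory()
            return cls._instance
```

Every part of the component would then call `SharedIPCClient.get()` instead of constructing its own client, and the client would be closed once at shutdown.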

For anything further please open an issue on the Python SDK as this is not a problem with Greengrass Nucleus. https://github.com/aws/aws-iot-device-sdk-python-v2

erikfinnman commented 2 months ago

Hi, thanks for the quick reply. I understand that this is the best practice - but should not the underlying resources be released when the client is closed? Or do you mean that this could be a problem in the Python SDK?

MikeDombo commented 2 months ago

I'm saying two things: 1) you should create only 1 client and reuse it, and 2) this is an issue with the Python client, not Nucleus. Nucleus cannot free the resources because the client is still connected.

erikfinnman commented 2 months ago

OK, so you're saying that the client is still connected even though it's closed? That does indeed sound like an issue in the Python client.

MikeDombo commented 2 months ago

Yes.

erikfinnman commented 2 months ago

Then I'll close this issue and create a new one for the Python SDK.

erikfinnman commented 2 months ago

I forgot to mention that I did indeed try the same code but with just one client.

I still get the same resource leak - would you say that it's still an issue in the Python SDK with a failure to close resources?
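For reference, the single-client variant can be sketched roughly as follows. This is a reconstruction under assumptions, not the exact code that was run: `run_subscription_cycles` is a hypothetical name, and the `subscribe_to_topic` call and `operation.close()` follow the snippet in the original report.

```python
import uuid
from typing import Any


def run_subscription_cycles(ipc_client: Any, cycles: int) -> int:
    """Open and close `cycles` subscriptions on one already-created client."""
    completed = 0
    for _ in range(cycles):
        topic = f"dummy_method-response-{uuid.uuid1()}"

        def on_event(message: Any) -> None:
            pass  # messages are ignored; only the subscribe/close churn matters

        def on_error(_: Exception) -> bool:
            return True

        # Same call shape as in the report: returns (response, operation).
        _, operation = ipc_client.subscribe_to_topic(
            topic=topic,
            on_stream_event=on_event,
            on_stream_error=on_error,
        )
        operation.close()
        completed += 1
    return completed
```

In a component, the client would be created once (for example `client = GreengrassCoreIPCClientV2()`), passed to `run_subscription_cycles` in a loop, and closed only on shutdown.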

jcosentino11 commented 2 months ago

Yep, that would still be an SDK issue. Looks like they're taking a look at it, thanks.