aws / aws-sdk-java-v2

The official AWS SDK for Java - Version 2
Apache License 2.0

Possible Memory Leak - Connection Reaper #1679

Closed: mtamassia closed this issue 4 years ago

mtamassia commented 4 years ago

We have an application that creates and destroys EC2, S3, and several other SDK clients.

We build clients within a thread and close() the clients within the thread. Example:

Ec2Client ec2Client = Ec2Client.builder()
        .region(Region.of(currentRegion))
        .credentialsProvider(InstanceProfileCredentialsProvider.builder().build())
        .build();

// DO SOMETHING

ec2Client.close();

Memory consumption grows over time. Here is a link to the memory heap dump, which is self-explanatory: https://heaphero.io/my-heap-report.jsp?p=YXJjaGl2ZWQvMjAyMC8wMy8yLy0tbWVtb3J5LnppcC0xOS01NC0zMi5qc29uLS0=

Expected Behavior

Calling close() on a client should not leave resources behind; threads and their associated data should be removed.

Current Behavior

Threads and their associated data remain after a client's close() is called, and memory consumption grows over time.


dagnir commented 4 years ago

Thanks for the report. Looks like the connection managers are not being ~~removed from the map during client shutdown.~~

Looks like I spoke too soon; looking over the code again, everything seems in order. A simple test that continuously creates a client, makes a request, and closes it keeps the reaper map size at 0. Do you have a repro case that we can run?
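
For reference, a minimal sketch of the kind of loop described above (the class name, region, and the DescribeRegions call are illustrative assumptions; the reaper map size was checked with a debugger/heap dump, not from code):

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.DescribeRegionsRequest;

public class ClientChurnRepro {
    public static void main(String[] args) {
        // Repeatedly create a client, make one request, and close it,
        // while watching heap usage / IdleConnectionReaper entries in a profiler.
        for (int i = 0; i < 10_000; i++) {
            try (Ec2Client ec2 = Ec2Client.builder()
                                          .region(Region.US_EAST_1)
                                          .build()) {
                ec2.describeRegions(DescribeRegionsRequest.builder().build());
            }
        }
    }
}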

mtamassia commented 4 years ago

Thank you for the quick response @dagnir

I have reviewed our code and found a couple of instances where close() was not being called properly; however, after the changes we still see the memory buildup, as you can see in the new report below. BTW, this time I forced a GC before taking the heap dump, just for sanity.

https://heaphero.io/my-heap-report.jsp?p=YXJjaGl2ZWQvMjAyMC8wMy81Ly0tbWVtb3J5My5kdW1wLmd6LTE2LTQ0LTMwLmpzb24tLQ==

Not sure if it matters but here are the SDK clients we are using and how we are instantiating each of them.

SecretsManagerClient smClient = SecretsManagerClient.builder()
        .credentialsProvider(InstanceProfileCredentialsProvider.builder().build())
        .build();

Ec2Client ec2Client = Ec2Client.builder()
        .region(Region.of(this.getRegion()))
        .credentialsProvider(InstanceProfileCredentialsProvider.builder().build())
        .build();

PricingClient pClient = PricingClient.builder()
        .region(Region.US_EAST_1)
        .credentialsProvider(InstanceProfileCredentialsProvider.builder().build())
        .build();

RdsClient rdsClient = RdsClient.builder()
        .region(Region.of(this.getRegion()))
        .credentialsProvider(InstanceProfileCredentialsProvider.builder().build())
        .build();

CloudTrailClient client = CloudTrailClient.builder()
        .credentialsProvider(InstanceProfileCredentialsProvider.builder().build())
        .overrideConfiguration(ClientOverrideConfiguration.builder()
                .retryPolicy(RetryPolicy.builder()
                        .throttlingBackoffStrategy(BackoffStrategy.defaultStrategy())
                        .build())
                .build())
        .build();

ElasticLoadBalancingV2Client client = ElasticLoadBalancingV2Client.builder()
        .region(Region.of(this.getRegion()))
        .credentialsProvider(InstanceProfileCredentialsProvider.builder().build())
        .build();

DirectConnectClient client = DirectConnectClient.builder()
        .region(Region.of(currentRegion))
        .credentialsProvider(InstanceProfileCredentialsProvider.builder().build())
        .build();

S3Client client = S3Client.builder()
        .region(Region.of(this.getRegion()))
        .credentialsProvider(InstanceProfileCredentialsProvider.builder().build())
        .build();

EfsClient client = EfsClient.builder()
        .region(Region.of(currentRegion))
        .credentialsProvider(InstanceProfileCredentialsProvider.builder().build())
        .build();

zoewangg commented 4 years ago

I tried creating multiple SDK clients, making API calls, and then closing the clients, but was not able to reproduce the issue (verified that IdleConnectionReaper#connectionManagers was empty at the end).

Could you provide a complete repro case?

debora-ito commented 4 years ago

We were not able to reproduce it; closing this due to no response. Please feel free to reopen if you are still experiencing this issue and can provide a complete repro case.

gdhananjay commented 4 years ago

@debora-ito , @dagnir , @zoewangg, @mtamassia

I am facing the same issue, and I also suspect there is a possible memory leak in the connection reaper. My application used to run fine for a week and then crashed with an OutOfMemoryError. I attached JProfiler, and it looks like the IdleConnectionReaper was not freeing up memory. Please look at the screenshot below:

Screenshot 2020-04-12 at 10 29 01 AM

Drilling further into this heavy object, I saw thousands of the instances below:

Screenshot 2020-04-12 at 10 30 36 AM

And here is the path to the GC root for those thousands of objects:

Screenshot 2020-04-12 at 10 31 05 AM

My application is simple: it creates a thread => creates a credentials provider as shown in the code below => collects some data from CloudWatch using this credentials provider => and then the thread dies.

Code to create credential provider :

public static AwsCredentialsProvider getRoleCredentialsProvider(String _roleArn, String _roleSessionName,
            String _externalId, String _region) {

        AssumeRoleRequest assumeRoleRequest = null;

        if (_externalId == null || _externalId.isEmpty()) {
            assumeRoleRequest = AssumeRoleRequest.builder().roleArn(_roleArn).roleSessionName(_roleSessionName)
                    .durationSeconds(900).build();
        } else {
            assumeRoleRequest = AssumeRoleRequest.builder().roleArn(_roleArn).externalId(_externalId)
                    .roleSessionName(_roleSessionName).durationSeconds(900).build();
        }

        StsClient stsClient = null;

        // AWS recommends using Regional AWS STS endpoints hence region as input.
        if (_region == null || _region.isEmpty()) {
            stsClient = StsClient.builder().region(Region.US_EAST_1).build();
        } else {
            stsClient = StsClient.builder().region(Region.of(_region)).build();
        }

        return StsAssumeRoleCredentialsProvider.builder().stsClient(stsClient).refreshRequest(assumeRoleRequest)
                .asyncCredentialUpdateEnabled(true).build();
    }

Code in the finally block of my application to close this provider:

            // Close StsAssumeRoleCredentialsProvider.
            if(credentialProvider != null && credentialProvider instanceof StsAssumeRoleCredentialsProvider) {
                ((StsAssumeRoleCredentialsProvider)credentialProvider).close();
            }
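
For illustration, a minimal sketch of the per-thread flow described above, tying the provider to a CloudWatch client (the listMetrics call, namespace, and variable names are assumptions, not the original code; imports come from the sts and cloudwatch modules):

AwsCredentialsProvider credentialProvider =
        getRoleCredentialsProvider(roleArn, sessionName, externalId, region);

CloudWatchClient cwClient = CloudWatchClient.builder()
        .region(Region.of(region))
        .credentialsProvider(credentialProvider)
        .build();
try {
    // Collect some data from CloudWatch using the assumed-role credentials.
    cwClient.listMetrics(ListMetricsRequest.builder().namespace("AWS/EC2").build());
} finally {
    cwClient.close();
    // Close StsAssumeRoleCredentialsProvider, as in the finally block above.
    if (credentialProvider instanceof StsAssumeRoleCredentialsProvider) {
        ((StsAssumeRoleCredentialsProvider) credentialProvider).close();
    }
}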

I can easily reproduce this issue. With JProfiler attached live, it clearly shows memory increasing because of these objects.

I am using the SDK version below:

        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>sts</artifactId>
            <version>2.5.40</version>
        </dependency>

Please let me know if you need any additional info.

zoewangg commented 4 years ago

Hi @Dhananjay-Ghongane, closing the credentials provider itself does not close the stsClient.

https://github.com/aws/aws-sdk-java-v2/blob/3d2e2e5ffe164e1876dbeccf5ef9fb486b2725b2/services/sts/src/main/java/software/amazon/awssdk/services/sts/auth/StsCredentialsProvider.java#L83

So you need to close the stsClient too. The stsClient is not closed automatically because it is provided by the customer.
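
For example, a minimal sketch of that pattern, keeping a reference to the StsClient so it can be closed alongside the provider (variable names are illustrative; the builder calls mirror the snippet above, and assumeRoleRequest is built the same way):

StsClient stsClient = StsClient.builder()
        .region(Region.US_EAST_1)
        .build();

StsAssumeRoleCredentialsProvider credentialProvider = StsAssumeRoleCredentialsProvider.builder()
        .stsClient(stsClient)
        .refreshRequest(assumeRoleRequest)
        .asyncCredentialUpdateEnabled(true)
        .build();

try {
    // ... build service clients with credentialProvider and do the work ...
} finally {
    credentialProvider.close(); // stops the async credential refresh
    stsClient.close();          // releases the underlying HTTP resources
}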

gdhananjay commented 4 years ago

@zoewangg, thanks, that should work. My jobs are short-lived, so I may not want a background thread periodically updating credentials. I am not using StsAssumeRoleCredentialsProvider; I simply call AssumeRole => take the credentials => use them, and repeat.
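
A rough sketch of that alternative, assuming the session credentials are wrapped in a StaticCredentialsProvider (AwsSessionCredentials and StaticCredentialsProvider come from software.amazon.awssdk.auth.credentials; names are illustrative):

AwsSessionCredentials sessionCredentials;
try (StsClient sts = StsClient.builder().region(Region.US_EAST_1).build()) {
    // One-shot AssumeRole call; the StsClient is closed by try-with-resources.
    Credentials creds = sts.assumeRole(assumeRoleRequest).credentials();
    sessionCredentials = AwsSessionCredentials.create(
            creds.accessKeyId(), creds.secretAccessKey(), creds.sessionToken());
}
// Short-lived clients can then use the static credentials directly;
// there is no background refresh thread or long-lived StsClient to leak.
StaticCredentialsProvider provider = StaticCredentialsProvider.create(sessionCredentials);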

mtamassia commented 4 years ago

@zoewangg I think what you posted above might resolve my issue too.

Our code created the credentials provider inline, without keeping a reference to it:

Ec2Client ec2Client = Ec2Client.builder()
        .region(Region.of(this.region))
        .credentialsProvider(InstanceProfileCredentialsProvider.builder().build())
        .build();

...

ec2Client.close();

so based on your message above, we changed it to this:

InstanceProfileCredentialsProvider credProvider = InstanceProfileCredentialsProvider.builder().build();
Ec2Client ec2Client = Ec2Client.builder()
        .region(Region.of(this.region))
        .credentialsProvider(credProvider)
        .build();

...

ec2Client.close();
credProvider.close();

In short, we are now closing the credentials provider. I have not tested this at scale, but do you think that could explain the leak on our end?

zoewangg commented 4 years ago

@mtamassia I don't think not closing InstanceProfileCredentialsProvider could cause the IdleConnectionReaper memory leak, since it doesn't use the IdleConnectionReaper at all. Also, asyncCredentialUpdateEnabled is false by default, so there's really nothing to close in InstanceProfileCredentialsProvider.

Is it possible that some SDK clients were not closed properly (failing silently) in your use case? Can you enable DEBUG log level to see if there are any error messages?

It should be something like this: https://github.com/aws/aws-sdk-java-v2/blob/3d2e2e5ffe164e1876dbeccf5ef9fb486b2725b2/utils/src/main/java/software/amazon/awssdk/utils/IoUtils.java#L74

BartXZX commented 3 years ago

Hi @zoewangg, I'm running into a similar issue. The only difference is that I am using the default credentials provider chain, which in turn resolves to the StsWebIdentityCredentialsProvider. After a lot of debugging I figured out that closing the client does not close the STS client created by the StsWebIdentityCredentialsProvider.

TL;DR: the provider chain wrapped in Lazy<T> never gets closed.

Long version: DefaultCredentialsProvider#close calls LazyAwsCredentialsProvider#close, which calls IoUtils.closeIfCloseable, but the provided delegate is not an AutoCloseable type. It is actually a Lazy<AwsCredentialsProvider>, and Lazy<T> does not implement AutoCloseable. So the chain (and consequently the STS client) in the delegate field never receives a close() call.
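
An illustrative-only model of the mechanics described above (simplified stand-ins, not SDK source): closeIfCloseable only acts on objects that implement AutoCloseable, so a non-closeable wrapper such as Lazy<T> means the wrapped provider never sees close():

// Stand-in for software.amazon.awssdk.utils.Lazy: note it does NOT implement
// AutoCloseable, so nothing propagates close() to the value it wraps.
final class Lazy<T> {
    private final T value;
    Lazy(T value) { this.value = value; }
    T getValue() { return value; }
}

final class CloseDemo {
    // Stand-in for IoUtils.closeIfCloseable: only closes AutoCloseable objects.
    static void closeIfCloseable(Object maybeCloseable) {
        if (maybeCloseable instanceof AutoCloseable) {
            try {
                ((AutoCloseable) maybeCloseable).close();
            } catch (Exception ignored) {
                // ignored for brevity
            }
        }
        // A Lazy<AwsCredentialsProvider> falls through untouched, so the
        // wrapped provider chain (and its STS client) is never closed.
    }
}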

Would you like me to open a new issue?

Screenshot of the debugger

Stack dump from IntelliJ

closeIfCloseable:89, IoUtils (software.amazon.awssdk.utils) [2]
close:50, LazyAwsCredentialsProvider (software.amazon.awssdk.auth.credentials.internal)
close:109, DefaultCredentialsProvider (software.amazon.awssdk.auth.credentials)
closeQuietly:70, IoUtils (software.amazon.awssdk.utils)
closeIfCloseable:87, IoUtils (software.amazon.awssdk.utils) [1]
lambda$close$0:86, AttributeMap (software.amazon.awssdk.utils)
accept:-1, 912038923 (software.amazon.awssdk.utils.AttributeMap$$Lambda$1914)
forEach:981, HashMap$Values (java.util)
close:86, AttributeMap (software.amazon.awssdk.utils)
close:79, SdkClientConfiguration (software.amazon.awssdk.core.client.config)
close:80, HttpClientDependencies (software.amazon.awssdk.core.internal.http)
close:73, AmazonSyncHttpClient (software.amazon.awssdk.core.internal.http)
close:118, BaseSyncClientHandler (software.amazon.awssdk.core.internal.handler)
close:12198, DefaultIamClient (software.amazon.awssdk.services.iam)
discover:37, IamServiceRoleAwsDiscovery (com.planonsoftware.cloud.library.discovery.iamrole)
provideParameters:63, PlanonEnvironmentActivitiesImpl (com.planonsoftware.cloud.pco.orchestrator.controllers.planonenvironment.activities)
invoke0:-1, NativeMethodAccessorImpl (jdk.internal.reflect)
invoke:62, NativeMethodAccessorImpl (jdk.internal.reflect)
invoke:43, DelegatingMethodAccessorImpl (jdk.internal.reflect)
invoke:567, Method (java.lang.reflect)
execute:214, POJOActivityTaskHandler$POJOActivityImplementation (com.uber.cadence.internal.sync)
handle:190, POJOActivityTaskHandler (com.uber.cadence.internal.sync)
handle:175, ActivityWorker$TaskHandlerImpl (com.uber.cadence.internal.worker)
handle:146, ActivityWorker$TaskHandlerImpl (com.uber.cadence.internal.worker)
lambda$process$0:71, PollTaskExecutor (com.uber.cadence.internal.worker)
run:-1, 640627828 (com.uber.cadence.internal.worker.PollTaskExecutor$$Lambda$1491)
runWorker:1128, ThreadPoolExecutor (java.util.concurrent)
run:628, ThreadPoolExecutor$Worker (java.util.concurrent)
run:830, Thread (java.lang)