awslabs / aws-crt-java

Java bindings for the AWS Common Runtime
Apache License 2.0
55 stars 40 forks source link

CRT doesn’t handle graceful shutdown properly #561

Closed nicolas-laduguie closed 6 months ago

nicolas-laduguie commented 1 year ago

Describe the bug We are using AWS Java CRT in the context of AWS sdk V2 with transfer manager for files uploads. When our Spring boot application is shutdown gracefully, if a multipart upload is started through TransferManager (developer preview) during the graceful shutdown, the transfer is initiated but gets stuck.

Expected Behavior

A multipart upload through TransferManager should finish successfully even though it's started during a graceful shutdown.

Current Behavior

The multipart upload starts as we can see logs like :

2022-11-30 14:34:35.205  INFO 116 --- [rding-upload-14] s.a.a.t.s.p.LoggingTransferListener      : [] Transfer initiated...
2022-11-30 14:34:35.208  INFO 116 --- [rding-upload-14] s.a.a.t.s.p.LoggingTransferListener      : [] |                    | 0.0%
2022-11-30 14:34:35.233 DEBUG 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Creating an interceptor chain that will apply interceptors in the following order: [software.amazon.awssdk.core.internal.interceptor.HttpChecksumRequiredInterceptor@1366a521, software.amazon.awssdk.core.internal.interceptor.SyncHttpChecksumInTrailerInterceptor@113458ec, software.amazon.awssdk.core.internal.interceptor.HttpChecksumValidationInterceptor@275648d5, software.amazon.awssdk.core.internal.interceptor.AsyncRequestBodyHttpChecksumTrailerInterceptor@308d51cb, software.amazon.awssdk.core.internal.interceptor.HttpChecksumInHeaderInterceptor@16e4b556, software.amazon.awssdk.awscore.interceptor.HelpfulUnknownHostExceptionInterceptor@25a47c5e, software.amazon.awssdk.awscore.eventstream.EventStreamInitialRequestInterceptor@30a198df, software.amazon.awssdk.awscore.interceptor.TraceIdExecutionInterceptor@6af88409, software.amazon.awssdk.services.s3.endpoints.internal.S3ResolveEndpointInterceptor@5985d205, software.amazon.awssdk.services.s3.endpoints.internal.S3EndpointAuthSchemeInterceptor@5f4503e7, software.amazon.awssdk.services.s3.endpoints.internal.S3RequestSetEndpointInterceptor@639e0fb0, software.amazon.awssdk.services.s3.internal.handlers.EnableChunkedEncodingInterceptor@4d680706, software.amazon.awssdk.services.s3.internal.handlers.DisableDoubleUrlEncodingInterceptor@3a8af154, software.amazon.awssdk.services.s3.internal.handlers.EnableTrailingChecksumInterceptor@4da797d9, software.amazon.awssdk.services.s3.internal.handlers.CreateMultipartUploadRequestInterceptor@5e57f252, software.amazon.awssdk.services.s3.internal.handlers.GetObjectInterceptor@538a9ea9, software.amazon.awssdk.services.s3.internal.handlers.ExceptionTranslationInterceptor@3ee7d5dd, software.amazon.awssdk.transfer.s3.internal.ApplyUserAgentInterceptor@4f8f8487, software.amazon.awssdk.services.s3.internal.handlers.AsyncChecksumValidationInterceptor@2eab3a46, software.amazon.awssdk.services.s3.internal.handlers.GetBucketPolicyInterceptor@295396ec, software.amazon.awssdk.services.s3.internal.handlers.PutObjectInterceptor@299af726, software.amazon.awssdk.services.s3.internal.handlers.SyncChecksumValidationInterceptor@1a82e52e, software.amazon.awssdk.services.s3.internal.handlers.DecodeUrlEncodedResponseInterceptor@6d8c8532, software.amazon.awssdk.services.s3.internal.handlers.CreateBucketInterceptor@585d8a99, software.amazon.awssdk.services.s3.internal.handlers.CopySourceInterceptor@26807573, software.amazon.awssdk.services.s3.internal.crt.DefaultS3CrtAsyncClient$ValidateRequestInterceptor@2ba64f18, software.amazon.awssdk.services.s3.internal.crt.DefaultS3CrtAsyncClient$AttachHttpAttributesExecutionInterceptor@228aea4b]
2022-11-30 14:34:35.238 DEBUG 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Interceptor 'software.amazon.awssdk.services.s3.endpoints.internal.S3EndpointAuthSchemeInterceptor@5f4503e7' modified the message with its modifyRequest method.
2022-11-30 14:34:35.239 TRACE 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Old: PutObjectRequest(Bucket=bucket, Key=key, Tagging=tags)
New: PutObjectRequest(Bucket=bucket, Key=key, Tagging=tags)
2022-11-30 14:34:35.239 DEBUG 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Interceptor 'software.amazon.awssdk.transfer.s3.internal.ApplyUserAgentInterceptor@4f8f8487' modified the message with its modifyRequest method.
2022-11-30 14:34:35.240 TRACE 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Old: PutObjectRequest(Bucket=bucket, Key=key, Tagging=tags)
New: PutObjectRequest(Bucket=bucket, Key=key, Tagging=tags)
2022-11-30 14:34:35.257 DEBUG 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Interceptor 'software.amazon.awssdk.services.s3.endpoints.internal.S3RequestSetEndpointInterceptor@639e0fb0' modified the message with its modifyHttpRequest method.
2022-11-30 14:34:35.257 TRACE 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Old: DefaultSdkHttpFullRequest(httpMethod=PUT, protocol=http, host=172.18.0.7, port=4566, encodedPath=encodedpath, headers=[Content-Length, Content-Type, x-amz-tagging], queryParameters=[])
New: DefaultSdkHttpFullRequest(httpMethod=PUT, protocol=http, host=172.18.0.7, port=4566, encodedPath=/bucketencodedpath, headers=[Content-Length, Content-Type, x-amz-tagging], queryParameters=[])
2022-11-30 14:34:35.257 DEBUG 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Interceptor 'software.amazon.awssdk.services.s3.internal.handlers.PutObjectInterceptor@299af726' modified the message with its modifyHttpRequest method.
2022-11-30 14:34:35.257 TRACE 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Old: DefaultSdkHttpFullRequest(httpMethod=PUT, protocol=http, host=172.18.0.7, port=4566, encodedPath=/bucketencodedpath, headers=[Content-Length, Content-Type, x-amz-tagging], queryParameters=[])
New: DefaultSdkHttpFullRequest(httpMethod=PUT, protocol=http, host=172.18.0.7, port=4566, encodedPath=/bucketencodedpath, headers=[Content-Length, Content-Type, Expect, x-amz-tagging], queryParameters=[])
2022-11-30 14:34:35.257 DEBUG 116 --- [rding-upload-14] software.amazon.awssdk.request           : [] Sending Request: DefaultSdkHttpFullRequest(httpMethod=PUT, protocol=http, host=172.18.0.7, port=4566, encodedPath=/bucketencodedpath, headers=[amz-sdk-invocation-id, Content-Length, Content-Type, Expect, User-Agent, x-amz-tagging], queryParameters=[])

But then nothing happens, so seems like the multipart upload is stuck.

On CRT side, we can see error logs like :

[ERROR] [2022-12-06T15:47:19Z] [00007f821ae37700] [S3MetaRequest] - id=0x7f82a0009090 Could not read from body stream.
[ERROR] [2022-12-06T15:47:19Z] [00007f821ae37700] [S3MetaRequest] - id=0x7f82a0009090 Could not prepare request 0x7f8278012fb0 due to error 9216 (Attempt to use a JVM that has already been destroyed). 

That show CRT has detected beginning of JVM destroy, but not managing graceful shutdown.

Reproduction Steps

Spring boot 2.5.12 AWS SDK 2.18.3 s3-transfer-manager 2.18.3-PREVIEW

Possible Solution

No response

Additional Information/Context

No response

AWS Java SDK version used

2.18.3

JDK version used

17

Operating System and version

Docker Linux Ubuntu Focal 20

bretambrose commented 1 year ago

I don't think your expectation is achievable. There comes a point in shutdown where it is not possible for native code to call back into the JVM and we do not know of anything more accurate (timepoint-wise) than the JVM shutdown hook. If you have a different understanding, then there might be something we could do.

One lesser possibility that comes to mind is potentially a registry of incomplete CRT futures that could at least be failed-en-masse by the CRT's JVM shutdown hook, which would at least unblock waiters.

nicolas-laduguie commented 1 year ago

IMHO, this is too early during an application shutdown to bind to a shutdown hook, because modern applications are doing/can do things still during the shutdown. Taking concrete example :

It can even be the case where request received (upload not initiated yet) -> shutdown received -> upload initiated -> CRT resources no more available -> failure

A solution would be to give developers the choice to use the CRT defined shutdown hook, or disable the hook via a property, and expose a stop method that the developer would need to call during the shutdown. That would be the easiest way to me.

bretambrose commented 1 year ago

That might be a reasonable compromise.

faucct commented 1 year ago

One lesser possibility that comes to mind is potentially a registry of incomplete CRT futures that could at least be failed-en-masse by the CRT's JVM shutdown hook, which would at least unblock waiters.

Is not it possible for them to be failed explicitly by closing the HTTP-client?

bretambrose commented 1 year ago

Revisiting this, we've thrown around a couple of ideas that might help here; I may take a stab at them shortly:

  1. Experiment with having the shutdown hook's first execution do nothing but schedule a followup shutdown hook that does the actual firewall between native and the JVM. The hope is that shutdown hook registration is a list append so this "hack" would delay CRT shutdown until after all other hooks that were previously registered (pre-shutdown) had a chance to run.
  2. For rare cases where the above doesn't solve the problem (user code keeps resubmitting new shutdown hooks after shutdown starts), add a static API to disable the CRT shutdown hook as well as one to invoke the CRT shutdown manually. This would let a user from whom our default approach is inadequate to shutdown the CRT when they are ready for it to be shut down.
nicolas-laduguie commented 1 year ago

Thank you for feedback. The solution 2 is enough for us because in general with Java and spring boot ecosystem, it is easy to control all the shutdown flow, and it would be easy to call the CRT shutdown manually at the right time. Would you have a timeline for this fix?

bretambrose commented 1 year ago

Don't have a timeline but I can probably squeeze it in over the next week. A cursory google search just demonstrated that proposal (1) is a bad idea as it indicates there's no real fifo order (and they're concurrent too) to shutdown hook registration vs. execution.

I'm wondering if ref-counting may make sense here. So every call to disable_crt_shutdown() requires a corresponding call to shutdown_crt() and then the call that makes the ref count zero (almost always the first one, but supports complex cases where multiple dependencies might be trying to control the shutdown time of the CRT. In that case, the last user shutdown hook would "win" and the CRT would shutdown after its invocation, which would be the desired behavior).

bretambrose commented 1 year ago

Here's a quick sketch (ie no testing yet):

https://github.com/awslabs/aws-crt-java/pull/672

The basic contract is that anyone who wants to delay/control the CRT shutdown calls enableManualShutdown and when it's safe (from their perspective) to shutdown the CRT, they call manualShutdown. When the last matching call to manualShutdown is invoked, then the CRT actually gets shutdown.

Drareeg commented 9 months ago

@bretambrose Do you have an update on this issue? We are in a very similar situation as @nicolas-laduguie

bretambrose commented 8 months ago

I might be able to squeeze this into next oncall (next week).

Would the proposed solution PR work for you?

Drareeg commented 8 months ago

Yes

bretambrose commented 6 months ago

This was addressed a while back by https://github.com/awslabs/aws-crt-java/pull/672 and released in v0.29.7