aws-samples / aws-cloudhsm-jce-examples

Sample applications demonstrating how to use the CloudHSM JCE
MIT No Attribution

CloudHSM sign in and sign out problem #70

Open — pabloiarriola opened this issue 1 year ago

pabloiarriola commented 1 year ago

We are using the latest CloudHSM SDK 5 for Java. We create the connection from AWS Lambda, and the sign-in and sign-out are done with the same code as the examples in this repo. The problem occurs in a warm environment: when we try to sign in, we get the following error:

"message": "The underlying Provider connection was lost: Underlying connection to provider was lost", "name": "com.amazonaws.cloudhsm.jce.jni.exception.ProviderException",

In a cold environment we don't have this problem, as it signs in and out correctly, but when the environment is reused we get this exception. We are using loginWithExplicitCredentials.
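For context, our sign-in/sign-out code follows the same pattern as the repo's LoginRunner example. A simplified sketch of that pattern (class names here are illustrative, not the exact example code):

```java
import java.security.AuthProvider;
import java.security.Security;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.PasswordCallback;

import com.amazonaws.cloudhsm.jce.provider.CloudHsmProvider;

public class LoginSketch {

    // Hands the "user:password" string to the provider when it requests credentials.
    static class CredentialsHandler implements CallbackHandler {
        private final char[] credentials;

        CredentialsHandler(String user, String pass) {
            this.credentials = (user + ":" + pass).toCharArray();
        }

        @Override
        public void handle(Callback[] callbacks) {
            for (Callback callback : callbacks) {
                if (callback instanceof PasswordCallback) {
                    ((PasswordCallback) callback).setPassword(credentials);
                }
            }
        }
    }

    public static void login(String user, String pass) throws Exception {
        // Register the provider once per JVM (i.e. once per Lambda execution environment).
        if (Security.getProvider(CloudHsmProvider.PROVIDER_NAME) == null) {
            Security.addProvider(new CloudHsmProvider());
        }
        AuthProvider provider =
                (AuthProvider) Security.getProvider(CloudHsmProvider.PROVIDER_NAME);
        provider.login(null, new CredentialsHandler(user, pass));
    }

    public static void logout() throws Exception {
        AuthProvider provider =
                (AuthProvider) Security.getProvider(CloudHsmProvider.PROVIDER_NAME);
        if (provider != null) {
            provider.logout();
        }
    }
}
```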

rday commented 1 year ago

Hi @pabloiarriola, can you confirm which version of SDK 5 you are using? Depending on your host operating system's package manager, you can use:

Yum

yum search -v cloudhsm

Apt

apt search cloudhsm

This will give you the major and minor version.

pabloiarriola commented 1 year ago

> Hi @pabloiarriola, can you confirm which version of SDK 5 you are using? Depending on your host operating system's package manager, you can use `yum search -v cloudhsm` (Yum) or `apt search cloudhsm` (Apt). This will give you the major and minor version.

We are using it as a layer for our Lambda. The version we are packaging is cloudhsm-jce-5.7.0.jar, together with log4j-api-2.17.1.jar and log4j-core-2.17.1.jar.

rday commented 1 year ago

Thanks @pabloiarriola. Can you upgrade to the latest version of the JCE provider, 5.8.0? That version has some updates which address the warm-start issue you are experiencing.

pabloiarriola commented 1 year ago

Hi @rday, we updated to 5.9 and we are still getting: "message": "The underlying Provider connection was lost: Communication with the device was lost during the execution of the function.", "name": "com.amazonaws.cloudhsm.jce.jni.exception.ProviderException",

rday commented 1 year ago

@pabloiarriola Sorry to hear that. In this case we would need to collect more information to investigate. We recommend working with your Account Manager to open a support case; they can collect the necessary information so support can investigate your situation.

pabloiarriola commented 1 year ago

@rday thank you. Just to verify, this issue was addressed in version 5.8.0, correct?

guillomep commented 1 year ago

We are getting the same error, and we are using 5.8.0.

rday commented 1 year ago

Hi @guillomep, you can try upgrading to the latest release at this time, 5.10, or you can try reaching out to your TAM to collect more information about your specific environment. We would need to see logs around the operations, the keep-alives, and the points when the connections were dropped.

Sabo-kun commented 9 months ago

Hi all,

We are also using the CloudHSM provider (client 5.11.0) in a Java-based AWS Lambda. For our initial tests we configured implicit login (using environment variables).
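For reference, our setup is roughly the following (a simplified, illustrative sketch, assuming the HSM_USER / HSM_PASSWORD environment variables described in the CloudHSM docs; with implicit login we only register the provider and let the first HSM operation trigger the login):

```java
import java.security.KeyStore;
import java.security.Security;

import com.amazonaws.cloudhsm.jce.provider.CloudHsmProvider;

public class ImplicitLoginSketch {

    public static void init() throws Exception {
        // Register the CloudHSM JCE provider once per execution environment.
        if (Security.getProvider(CloudHsmProvider.PROVIDER_NAME) == null) {
            Security.addProvider(new CloudHsmProvider());
        }

        // No explicit provider.login(...) call here: with HSM_USER / HSM_PASSWORD
        // set in the Lambda environment, the provider logs in implicitly when the
        // first HSM operation runs, e.g. loading the CloudHSM KeyStore.
        KeyStore keyStore = KeyStore.getInstance(CloudHsmProvider.PROVIDER_NAME);
        keyStore.load(null, null);
    }
}
```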

We can successfully interact with the HSM if we invoke the Lambda continuously. However, when we stop invoking it for about 30 seconds, we get log messages like these:

```
2024-01-18T16:19:53.186Z WARN [8] ThreadId(2) [cloudhsm_provider_common::keep_alive] CC000: Maximum keep-alive attempts have been reached for 10.4.1.27. Stopping keep-alive task.
2024-01-18T16:19:53.186Z INFO [8] ThreadId(2) [cloudhsm_provider_common::dispatcher] Exiting all active dispatcher operations
2024-01-18T16:19:53.186Z INFO [8] ThreadId(2) [cloudhsm_provider_common::dispatcher] Exiting all active dispatcher operations
2024-01-18T16:19:53.187Z ERROR [8] ThreadId(1) [cloudhsm_provider::hsm1::hsm_connection::error] Disconnected with server. Message: Tls disconnected. Reason: Send Failed. Dispatcher is now disconnected.
2024-01-18T16:19:53.187Z ERROR [8] ThreadId(1) [cloudhsm_provider_common::keep_alive] Keep-alive failed for 10.4.1.27. Internal Error: Internal error occurred. Error: HSM is disconnected
2024-01-18T16:19:53.346Z WARN [8] ThreadId(3) [cloudhsm_utils::retry] Receive error: Connection retry attempts on HSM failed. For Operation get_hsm_connection. Going to retry. Attempts 0/3
2024-01-18T16:19:53.865Z WARN [8] ThreadId(3) [cloudhsm_utils::retry] Receive error: Connection retry attempts on HSM failed. For Operation get_hsm_connection. Going to retry. Attempts 1/3
```

The client performs some connection retries, then establishes the connection and works properly. Unfortunately, when this issue occurs, processing takes about 10x longer than a normal execution (roughly 3500 ms instead of 300 ms). The same code works normally in a Spring Boot application deployed on an EC2 instance.

Is there any way to get the Lambda to work properly on every invocation?

Thank you!

pabloiarriola commented 7 months ago

Hey @Sabo-kun, did you find a solution?

@rday we are now having a problem where it seems the connection is not even being started. We are using a layer, and it seems like it never starts: we don't get any error messages or anything, the Lambda just times out. We are using version 5.9. (Screenshot attached: Screen Shot 2024-03-21 at 07 34 10.)

guillomep commented 7 months ago

Still no solution on 5.10.0; we are still having the problem.

We also notice that we sometimes get the following message:

Unexpected error with the Provider: E2e failed to process the HSM response. Failed to decrypt using e2e.

kellyshkang commented 7 months ago

@pabloiarriola and @Sabo-kun, it sounds like the workflows are somewhat sporadic. Lambda will freeze your execution environment after processing has stopped. Depending on how long it takes for the next invocation, Lambda will "warm start" or "cold start" your code. The timeouts for a warm start and a cold start are not defined.

What this means for CloudHSM is that our data plane is not able to communicate with your client after the invocation has been frozen. If your Lambda is "warm started", it is possible that our data plane has timed out, but the client thinks the connection is still alive. This is a probable cause of the 10x processing time. During a cold start, everything is built from scratch: the client establishes all the connections, and things work much faster.
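To make the timing concrete, in a typical Lambda the provider is registered during initialization (cold start) and then reused across invocations. A hypothetical sketch (class and method names below are illustrative only, assuming the standard aws-lambda-java-core RequestHandler interface, not code from this repo):

```java
import java.security.Security;

import com.amazonaws.cloudhsm.jce.provider.CloudHsmProvider;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class SignHandler implements RequestHandler<String, String> {

    // Runs once per execution environment, i.e. on a cold start. The provider
    // and the HSM connections it opens are frozen along with the environment.
    static {
        try {
            if (Security.getProvider(CloudHsmProvider.PROVIDER_NAME) == null) {
                Security.addProvider(new CloudHsmProvider());
            }
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    @Override
    public String handleRequest(String input, Context context) {
        // On a warm start this method runs again in the thawed environment, but the
        // connections opened at cold start may have been dropped by the data plane
        // while the environment was frozen, surfacing as the ProviderException above.
        // ... sign / verify with the CloudHSM provider here ...
        return "ok";
    }
}
```

Everything created during initialization survives across warm starts, and that is exactly the state that can become stale while the environment is frozen.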

While this warm start delay is something we are aware of, we are still working on the right way to address the problem. Any data we could collect would be great. We can also work with you on your specific situation, but that would have to be done through Customer Support.

I'll update this issue as we make progress. Thanks for continuing to report!