Snapstart before checkPoint not getting called

aj070 commented 1 year ago

I have a Spring Boot application (v2.7.2) that connects to a database using a Hikari connection pool. When it is deployed with aws-serverless-java-container and SnapStart, the application's beforeCheckpoint function is correctly called to evict connections from the pool. The same does not happen when it is deployed with the web adapter as shown in the springboot-zip example. It looks like there is an issue with SnapStart when used with the adapter, but I couldn't find it documented anywhere. I am fairly sure my CRaC configuration for checkpointing is right, because it works with aws-serverless-java-container.

bnusunny commented 1 year ago

Thanks for reporting this. Lambda Web Adapter has not implemented SnapStart runtime hooks yet. The adapter and the Spring Boot application are two separate processes, so the in-process runtime hooks design won't work in this case.

We are thinking about adding two HTTP calls to trigger runtime hooks in the web application process. For example, before the snapshot the adapter sends a POST request to the /checkpoint path on the web app, and after resume it sends another POST request to the /resume path. These two requests will be the runtime hooks for the web application. You could change the actual paths via configuration.

Do you think this makes sense?

aj070 commented 1 year ago

@bnusunny First of all, thanks for this excellent tooling for deploying frameworks to Lambda. I assume you are proposing that the application developer expose two APIs in the application, which the Web Adapter will call over HTTP to trigger checkpointing and restoration. The comments below assume that this is right (if not, please correct me).

While the proposed solution would work perfectly for my use case, it is something I would prefer not to do. My reason for choosing Lambda Web Adapter over the serverless Java container was that I could deploy my application to Lambda without any change to my code (at most adding a health endpoint). While I had to implement CRaC myself for Spring Boot, my intuition is that frameworks will add it to their core soon (Micronaut already has micronaut-crac, and I have ported it to my Spring Boot application). As an application developer, I would ideally not want to think about my CRaC implementation. Your proposed solution would force application developers to ensure that all the network resources in their applications are properly checkpointed and restored. I think it is better to abstract this away from developers and leave it to the frameworks.

Having said that, I'm a bit confused about why the Init phase does not automatically trigger the beforeCheckpoint runtime hook when an extension (the web adapter in this case) is added. The SnapStart documentation does not mention that extensions have to implement the SnapStart runtime hooks for them to work correctly.

bnusunny commented 1 year ago

Your assumptions are correct. That's what I'm proposing.

The SnapStart runtime hooks are actually implemented by the Java runtime and exposed to developers as the CRaC API. This works for normal Java functions because the function code runs inside the Java runtime process.

LWA is an extension and also a custom runtime process. LWA could receive signals for runtime hooks. But since the web application is not running within the LWA process, it is not possible to trigger the CRaC API in another process without sending a request to it.

I could also provide a Java package that exposes the two APIs and triggers the CRaC hooks for you. All you would need to do is include this package. Would this help?

bnusunny commented 1 year ago

I could expose these two APIs over a Unix domain socket and make it more secure for IPC.

aj070 commented 1 year ago

A Java package will definitely help from a developer's viewpoint. But won't it increase the scope of things you have to handle? You will have to plan implementations for almost all the network-bound solutions supported by frameworks.

My understanding was that an extension can have a separate runtime, but the runtime on which my code runs is still the AWS-managed Java runtime (the console also shows it that way). In that case, can't the extension delegate the task of checkpointing to the Java runtime on which my code runs? I have to admit that my understanding of extensions is very limited and what I say may be completely wrong. It would be great if you could point me to a good write-up on extensions and also help me understand why adding this extension altered the behaviour of the Java runtime.

aj070 commented 1 year ago

> I could expose these two APIs over a Unix domain socket and make it more secure for IPC.

HTTP requests could expose the CRaC hooks to the outside world. If you are taking the IPC path, then a Unix domain socket is better than HTTP. I am not sure whether there is an even better way of resolving this, as I have a limited understanding of the overall setup.

bnusunny commented 1 year ago

The Lambda Web Adapter layer contains two files. One is the lambda-adapter binary, which is the actual extension and starts up automatically before the Java runtime. The other is a wrapper script, bootstrap, which is executed when the Java runtime boots up. This script calls the run.sh script to start the web application, so it effectively replaces the Java runtime process. This is how the LWA layer works.

To read more about wrapper scripts, check out the Lambda Developer Guide.

intirix commented 1 year ago

I think having the web adapter call /checkpoint and /restore would be a good compromise. It would make sense to have a library that takes the incoming /checkpoint and /restore requests and turns them into the standard CRaC API calls. That way, apps that use the CRaC API can continue to function. This could be done in two phases: early adopters can listen to /checkpoint and /restore while we wait for the frameworks to potentially add support for bridging between the URIs and the CRaC APIs.
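The bridging idea in the comment above could be sketched roughly as follows. This is only an illustration under assumptions: HookBridge and its nested Resource interface are hypothetical names, and Resource is a simplified stand-in for org.crac.Resource (whose real methods take a Context argument). The ordering, reverse registration order before a checkpoint and registration order after a restore, is meant to mirror how CRaC contexts commonly notify resources; nothing in this thread specifies it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bridge: maps hook request paths onto CRaC-style callbacks.
class HookBridge {
    /** Simplified stand-in for org.crac.Resource (assumption: no Context argument). */
    interface Resource {
        void beforeCheckpoint() throws Exception;
        void afterRestore() throws Exception;
    }

    private final List<Resource> resources = new ArrayList<>();

    /** Registers a resource, e.g. a wrapper around a connection pool. */
    synchronized void register(Resource resource) {
        resources.add(resource);
    }

    /** Runs the hooks for a request path and returns an HTTP-style status code. */
    synchronized int dispatch(String path) {
        try {
            if ("/checkpoint".equals(path)) {
                // Notify in reverse registration order, so dependents close first.
                for (int i = resources.size() - 1; i >= 0; i--) {
                    resources.get(i).beforeCheckpoint();
                }
                return 204;
            }
            if ("/restore".equals(path)) {
                // Notify in registration order, so dependencies reopen first.
                for (Resource resource : resources) {
                    resource.afterRestore();
                }
                return 204;
            }
            return 404; // not a hook path
        } catch (Exception e) {
            return 500; // a hook failed; the caller can decide what to do
        }
    }
}
```

An HTTP layer would call dispatch with the request path and return the resulting status, while application code keeps implementing only the two callbacks.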

If you do go down the path of posting into /checkpoint and /restore, then I think we would need a way for the adapter to block those requests from coming in. I wouldn't want an external entity to be able to close my DB connections over and over again.

bnusunny commented 1 year ago

We're making progress on this issue and have finalized a plan to implement runtime hooks. Here's an overview of the planned solution:

1. We'll update both the Lambda Rust Runtime and Lambda Web Adapter to receive runtime hook notifications.
2. The Lambda Web Adapter will be configured to send HTTP POST requests to the web app via the /checkpoint and /restore endpoints. These requests will include a local secret in the Authentication header. Note that the endpoint paths and the secret can be customized.
3. The web app will be responsible for listening to the /checkpoint and /restore paths. On receiving POST requests, the app should validate the Authentication header and carry out necessary actions. The inclusion of the Authentication header is designed to ensure the web app retains control and can safeguard against unauthorized access.
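A minimal sketch of what the web app side of this plan might look like, using only the JDK's built-in com.sun.net.httpserver package. The class name SnapshotHookServer, the Hooks and Action interfaces, the chosen status codes, and the assumption that the Authentication header carries the raw secret value are all illustrative; the plan above does not specify the header format or how the secret is configured.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical web-app-side listener for the adapter's hook requests.
class SnapshotHookServer {
    /** Hypothetical callbacks; a real app would evict and reopen its pool here. */
    interface Hooks {
        void beforeCheckpoint() throws Exception;
        void afterRestore() throws Exception;
    }

    /** A single hook invocation that may throw. */
    interface Action {
        void run() throws Exception;
    }

    /** Decides the HTTP status for a hook request (assumed status codes). */
    static int statusFor(String method, String authHeader, String secret) {
        if (!"POST".equals(method)) return 405;
        if (authHeader == null) return 401;
        // MessageDigest.isEqual compares in constant time.
        boolean matches = MessageDigest.isEqual(
                authHeader.getBytes(StandardCharsets.UTF_8),
                secret.getBytes(StandardCharsets.UTF_8));
        return matches ? 204 : 403;
    }

    /** Starts a loopback-only server exposing /checkpoint and /restore. */
    static HttpServer start(int port, String secret, Hooks hooks) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/checkpoint", ex -> handle(ex, secret, hooks::beforeCheckpoint));
        server.createContext("/restore", ex -> handle(ex, secret, hooks::afterRestore));
        server.start();
        return server;
    }

    private static void handle(HttpExchange ex, String secret, Action action) throws IOException {
        int status = statusFor(ex.getRequestMethod(),
                ex.getRequestHeaders().getFirst("Authentication"), secret);
        if (status == 204) {
            try {
                action.run(); // only runs for an authenticated POST
            } catch (Exception e) {
                status = 500;
            }
        }
        ex.sendResponseHeaders(status, -1); // -1 means no response body
        ex.close();
    }
}
```

In a Spring Boot application the same checks would more likely live in a controller or filter; the point of the sketch is only that the hook action runs after the method and header checks pass, which addresses the concern about an outside caller closing connections repeatedly.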

Our team is actively working on the implementation of these solutions. We will continue to provide updates on this issue as we progress.

aj070 commented 1 year ago

Thanks for the update.

Also can you clarify this. You mentioned:

> Another one is a wrapper script bootstrap, which is executed when Java Runtime boots up. This script calls the run.sh script to start up a web application. it effectively replaces Java Runtime process.

But the Lambda extension documentation says that internal extensions (a wrapper script is one) run in the same process as a separate thread. How, then, does it replace the Java runtime process?

bnusunny commented 1 year ago

The bootstrap script uses exec to execute the run.sh. This completely replaces the Java runtime process.

exec -- "${LAMBDA_TASK_ROOT}/${_HANDLER}"

aj070 commented 1 year ago

In that case, an eval should work with SnapStart, right? I tried it in my repo and the results are still the same as with exec. What am I missing?

awslabs / aws-lambda-web-adapter

Snapstart before checkPoint not getting called #184