Fix: neptune-endpoints-info-lambda takes longer time than expected to reflect the latest cluster status

mti-takayama-t commented 1 year ago

Issue

We experienced a long time lag for ClusterEndpointsRefreshAgent in our app to recognize a newly added neptune instance via lambdaProxy.

For example, after patching neptune-endpoints-info-lambda:

--- a/neptune-endpoints-info-lambda/src/main/java/software/amazon/lambda/NeptuneEndpointsInfoLambda.java
+++ b/neptune-endpoints-info-lambda/src/main/java/software/amazon/lambda/NeptuneEndpointsInfoLambda.java
@@ -58,7 +58,10 @@ public class NeptuneEndpointsInfoLambda implements RequestStreamHandler {
         System.out.println(String.format("suspendedEndpoints: %s", this.suspendedEndpoints));

         refreshAgent.startPollingNeptuneAPI(
-                (OnNewClusterMetadata) metadata -> neptuneClusterMetadata.set(metadata),
+                (OnNewClusterMetadata) metadata -> {
+                    neptuneClusterMetadata.set(metadata);
+                    System.out.println("Refreshed cluster metadata");
+                },
                 pollingIntervalSeconds,
                 TimeUnit.SECONDS);
     }

, invoking the lambda with the default pollingIntervalSeconds (15 seconds) at the rate of 1 request every 15 seconds (+ about 2 seconds of CLI overhead) from one client resulted in the following refresh events:

2023-06-01T15:47:12.119+09:00
2023-06-01T15:55:42.176+09:00
2023-06-01T15:57:24.126+09:00
2023-06-01T16:00:48.219+09:00
2023-06-01T16:03:04.153+09:00
2023-06-01T16:07:19.251+09:00
2023-06-01T16:12:08.141+09:00
2023-06-01T16:15:15.556+09:00
2023-06-01T16:23:11.353+09:00
2023-06-01T16:25:10.197+09:00

The intervals of the events seem somewhat random, but about 2-13 minutes, which is far longer than the default pollingIntervalSeconds of 15 seconds.

This PR tries to improve the behavior above.

TestCases

1:
- pollingIntervalSeconds is 15 seconds by default
- one request every 15 seconds (+ about 2 seconds of CLI overhead) from one client
- expects about 17 seconds between each refresh (= 1 request per refresh)
```
2023-06-01T17:22:41.488+09:00
2023-06-01T17:22:58.805+09:00
2023-06-01T17:23:15.485+09:00
2023-06-01T17:23:32.463+09:00
2023-06-01T17:23:49.503+09:00
```
2:
- pollingIntervalSeconds is 15 seconds by default
- one request every 12 seconds (+ about 2 seconds of CLI overhead) from one client
- expects about 28 seconds between each refresh (= 2 requests per refresh)
```
2023-06-01T17:26:06.836+09:00
2023-06-01T17:26:34.526+09:00
2023-06-01T17:27:02.402+09:00
2023-06-01T17:27:30.784+09:00
2023-06-01T17:27:58.561+09:00
```
3:
- pollingIntervalSeconds is 15 seconds by default
- one request every 1 seconds (+ about 2 seconds of CLI overhead) from one client
- expects about seconds 15-18 between each refresh (= 5 or 6 requests per refresh)
```
2023-06-01T17:40:43.766+09:00
2023-06-01T17:41:01.703+09:00
2023-06-01T17:41:19.882+09:00
2023-06-01T17:41:37.662+09:00
2023-06-01T17:41:55.655+09:00
```
4:
- pollingIntervalSeconds is 15 seconds by default
- one request every 30 seconds (+ about 2 seconds of CLI overhead) from one client
- expects about seconds 32 seconds between each refresh (= 1 request per refresh)
```
2023-06-04T12:00:14.491+09:00
2023-06-04T12:00:46.419+09:00
2023-06-04T12:01:18.454+09:00
2023-06-04T12:01:50.640+09:00
2023-06-04T12:02:22.667+09:00
```

Environment

The version we used and on which the PR is based: v2.0.0
(but rebased the PR onto main HEAD following v2.0.1)
The JDK version we used: 17
(compiling with JDK 1.8 resulted in an error of git-commit-id-maven-plugin: io.github.git-commit-id:git-commit-id-maven-plugin:5.0.0:revision (get-the-git-infos) on project gremlin-client: Execution get-the-git-infos of goal io.github.git-commit-id:git-commit-id-maven-plugin:5.0.0:revision failed: Unable to load the mojo 'revision' in the plugin 'io.github.git-commit-id:git-commit-id-maven-plugin:5.0.0' due to an API incompatibility: org.codehaus.plexus.component.repository.exception.ComponentLookupException: pl/project13/maven/git/GitCommitIdMojo has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0)
Lambda configuration:
- MemorySize: 512
- ReservedConcurrentExecutions: 1
- Runtime: java17
- Architecture: arm64

We would appreciate it if you could review, confirm, and even retest this PR.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

iansrobinson commented 1 year ago

@mti-takayama-t Thank you very much for submitting this. I'll test and review before the end of the week. ian

iansrobinson commented 1 year ago

@mti-takayama-t I've tested this and am about to merge it. Thank you for submitting.

What I've noticed with this fix is that, when a refresh is due, the call to awake() will cause the Lambda to take longer to execute: awake() effectively waits on the main thread for the refresh to take place on the refresh agent thread. It was never intended for this pause to be visible to clients of the Lambda, but I don't think there's any way around it. The Lambda execution times are typically <20 ms, and in low throughput scenarios there is therefore a good chance that the Lambda context will be asleep whenever a scheduled refresh should next take place. I think we have to take the hit, as per your fix, to ensure that refreshing occurs when the Lambda is active.

Because the Lambda execution times are longer whenever a refresh occurs, the max duration of the Lambda has now gone up, from approx 100ms to around 700ms. The average, though, remains below 20ms. That max duration does have an impact however: because the CFN template I supply restricts the concurrency of the Lambda function (to 1, by default) , I've noticed, again in reasonably high throughput scenarios, that the Lambda is heavily throttled, as many clients attempt to get info from the Lambda around the same time. That's not necessarily an issue, though, because this throttling is masked in the client, where calls to the Lambda take place in the background.

Overall, then, I think we trade better freshness for some increased latency. But this increase in latency should generally not be visible to the client.

I actually think awake() should also be called by any 'application' Lambda functions that use the Neptune Gremlin Client and a refresh agent (I do a lot of testing with Lambda functions that call Neptune). The same problem will affect these Lambda functions as affects the neptune-endpoints-info-lambda: that is, refreshes may sometimes be scheduled for times when the Lambda context is asleep. I'm going to update the docs with this suggestion.

Thanks again for the PR.

ian

mti-takayama-t commented 1 year ago

@iansrobinson Thank you very much for your close testing and inspection. I agree with your discussion about the increased latency and the conclusion of its tradeoff.

We are highly grateful to you for taking the time to merge the PR.

aws / neptune-gremlin-client