aws / neptune-gremlin-client

A Java Gremlin client for Amazon Neptune that allows you to change the endpoints used by the client as it is running.
Apache License 2.0
5 stars 2 forks source link

Fix: neptune-endpoints-info-lambda takes longer time than expected to reflect the latest cluster status #1

Closed mti-takayama-t closed 1 year ago

mti-takayama-t commented 1 year ago

Issue

We experienced a long time lag for ClusterEndpointsRefreshAgent in our app to recognize a newly added neptune instance via lambdaProxy.

For example, after patching neptune-endpoints-info-lambda:

--- a/neptune-endpoints-info-lambda/src/main/java/software/amazon/lambda/NeptuneEndpointsInfoLambda.java
+++ b/neptune-endpoints-info-lambda/src/main/java/software/amazon/lambda/NeptuneEndpointsInfoLambda.java
@@ -58,7 +58,10 @@ public class NeptuneEndpointsInfoLambda implements RequestStreamHandler {
         System.out.println(String.format("suspendedEndpoints: %s", this.suspendedEndpoints));

         refreshAgent.startPollingNeptuneAPI(
-                (OnNewClusterMetadata) metadata -> neptuneClusterMetadata.set(metadata),
+                (OnNewClusterMetadata) metadata -> {
+                    neptuneClusterMetadata.set(metadata);
+                    System.out.println("Refreshed cluster metadata");
+                },
                 pollingIntervalSeconds,
                 TimeUnit.SECONDS);
     }

, invoking the lambda with the default pollingIntervalSeconds (15 seconds) at the rate of 1 request every 15 seconds (+ about 2 seconds of CLI overhead) from one client resulted in the following refresh events:

2023-06-01T15:47:12.119+09:00
2023-06-01T15:55:42.176+09:00
2023-06-01T15:57:24.126+09:00
2023-06-01T16:00:48.219+09:00
2023-06-01T16:03:04.153+09:00
2023-06-01T16:07:19.251+09:00
2023-06-01T16:12:08.141+09:00
2023-06-01T16:15:15.556+09:00
2023-06-01T16:23:11.353+09:00
2023-06-01T16:25:10.197+09:00

The intervals of the events seem somewhat random, but about 2-13 minutes, which is far longer than the default pollingIntervalSeconds of 15 seconds.

This PR tries to improve the behavior above.

TestCases

Environment


We would appreciate it if you could review, confirm, and even retest this PR.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

iansrobinson commented 1 year ago

@mti-takayama-t Thank you very much for submitting this. I'll test and review before the end of the week. ian

iansrobinson commented 1 year ago

@mti-takayama-t I've tested this and am about to merge it. Thank you for submitting.

What I've noticed with this fix is that, when a refresh is due, the call to awake() will cause the Lambda to take longer to execute: awake() effectively waits on the main thread for the refresh to take place on the refresh agent thread. It was never intended for this pause to be visible to clients of the Lambda, but I don't think there's any way around it. The Lambda execution times are typically <20 ms, and in low throughput scenarios there is therefore a good chance that the Lambda context will be asleep whenever a scheduled refresh should next take place. I think we have to take the hit, as per your fix, to ensure that refreshing occurs when the Lambda is active.

Because the Lambda execution times are longer whenever a refresh occurs, the max duration of the Lambda has now gone up, from approx 100ms to around 700ms. The average, though, remains below 20ms. That max duration does have an impact however: because the CFN template I supply restricts the concurrency of the Lambda function (to 1, by default) , I've noticed, again in reasonably high throughput scenarios, that the Lambda is heavily throttled, as many clients attempt to get info from the Lambda around the same time. That's not necessarily an issue, though, because this throttling is masked in the client, where calls to the Lambda take place in the background.

Overall, then, I think we trade better freshness for some increased latency. But this increase in latency should generally not be visible to the client.

I actually think awake() should also be called by any 'application' Lambda functions that use the Neptune Gremlin Client and a refresh agent (I do a lot of testing with Lambda functions that call Neptune). The same problem will affect these Lambda functions as affects the neptune-endpoints-info-lambda: that is, refreshes may sometimes be scheduled for times when the Lambda context is asleep. I'm going to update the docs with this suggestion.

Thanks again for the PR.

ian

mti-takayama-t commented 1 year ago

@iansrobinson Thank you very much for your close testing and inspection. I agree with your discussion about the increased latency and the conclusion of its tradeoff.

We are highly grateful to you for taking the time to merge the PR.