awslabs / aws-lambda-rust-runtime

A Rust runtime for AWS Lambda
Apache License 2.0

Lambda Extension sometimes fails with "Connection reset by peer" #837

Closed dgarcia-collegeboard closed 5 months ago

dgarcia-collegeboard commented 6 months ago

Hey!

I've derived an example from this repository: HERE

But instead of pushing to firehose, I'm pushing to KDS. See the minimal example.

I've added some extra logic to my version of the above code where I'm providing custom credentials to the KDS client that's instantiated, but otherwise my implementation is mostly the same. Is there a common reason for the Connection reset by peer error? It seems like the extension doesn't spin up the logs processor unless I invoke my lambda again, but this could just be because the async processing means any logs made in the Processor's call method aren't printed until they're resolved. I've seen some calls to Kinesis succeed, but others seem to fail unexpectedly with this error:

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, Custom { kind: Other, error: Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" } }), connection: Unknown } }

The above error is logged when matching on the result of the future that is pinned inside a Box in the example, expanded from this value HERE
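For context, the shape of the code is roughly this (a simplified sketch in the spirit of the example, not my exact implementation; the stream name, partition key, and helper names are made up, and SDK module paths may differ by version):

```rust
use std::future::Future;
use std::pin::Pin;

use aws_sdk_kinesis::{primitives::Blob, Client};

// Simplified sketch of the pattern: the PutRecord call is boxed and pinned,
// and its result is matched later. Names here are illustrative only.
type PendingSend = Pin<Box<dyn Future<Output = Result<(), aws_sdk_kinesis::Error>> + Send>>;

fn make_send_future(client: Client, payload: Vec<u8>) -> PendingSend {
    Box::pin(async move {
        client
            .put_record()
            .stream_name("my-log-stream") // hypothetical stream name
            .partition_key("logs") // hypothetical partition key
            .data(Blob::new(payload))
            .send()
            .await?;
        Ok(())
    })
}

async fn resolve(pending: PendingSend) {
    // Matching on the resolved future is where "Connection reset by peer" shows up.
    match pending.await {
        Ok(()) => tracing::info!("pushed records to KDS"),
        Err(err) => tracing::error!("failed to push records: {err:?}"),
    }
}
```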

Any guidance would be much appreciated!

calavera commented 6 months ago

Is there a common reason for the Connection reset by peer error?

Is this error produced when you send the messages to KDS? This sounds like a problem with the SDK or KDS itself, unrelated to the extension.

It seems like the extension doesn't spin up the logs processor unless I invoke my lambda again,

The log processor runs as a sidecar and it should always be running. I don't think logs produced by the extension are sent back to the extension itself, if that's what you're expecting.

dgarcia-collegeboard commented 6 months ago

The log processor runs as a sidecar and it should always be running. I don't think logs produced by the extension are sent back to the extension itself, if that's what you're expecting.

I understand the part about extensions not receiving extension logs; it's clear to me that that's a separate, optional subscription. Thanks for clarifying that the extension is a sidecar and should always be running. That's also what I understood, but it helps narrow down the issue.

Is this error produced when you send the messages to KDS? This sounds like a problem with the SDK or KDS itself, unrelated to the extension.

Yes. The issue only occurs when sending to KDS. I'll spend some time looking into more examples of pushing to KDS via the Rust SDK; however, the example repository demonstrates how I am calling out to KDS HERE, and it's not much different from the example.

I haven't been able to gather further information on the underlying cause for the error, despite adding tracing logs and other things. I'll investigate some more KDS examples to see if my sample implementation differs in any meaningful way.

dgarcia-collegeboard commented 5 months ago

@calavera Hey there, finally have some time to revisit this.

Just wanted to check whether there are any new thoughts on this. Have you tried the minimal repro and seen the same behavior? I'm considering whether this is a side effect of the lambda runtime shutting down and the extension not closing the connections gracefully. If that's the case, does it mean that unfulfilled / unsent requests to external services need to be cached for future runs somehow? I'm having trouble diagnosing this issue.

calavera commented 5 months ago

I have not been able to reproduce this.

I'm considering whether this is a side effect of the lambda runtime shutting down and the extension not closing the connections gracefully.

This is a strong possibility. What you need to do in this case is subscribe your extension to SHUTDOWN events, and then flush the messages when the extension receives that event.
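Roughly, that could look something like this with the lambda_extension events processor (just a sketch; flush_buffered_records is a placeholder for whatever flushing logic your extension needs):

```rust
use lambda_extension::{service_fn, Error, Extension, LambdaEvent, NextEvent};

// Placeholder for whatever "flush everything we haven't sent yet" logic you have.
async fn flush_buffered_records() -> Result<(), Error> {
    Ok(())
}

async fn events_processor(event: LambdaEvent) -> Result<(), Error> {
    match event.next {
        // The execution environment is about to go away: flush anything
        // that hasn't been pushed to KDS / firehose yet.
        NextEvent::Shutdown(_) => flush_buffered_records().await,
        // Ignore any other events here.
        _ => Ok(()),
    }
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    Extension::new()
        .with_events(&["SHUTDOWN"])
        .with_events_processor(service_fn(events_processor))
        .run()
        .await
}
```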

dgarcia-collegeboard commented 5 months ago

I have not been able to reproduce this.

I'm considering whether this is a side effect of the lambda runtime shutting down and the extension not closing the connections gracefully.

This is a strong possibility. What you need to do in this case is subscribe your extension to SHUTDOWN events, and then flush the messages when the extension receives that event.

@calavera I'll look closer at this part at some point. In the meantime, I was able to find some more info that's perhaps worth considering in regard to the firehose example.

If I decrease the timeout_ms of the extension log buffer to the minimum (25 ms), a different error is sometimes thrown and different behavior seems to be exhibited. This side effect should be noticeable with the firehose example. Part of the firehose example is broken; here's why:

When the timeout is set low enough, the log buffer fires before any messages arrive (from what I can tell) and an attempt is made to send an empty record. This is because the match on the events doesn't change how the future is created based on the type of log delivered by the buffer; it still creates a set_record future. HERE But you only see this behavior if you remove the unreachable!(), which is also a change I had to make, because otherwise the extension crashes the lambda that is leveraging it. This is some logic I'm working to resolve in my derivative.
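The workaround I'm experimenting with is to filter the batch before creating the future and to skip empty batches entirely, something along these lines (a rough sketch of the guard, not the example's actual code):

```rust
use lambda_extension::{LambdaLog, LambdaLogRecord};

// Sketch of a guard: keep only function log lines and skip empty batches so we
// never build a send future for an empty record (and never hit unreachable!()).
fn build_payload(logs: Vec<LambdaLog>) -> Option<Vec<u8>> {
    let lines: Vec<String> = logs
        .into_iter()
        .filter_map(|log| match log.record {
            LambdaLogRecord::Function(line) => Some(line),
            // Ignore platform / extension records instead of panicking on them.
            _ => None,
        })
        .collect();

    if lines.is_empty() {
        // Nothing worth sending: don't create a future at all.
        None
    } else {
        Some(lines.join("\n").into_bytes())
    }
}
```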

It seems like there's a bit of a race condition here between the lambda running, the extension starting / ending, and the extension log buffer delivering the logs to the extension.

With the default timeout_ms (1000 ms) for extensions, if a lambda runs quickly enough, the log buffer won't deliver logs for a long time (in lambda terms), and you won't see the extension ingesting those logs until separate runs of the lambda.

(The above is potentially related to what you said about this here:)

This is a strong possibility. What you need to do in this case is subscribe your extension to SHUTDOWN events, and then flush the messages when the extension receives that event.

This seems to cause either 1) delays in the logs getting forwarded or 2) the logs not getting forwarded until subsequent lambda invocations. Situation 2 seems to happen semi-often, so there's some data loss to consider. What if the last lambda run had critical info we couldn't forward because the log buffer didn't spin up the logs processor?

On the other hand, with the timeout low enough, the log buffer may spin up the extension with an empty vec, causing incomplete messages to be sent and the error below to be thrown:

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Canceled, hyper::Error(IncompleteMessage))

I'll try to update with more info as soon as I have some, but hopefully this all makes sense. Please let me know if I missed anything!

dgarcia-collegeboard commented 5 months ago

Just adding a few more data points.

Cold start -> cold start (timeout_ms set to 25 ms):

- Cold start 1: the lambda outputs some logs, the extension gets triggered to send logs to KDS / firehose, and the future is created and pinned.
- Cold start 2: the extension spins up again and the results of the pinned future are output with tracing.
- If the lambda completes the invocation before the request to KDS / firehose is fulfilled, on the next cold start the pinned future seems to result in the error:

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(ChannelClosed), connection: Unknown } })

- Otherwise: the lambda outputs some logs, and the extension ingests these logs and sends them to KDS / firehose successfully and immediately.

Cold start -> warm start (timeout_ms set to 1000 ms):

- Cold start: the lambda outputs some logs, the extension gets triggered to send logs to KDS / firehose, the future is created and pinned, and the logs are pushed to KDS / firehose immediately.
- Warm start: the lambda outputs some logs, but the extension is never triggered to send logs until the lambda execution environment is shut down 5 minutes later; when the execution environment is destroyed, logs are output declaring a successful push to KDS / firehose.

1 cold start -> 20 warm starts (timeout_ms set to 1000 ms):

- Cold start: the lambda outputs some logs, the extension gets triggered to send logs to KDS / firehose, the future is created and pinned, and the logs are pushed to KDS / firehose immediately.
- Warm starts 1-19: the lambda outputs some logs, the extension gets triggered to send logs to KDS / firehose, the future is created and pinned, and the logs are pushed to KDS / firehose immediately.
- Warm start 20: the lambda outputs some logs, but the extension is never triggered to send logs until the execution environment is shut down 5 minutes later; when it is destroyed, logs are output declaring a successful push to KDS / firehose.


It seems like there are cold start -> cold start issues if you use a low timeout, and there are separate issues for cold start -> warm start, where there's a large delay before the extension gets kicked off to forward logs.

It seems like the default timeout for the extension log buffer at least doesn't have as many consistency issues with forwarding the logs, but there's still the question of how long it takes the warm start logs to get sent. It's interesting to note that the delay I'm seeing in warm starts only exists for the last invocation of the lambda, meaning the last warm start. The warm starts in between don't seem to have latency issues.

Maybe this issue is more workable starting from the default timeout_ms configuration; however, it's a bit unclear to me why reducing the timeout_ms results in connection issues between the services.

For the latency on the final warm start: Could this be resolved by subscribing to more event types to kick off the extension more times?

calavera commented 5 months ago

Woow, thanks a lot for diving deep into this.

For the latency on the final warm start: Could this be resolved by subscribing to more event types to kick off the extension more times?

I'm not sure, to be honest. The buffering configuration basically tells Lambda how often it needs to send the logs to the extension. I would expect that with a low timeout you'd get the logs consistently, at a faster pace, and that it would not have any impact on external connections. An extension is just an HTTP API that receives data from Lambda; it's unclear to me too why reducing the timeout would result in connectivity issues.
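For reference, setting the buffering configuration from the extension side looks roughly like this (a sketch based on my reading of the crate; the with_log_buffering builder method and LogBuffering field names are assumptions that may differ slightly between versions):

```rust
use lambda_extension::{service_fn, Error, Extension, LambdaLog, LogBuffering, SharedService};

async fn logs_handler(logs: Vec<LambdaLog>) -> Result<(), Error> {
    // Whatever processing / forwarding to KDS or firehose you do goes here.
    println!("received {} log records", logs.len());
    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    Extension::new()
        // Ask Lambda to flush the log buffer every 25 ms instead of the
        // default 1000 ms (the setting discussed above).
        .with_log_buffering(LogBuffering {
            timeout_ms: 25,
            ..Default::default()
        })
        .with_logs_processor(SharedService::new(service_fn(logs_handler)))
        .run()
        .await
}
```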

A cold start is basically a new Lambda sandbox starting up. I don't understand why consecutive cold starts cause problems connecting to KDS. My understanding is that each cold start also spawns a completely new extension.

@greenwoodcm might have more context about how Lambda interacts with extensions. I think I'm as lost as you are here. 😔

dgarcia-collegeboard commented 5 months ago

@calavera

For the latency on the final warm start: Could this be resolved by subscribing to more event types to kick off the extension more times?

Regarding this, there are no other event types we could use to kick off the extension. I came to the conclusion that the event types are technically missing warm start begin / end, so the extension only gets kicked off in narrow situations (which mostly works, but seems to have some edge cases), namely: Lambda cold start (extension init), timeout_ms exceeded, and lambda shutdown.

Note that none of the above events really coincide with a warm start invocation or the end of a warm start invocation. I believe this is why we see a delay in the final warm start delivery: the only event kicking off the extension at that point is the shutdown of the execution environment. I can't fully explain why all the other warm starts before the last have no delay, though.

bnusunny commented 5 months ago

You can register for "INVOKE" event to receive notifications when an invoke happens (both cold and warm).

For the end of an invoke, Lambda doesn't send notifications to extensions, but you can add that yourself. Check out this example.

dgarcia-collegeboard commented 5 months ago

Hey there!

Just checking in; there are no updates I can provide from my side, but I did find this response interesting and potentially relevant to the discussion: https://github.com/awslabs/aws-sdk-rust/issues/1106#issuecomment-2016861083

dgarcia-collegeboard commented 5 months ago

Another potentially related ticket: https://github.com/awslabs/aws-sdk-rust/issues/440

calavera commented 5 months ago

hey @dgarcia-collegeboard, I'm going to close this issue since it looks like a problem with the SDK. That way we don't duplicate communications. Let me know if someone finds out that the issue is related to the extension library, and I'll be happy to reopen it, but from what I can read, it sounds like the problem is in the SDK's hyper connection.

github-actions[bot] commented 5 months ago

This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one.

dgarcia-collegeboard commented 5 months ago

@calavera understood! Thanks for your help in trying to get to the bottom of this. Any info you might be able to provide in the other issue would be appreciated! :D