aws-samples / lambda-refarch-streamprocessing

Serverless Reference Architecture for Real-time Stream Processing

Explain failure scenarios #1

Closed · mhart closed this issue 8 years ago

mhart commented 8 years ago

Considering this is a reference architecture, it would be good to have some explanation of the failure scenarios here, even if just in the comments.

Two things jump out at me initially:

  1. The BatchWriteItem call may return items that were unprocessed – these don't seem to be dealt with at all. Are they just ignored, or does the SDK retry them behind the scenes? Either way, a comment explaining this in the source would be great.
  2. If an error occurs in the BatchWriteItem call, it is logged and the function exits. It would be great to have a comment in the source explaining what actually happens in this case. Will the Lambda function be called again X times to retry this? Or what is the user supposed to do? (Both paths are marked in the sketch below.)
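
For reference, a minimal sketch of the call being discussed, assuming the Node.js aws-sdk that the sample uses; the table name and item shape are placeholders rather than the sample's actual code:

```javascript
var AWS = require('aws-sdk');
var dynamo = new AWS.DynamoDB();

exports.handler = function (event, context) {
  var params = {
    RequestItems: {
      // 'EventTable' and the item attributes are placeholders
      EventTable: [
        {PutRequest: {Item: {Id: {S: 'example-id'}, Payload: {S: 'example-data'}}}}
      ]
    }
  };

  dynamo.batchWriteItem(params, function (err, data) {
    if (err) {
      // Path 2: the whole call failed. Failing the function makes the
      // Kinesis event source redeliver the entire batch.
      return context.fail(err);
    }
    if (Object.keys(data.UnprocessedItems).length > 0) {
      // Path 1: the call succeeded, but these items were throttled and NOT
      // written. The SDK does not retry them on its own, so without extra
      // code they are silently dropped at this point.
    }
    context.succeed('Processed batch');
  });
};
```
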
melwin commented 8 years ago

Thanks for your comment! You are right that the batch write to DynamoDB can fail. Currently that is not handled especially gracefully or optimally in the code, as it assumes you have sufficient capacity provisioned on your DynamoDB table to accept all the writes. If the BatchWriteItem call fails, triggering context.fail(err), the batch will be retried by the Kinesis event source. So at least the data is not lost, but it's inefficient, since reprocessing the same batch will BatchWriteItem all the entries into the table again.

I'll add a comment about this in the code and perhaps look into also adding a retry of the unprocessed entries - as an attempt to get the complete batch through - to show what that pattern would look like.

How does that sound to you?

mhart commented 8 years ago

It sounds... ok... ish.

I mean, this project really should be demonstrating best-practice – and because it's from Amazon most people will assume that it is demonstrating that, and will copy/paste/use-similar-code in their projects. So I think it should strive to have the most realistic, best-practice code that it possibly can. People should be able to use this and have a reasonable expectation that they won't lose events.

Everyone I know who's used DynamoDB has encountered throttling at some point – it's a highly realistic scenario and happens regularly – even more so if DynamoDB itself is having problems and Amazon chooses to throttle it from their end. So at the very least I think the UnprocessedItems should be retried – preferably with exponential backoff.
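
A hedged sketch of what that retry could look like, reusing the `dynamo` client from the sketch above; the attempt cap and base delay are arbitrary illustration values, not recommendations:

```javascript
// Retry UnprocessedItems with exponential backoff before giving up and
// letting the Kinesis event source redeliver the whole batch.
function writeWithRetry(params, attempt, callback) {
  dynamo.batchWriteItem(params, function (err, data) {
    if (err) {
      return callback(err);
    }
    var unprocessed = data.UnprocessedItems;
    if (Object.keys(unprocessed).length === 0) {
      return callback(null); // everything was written
    }
    if (attempt >= 5) {
      return callback(new Error('UnprocessedItems remained after retries'));
    }
    var delay = Math.pow(2, attempt) * 50; // 50ms, 100ms, 200ms, ...
    setTimeout(function () {
      // UnprocessedItems has the same shape as RequestItems, so it can be
      // passed straight back into BatchWriteItem.
      writeWithRetry({RequestItems: unprocessed}, attempt + 1, callback);
    }, delay);
  });
}
```

Inside the handler this would replace the direct batchWriteItem call, e.g. `writeWithRetry(params, 0, function (err) { err ? context.fail(err) : context.succeed('Processed batch'); });`.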

In terms of what should happen when an actual error is returned – I'm not entirely sure. The fact that Kinesis/Lambda will retry is good (how many retries is it? 3?) – and probably sufficient for most scenarios, but perhaps there's some error-specific logic that should occur (I know some errors encourage retries more than others).
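
If error-specific handling is wanted, one possible shape (inside the BatchWriteItem callback shown earlier) is to branch on the `retryable` flag that the Node.js SDK sets on its error objects; what, if anything, to do differently for non-retryable errors is a judgment call and not something the sample prescribes:

```javascript
if (err) {
  if (err.retryable) {
    // Throttling / transient errors: fail so the batch is redelivered.
    return context.fail(err);
  }
  // Non-retryable errors (e.g. validation failures) will fail the same way
  // on every redelivery, so log enough detail to diagnose them first.
  console.error('Non-retryable error:', err.code, err.message);
  return context.fail(err);
}
```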

Having all of this documented and spelt out (in comments at least) will make it much clearer to the developer what they need to do, if anything, to make the code production ready. But ideally it should be this way out of the box – if Amazon can't make it production ready, then who can? :smile_cat:

melwin commented 8 years ago

Thanks for your input - I'll look at adding the retry to make that more efficient from a DynamoDB perspective. Having the retries with exponential backoff in Lambda also means the Lambda execution takes longer and the related cost goes up, so it's a balance of what is most cost efficient in your case.

Just note that retrying the UnprocessedItems is not needed to avoid losing events. The whole batch will be retried for as long as the records are retained in the Kinesis stream, which depends on how the stream is configured (default 24h, max 168h). So as long as the writes can be processed within that time frame (reasonable to expect in most cases, since retention can be set to one week), no events are lost.
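
As an aside, the retention period can be raised above the 24-hour default; a sketch with the Node.js SDK (the stream name is a placeholder):

```javascript
var AWS = require('aws-sdk');
var kinesis = new AWS.Kinesis();

// Extend retention to the 168-hour maximum mentioned above.
kinesis.increaseStreamRetentionPeriod({
  StreamName: 'example-stream',   // placeholder
  RetentionPeriodHours: 168
}, function (err) {
  if (err) console.error('Could not extend retention:', err);
});
```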

mhart commented 8 years ago

> The whole batch will be retried for as long as the records are retained in the Kinesis stream

Oh, I thought the whole batch was only retried if there was an error.

mhart commented 8 years ago

Also, exponential backoff is already implemented for certain scenarios in the SDK, so it should already be expected that the Lambda execution could take longer anyway.
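
For reference, that built-in behaviour is also configurable on the client; a sketch of the relevant options in the Node.js SDK (the values are examples only):

```javascript
var AWS = require('aws-sdk');

// maxRetries and retryDelayOptions tune the SDK's own retry/backoff for
// throttled or transient request failures. Note they do not cover
// UnprocessedItems, which come back in a successful response.
var dynamo = new AWS.DynamoDB({
  maxRetries: 10,                 // example value
  retryDelayOptions: {base: 50}   // base backoff delay in milliseconds
});
```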

melwin commented 8 years ago

Hmm... Looking at this again - I think you are right. There is a situation where it might return UnprocessedItems but without an error. The code was written expecting an error in that case, which would cause the whole batch to be retried. I will do some testing and update the code. Thanks again!

mhart commented 8 years ago

Any updates on this?

mhart commented 8 years ago

What's the status on this issue?

johnnybsd commented 8 years ago

We are currently looking into this and will fix it in the near future.

mikedeck commented 8 years ago

I've updated the code in the processor function to retry failed items. We're also working on some more detailed documentation about the failure semantics of the AWS Lambda/Amazon Kinesis event source mapping. I'll update with a link when that is ready.

mikedeck commented 8 years ago

Here is a link to some updated documentation that describes the retry policy for Lambda functions: http://docs.aws.amazon.com/lambda/latest/dg/intro-core-components.html#retries-on-errors. The third bullet point is what's relevant to this code sample:

> If the Lambda function is subscribed to an Amazon Kinesis or DynamoDB stream and throws an exception, Lambda will attempt to process the erring batch of records until the time the data expires, which can be up to 7 days for Amazon Kinesis. The exception is treated as blocking; Lambda will not read any new records from the stream until the failed batch of records either expires or succeeds. This way Lambda maintains the order in which events are processed from the stream.

mhart commented 8 years ago

Great, thanks all!