awslabs / aws-lambda-redshift-loader

Amazon Redshift Database Loader implemented in AWS Lambda

process s3 event messages from SQS #215

Open cjxkb210 opened 4 years ago

cjxkb210 commented 4 years ago

Added the ability to consume S3 event messages from an SQS queue. This makes it possible to have multiple consumers of S3 event notifications for the same prefix: for example, you could subscribe an SNS topic to the S3 event notifications for a given prefix and then subscribe multiple SQS queues to that topic, so that several different consumers can act on events for the same prefix.

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

cjxkb210 commented 4 years ago

Hi,

We'd like the option of having multiple consumers of S3 events for a given prefix. AWS only allows a single notification destination for S3 events on a given prefix, and that destination can be a Lambda function, an SQS queue, or an SNS topic.

If the RedshiftLambdaLoader could consume S3 events indirectly via an SQS queue, this would open up the option of using an SNS topic as the subscriber to the S3 events. We could then subscribe multiple queues to that topic: for example, loading the data into Redshift via the RedshiftLambdaLoader while a separate process independently consumes from another queue to ingest the same data into an index.
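As a rough illustration of the fan-out described above (this is not code from the PR), multiple queues can be subscribed to a single topic that receives the S3 event notifications. The topic and queue ARNs below are placeholders, and the sketch uses the AWS SDK for JavaScript v2:

```js
const AWS = require('aws-sdk');
const sns = new AWS.SNS({ region: 'eu-west-1' });

// Placeholder ARNs: the bucket's event notification for the prefix publishes
// to this topic, and each independent consumer gets its own queue.
const TOPIC_ARN = 'arn:aws:sns:eu-west-1:123456789012:s3-prefix-events';
const QUEUE_ARNS = [
    'arn:aws:sqs:eu-west-1:123456789012:redshift-loader-queue',
    'arn:aws:sqs:eu-west-1:123456789012:search-indexer-queue'
];

async function fanOutToQueues() {
    for (const queueArn of QUEUE_ARNS) {
        const result = await sns.subscribe({
            TopicArn: TOPIC_ARN,
            Protocol: 'sqs',
            Endpoint: queueArn,
            // Raw message delivery passes the original S3 event JSON through
            // unchanged, so consumers don't have to unwrap an SNS envelope.
            Attributes: { RawMessageDelivery: 'true' },
            ReturnSubscriptionArn: true
        }).promise();
        console.log('Subscribed', queueArn, '->', result.SubscriptionArn);
    }
    // Note: each queue also needs a queue policy that allows sns.amazonaws.com
    // to SendMessage on behalf of the topic.
}

fanOutToQueues().catch(console.error);
```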

I've created the attached PR for review. Please note that in my example I've also allowed multiple SQS messages to be processed in the same request. In my testing I enabled raw message delivery on the SNS topic and subscribed an SQS queue to it for the RedshiftLambdaLoader to consume from, which minimised the impact of the changes.
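For reference, the kind of unwrapping this enables looks roughly like the sketch below. This is not the PR's actual code: processS3Record() is a hypothetical stand-in for the loader's existing per-record processing, and raw message delivery is assumed so that the SQS body is the original S3 notification JSON.

```js
// Hypothetical stand-in for the loader's existing per-S3-record processing.
async function processS3Record(s3Record) {
    console.log('would process s3://' + s3Record.s3.bucket.name + '/' + s3Record.s3.object.key);
}

exports.handler = async function (event) {
    for (const record of event.Records || []) {
        if (record.eventSource === 'aws:sqs') {
            // SQS-delivered message: with raw message delivery the body is the
            // original S3 event notification, which itself contains Records.
            const body = JSON.parse(record.body);
            for (const s3Record of body.Records || []) {
                await processS3Record(s3Record);
            }
        } else if (record.eventSource === 'aws:s3') {
            // Direct S3 -> Lambda notification: unchanged behaviour.
            await processS3Record(record);
        }
    }
};
```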

Please note that I'm not fully convinced that processing multiple SQS messages in the same request is safe with respect to the batch locking logic. I believe the locking may handle it, although perhaps it was not intended to.
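To make the concern concrete: the batch locking pattern is essentially a conditional update, where only one writer should be able to move a batch from open to locked. The sketch below is illustrative only, with made-up table and attribute names rather than the loader's actual schema; the worry is that several SQS messages handled in one loop could contend for the same lock, and that an invocation which times out mid-loop could leave a batch stuck in the locked state.

```js
const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient({ region: 'eu-west-1' });

async function tryLockBatch(batchId) {
    try {
        await ddb.update({
            TableName: 'ExampleBatches',                    // assumed table name, not the loader's
            Key: { batchId: batchId },
            UpdateExpression: 'SET #s = :locked',
            ConditionExpression: '#s = :open',              // only lock a batch that is still open
            ExpressionAttributeNames: { '#s': 'status' },
            ExpressionAttributeValues: { ':locked': 'locked', ':open': 'open' }
        }).promise();
        return true;
    } catch (err) {
        if (err.code === 'ConditionalCheckFailedException') {
            // Another invocation already holds the lock. If that invocation
            // later times out before marking the batch failed, the batch stays
            // 'locked' and needs manual unlocking (e.g. unlockBatch.js).
            return false;
        }
        throw err;
    }
}
```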

Could you review and let me know your thoughts on exposing such a feature? Would this approach cause you any concerns?

Thanks, Chris.

cjxkb210 commented 4 years ago

Hi @IanMeyers, would you mind reviewing my suggestion above?

Much appreciated, Chris.

IanMeyers commented 4 years ago

Sorry, have been on holiday. Will review Monday the 10th.

cjxkb210 commented 4 years ago

Thank you, have a great weekend.

cjxkb210 commented 4 years ago

After testing, I've confirmed my initial concern that processing the SQS messages in a tight loop will cause problems.

"Failing further processing to Batch 7a2afa1f-7dc5-45e8-b0ad-677ac59738d1 which may be stuck in 'locked' state. If so, unlock the batch using node unlockBatch.js <batch ID>, delete the processed file marker with node processedFiles.js -d <filename>, and then re-store the file in S3"

I'm still digging into the cause; I think a Redshift maintenance window might be in play. Perhaps the batch was left in a locked state because the Lambda timed out and did not mark the batch as failed.

IanMeyers commented 4 years ago

Lambda timeouts can indeed result in locked batches.

cjxkb210 commented 4 years ago

Yeah, I've seen logs that look like this: "Remaining duration of -90597 insufficient to load cluster".

It's possible I simply got unlucky and the function did not manage to fail the batch, leaving it in a locked state.

I've had my changes running via SQS, loading batches, for 12 hours now. The data I'm testing with lands on S3 every 15 minutes.

cjxkb210 commented 4 years ago

I'll be testing this over a longer period, but it is ready for review now. I've increased my timeouts a little and added failed-batch re-processing with SQS delayed delivery to cope with the Redshift maintenance windows.
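A sketch of the delayed re-processing idea (placeholder queue URL and payload shape; not necessarily how the PR implements it): when a load fails because a maintenance window is in play, the triggering message is re-sent with a delivery delay so it is retried later.

```js
const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'eu-west-1' });

// Placeholder queue URL.
const QUEUE_URL = 'https://sqs.eu-west-1.amazonaws.com/123456789012/redshift-loader-queue';

async function requeueForRetry(s3EventBody) {
    await sqs.sendMessage({
        QueueUrl: QUEUE_URL,
        MessageBody: JSON.stringify(s3EventBody),
        // Per-message delivery delay; SQS allows at most 900 seconds
        // (15 minutes), so longer waits need repeated re-queueing.
        DelaySeconds: 900
    }).promise();
}
```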

cjxkb210 commented 4 years ago

Interestingly, I noticed this in the logs (I don't think it is in any way related to this PR):

2020-08-12T13:48:12.102+01:00 START RequestId: 1a5aa337-9f84-5f13-9486-f49a78bcbb46 Version: $LATEST

2020-08-12T13:48:12.115+01:00 info: Failing Batch 48e2e469-2fbd-4e98-99c6-9b028831a612 due to "Remaining duration of -89145 insufficient to load cluster"

2020-08-12T13:48:12.458+01:00 END RequestId: 1a5aa337-9f84-5f13-9486-f49a78bcbb46 REPORT RequestId: 1a5aa337-9f84-5f13-9486-f49a78bcbb46 Duration: 355.01 ms Billed Duration: 400 ms Memory Size: 256 MB Max Memory Used: 103 MB

I confirmed that the "Remaining duration of -89145 insufficient to load cluster" message did pertain to the same request (1a5aa337-9f84-5f13-9486-f49a78bcbb46) by checking the batches table.

This means the context.getRemainingTimeInMillis() function returned -89145 just 13 ms after the request was invoked, even though my Lambda is configured with a whopping 2 minute 30 second timeout. The number makes no sense and I'm not convinced getRemainingTimeInMillis is behaving nicely. I've upgraded my Node.js runtime from 10.x to 12.x in case there was an AWS defect. I couldn't find a record of one, but I'll see if the upgrade eliminates the strange behavior around context.getRemainingTimeInMillis().
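For context, the check being discussed is essentially a remaining-time guard like the sketch below (not the loader's actual code; the 60-second budget is an assumption). With a 2 minute 30 second timeout the value should start near 150000 and count down, which is why a negative number 13 ms after START looks wrong.

```js
exports.handler = function (event, context, callback) {
    const remainingMs = context.getRemainingTimeInMillis();
    const requiredMs = 60000; // assumed minimum time budget to run a load

    if (remainingMs < requiredMs) {
        // Fail fast rather than start a load that cannot finish; this mirrors
        // the "Remaining duration of N insufficient to load cluster" log lines
        // quoted above.
        return callback(null, 'Remaining duration of ' + remainingMs + ' insufficient to load cluster');
    }

    // ...otherwise proceed with the load...
    callback(null, 'OK');
};
```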

IanMeyers commented 4 years ago

Hmm. I can't reproduce this. Also, if it happened consistently, then no one's loads would ever succeed. I think there's something else going on here, but I don't see anything obvious in the code that could cause it: this is a single instance of the function being used, and while it accesses the context via a parent reference, that shouldn't cause any consistency issues. Let me know what happens with 12.x?

cjxkb210 commented 4 years ago

Yeah, I've just seen the same issue with 12.x. It's infrequent: I've seen it 3 times in 48 batches.

IanMeyers commented 4 years ago

Can you please raise a support ticket?

cjxkb210 commented 4 years ago

Yes, I'll raise a ticket for it. I'll also spend a little more time this afternoon to see if I can spot anything that might explain it.

cjxkb210 commented 4 years ago

@IanMeyers, this has gone through several rounds of testing and is looking good. What are your thoughts on rolling this in?

Kind Regards, Chris.

IanMeyers commented 4 years ago

Thanks for the info. In general I'm supportive, but I need to do a code review. I will book out some time tomorrow (18/9) to do so.

cjxkb210 commented 4 years ago

Thank you @IanMeyers