Open cjxkb210 opened 4 years ago
Hi,
We'd like to expose the option of having multiple consumers of S3 events for a given prefix. AWS only allows a single consumer of S3 events per prefix; that consumer can be a Lambda function, an SQS queue, or an SNS topic.
If the RedshiftLambdaLoader could consume S3 events indirectly via an SQS queue, this would open up the option of making an SNS topic the subscriber of the S3 events. We could then subscribe multiple queues to that topic: for example, loading the data into Redshift via the RedshiftLambdaLoader while independently consuming from a separate queue with a process that ingests the data into an index.
I've created the attached PR for review. Please note that in my example I've also allowed processing of multiple SQS messages in the same request. In my testing I enabled raw message delivery on an SNS topic and subscribed to it an SQS queue that the RedshiftLambdaLoader could consume from; this minimised the impact of the changes.
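To illustrate the idea (this is a minimal sketch, not the PR's actual code): with raw message delivery enabled, each SQS record body is the original S3 event JSON, so the handler can iterate the SQS records and then the S3 records inside each one. The processS3Record function below is a hypothetical stand-in for the loader's existing per-file processing.

```javascript
// Hypothetical stand-in for the loader's existing per-file processing.
async function processS3Record(bucket, key, context) {
    console.log(`would process s3://${bucket}/${key}`);
}

// Sketch: unwrap S3 event notifications delivered via SNS -> SQS with
// raw message delivery enabled, so each SQS record body is the S3 event itself.
exports.handler = async (event, context) => {
    for (const sqsRecord of event.Records || []) {
        // With raw message delivery the body is the S3 event JSON; without it,
        // the body would be an SNS envelope and we would parse body.Message instead.
        const s3Event = JSON.parse(sqsRecord.body);
        for (const s3Record of s3Event.Records || []) {
            const bucket = s3Record.s3.bucket.name;
            const key = decodeURIComponent(s3Record.s3.object.key.replace(/\+/g, ' '));
            await processS3Record(bucket, key, context);
        }
    }
};
```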
Please note, I'm not fully convinced that processing multiple SQS messages in the same request is safe with the batch-locking logic. I believe it may handle this, although perhaps it was not intended to do so?
Could you review and let me know your thoughts on exposing such a feature? Would this approach cause you any concerns?
Thanks, Chris.
Hi @IanMeyers, would you mind reviewing my suggestion above?
Much appreciated, Chris.
Sorry, I have been on holiday. Will review Monday the 10th.
Thank you, have a great weekend.
After testing, I confirmed my initial concern that processing the SQS messages in a tight loop will cause problems:
"Failing further processing to Batch 7a2afa1f-7dc5-45e8-b0ad-677ac59738d1 which may be stuck in 'locked' state. If so, unlock the back using node unlockBatch.js <batch ID>, delete the processed file marker with node processedFiles.js -d <filename>, and then re-store the file in S3"
I'm still digging into the cause; I think a Redshift maintenance window might be in play. Perhaps the batch was left in a locked state because the Lambda timed out and did not mark the batch as failed.
Lambda timeouts can indeed result in locked batches.
Yeah, I've seen logs that look like this: "Remaining duration of -90597 insufficient to load cluster".
It's possible I simply got unlucky and the function did not manage to fail the batch, leaving it in a locked state.
I've had my changes running, loading batches via SQS, for 12 hours now. The data I'm testing with gets placed onto S3 every 15 minutes.
I'll be testing this over a longer time period, but this is ready for review now. I've increased my timeouts a little and added failed-batch re-processing with SQS delayed delivery to cope with the Redshift maintenance windows.
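For context, the delayed redelivery could look roughly like the sketch below (AWS SDK for JavaScript v2). The requeueForRetry helper, the RETRY_QUEUE_URL environment variable, and the 15-minute delay are illustrative assumptions, not values taken from the PR.

```javascript
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// Sketch: re-enqueue a failed batch's message with SQS delayed delivery so it
// is retried after a Redshift maintenance window has (hopefully) ended.
async function requeueForRetry(messageBody) {
    await sqs.sendMessage({
        QueueUrl: process.env.RETRY_QUEUE_URL,
        MessageBody: messageBody,
        DelaySeconds: 900 // SQS allows at most a 15-minute per-message delay
    }).promise();
}
```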
Interestingly, I noticed this in the logs (I don't think it is in any way related to this PR):
2020-08-12T13:48:12.102+01:00 START RequestId: 1a5aa337-9f84-5f13-9486-f49a78bcbb46 Version: $LATEST
2020-08-12T13:48:12.115+01:00 info: Failing Batch 48e2e469-2fbd-4e98-99c6-9b028831a612 due to "Remaining duration of -89145 insufficient to load cluster"
2020-08-12T13:48:12.458+01:00 END RequestId: 1a5aa337-9f84-5f13-9486-f49a78bcbb46 REPORT RequestId: 1a5aa337-9f84-5f13-9486-f49a78bcbb46 Duration: 355.01 ms Billed Duration: 400 ms Memory Size: 256 MB Max Memory Used: 103 MB
I confirmed that the "Remaining duration of -89145 insufficient to load cluster" message did pertain to the same request, 1a5aa337-9f84-5f13-9486-f49a78bcbb46, by checking the batches table.
This means the context.getRemainingTimeInMillis() function returned -89145 just 13 ms after invocation, when my Lambda had been configured with a whopping 2 min 30 s timeout. The number makes no sense and I'm not convinced getRemainingTimeInMillis is behaving nicely. I've upgraded my Node.js runtime from 10.x to 12.x in case there was an AWS defect. I couldn't find a record of one, but I'll see if the upgrade eliminates the strange behavior around context.getRemainingTimeInMillis().
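One way to help pin this down would be a small diagnostic at the very top of the handler, so a nonsensical remaining-time value can be tied to a specific request id in the logs. This sketch is illustrative only and not part of the PR.

```javascript
// Diagnostic sketch: log what context.getRemainingTimeInMillis() reports at
// the start of the invocation, alongside the request id.
exports.handler = async (event, context) => {
    const remainingAtStart = context.getRemainingTimeInMillis();
    console.log(`RequestId ${context.awsRequestId}: ${remainingAtStart} ms remaining at start`);
    if (remainingAtStart <= 0) {
        // Should never happen with a 2m30s timeout; logging it here would help
        // confirm whether the runtime or the loader's own timing logic is at fault.
        console.warn('getRemainingTimeInMillis returned a non-positive value at invocation start');
    }
    // ... existing loader logic would continue here ...
};
```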
Hmm, I can't reproduce this. Also, if this happened consistently, then no one's loads would ever succeed. I think there's something else going on here, but I don't see anything obvious in the code that could cause it - this is a single instance of the function being used, and while it accesses context via a parent reference, it shouldn't cause any consistency issues. Let me know what happens with 12.x?
Yeah, I've just seen the same issue with 12.x. It's infrequent - I've seen it 3 times in 48 batches.
Can you please raise a support ticket?
Yes, I'll raise a ticket for it. I'll spend a little more time this afternoon to see if I can spot anything that might explain it.
@IanMeyers this has gone through rounds of testing and is looking good. What are your thoughts on rolling this in?
Kind Regards, Chris.
Thanks for the info. In general I'm supportive, but I need to do a code review. I will book out some time tomorrow (18/9) to do so.
Thank you @IanMeyers
Added the ability to consume S3 event messages from an SQS queue. This facilitates having multiple consumers of S3 event notifications for the same prefix: you could, for example, subscribe an SNS topic to the S3 events for a given prefix and then have multiple SQS queues subscribed to that topic. This expands the options for running several different consumers against events from the same S3 prefix.
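For anyone wiring this up, the SNS-to-SQS fan-out could be set up roughly as in the sketch below (AWS SDK for JavaScript v2). The subscribeQueue helper and the ARN parameters are placeholders; the bucket's notification configuration targeting the topic is set up separately, and the queue also needs an access policy allowing the topic to send to it. In practice this would usually live in CloudFormation or Terraform rather than ad-hoc SDK calls.

```javascript
const AWS = require('aws-sdk');
const sns = new AWS.SNS();

// Sketch: subscribe an SQS queue to the SNS topic that receives the S3 events,
// with raw message delivery so the queue receives the S3 event JSON directly
// rather than an SNS envelope.
async function subscribeQueue(topicArn, queueArn) {
    await sns.subscribe({
        TopicArn: topicArn,
        Protocol: 'sqs',
        Endpoint: queueArn,
        Attributes: { RawMessageDelivery: 'true' },
        ReturnSubscriptionArn: true
    }).promise();
}
```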
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.