HumanCellAtlas / data-store

Design specs and prototypes for the HCA Data Storage System (DSS, "blue box")
https://dss.staging.data.humancellatlas.org/

S3 SNS deliver failed to dss-indexer #1629

Closed Bento007 closed 5 years ago

Bento007 commented 6 years ago

It looks like there is a failure in the subscription to the AWS bucket. Notifications are not being sent to the indexer.

Bento007 commented 6 years ago

In some cases no notification was sent to the SNS queue

8a13a019-c9d4-4a95-900b-4ba3dac64fe6: Stored in S3, no sync logs, no index logs
0d32dc96-7963-48e5-8d98-c19af48b9925: Stored in S3, sync to GCP failed, no index logs
7ee0d767-b690-4a5d-8c72-397f8aaf246f: Stored in S3, sync to GCP failed, no index logs
a81d6375-b622-4596-ac18-010a244e8bdf: Stored in S3, sync to GCP failed, no index logs
752b4bb5-2406-43bb-84ff-0165f506ea7f: Stored in S3, sync to GCP failed, no index logs
b5ee4165-4cd7-4699-8eac-298d0a7ea94f: Stored in S3, sync to GCP failed, no index logs
b9c1bea2-9eef-4963-956f-22504d029235: Stored in S3, sync to GCP failed, no index logs
c7a2b7f5-f757-47ff-bdd7-2324f6b02705: Stored in S3
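A check like the one above could be automated with a small reconciliation step. The following is only a sketch, not what the DSS actually runs: the function name `find_missing` is hypothetical, and it assumes the bundle IDs seen in S3, in the sync logs, and in the index logs have already been collected into sets by some other means.

```python
def find_missing(stored, synced, indexed):
    """Report which stored bundles never showed up in the sync or index logs.

    All three arguments are sets of bundle IDs (hypothetical inputs; how they
    are gathered from S3 listings and CloudWatch logs is out of scope here).
    Returns {bundle_id: [missing stages]} for bundles with at least one gap.
    """
    report = {}
    for bundle_id in sorted(stored):
        missing = []
        if bundle_id not in synced:
            missing.append("sync")
        if bundle_id not in indexed:
            missing.append("index")
        if missing:
            report[bundle_id] = missing
    return report


if __name__ == "__main__":
    # Placeholder IDs, not real bundles.
    stored = {"bundle-a", "bundle-b", "bundle-c"}
    synced = {"bundle-a", "bundle-c"}
    indexed = {"bundle-a"}
    for bundle_id, stages in find_missing(stored, synced, indexed).items():
        print(bundle_id, "missing:", ", ".join(stages))
```

Running this after each scale test would give a count of missing invocations per stage rather than a hand-built list.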

kozbo commented 5 years ago

Trent created a lambda to control the scale test, which simplifies DSS scale testing.

Bento007 commented 5 years ago

This task would be simpler if the scale test data were deleted. I also need to diversify the data being uploaded to the DSS. This will involve supplying the lambdas with different bundles already stored in an upload bucket.

kozbo commented 5 years ago

Amazon support ticket 5442487371 interactions so far:

Amazon Web Services no-reply-aws@amazon.com Tue, Oct 16, 1:41 PM (3 days ago) to czi-aws-admins+hca@chanzuckerberg.com, bhannafi@ucsc.edu, tsmith12@ucsc.edu, me

We have two event driven Lambdas that appear to suffer outages under heavy load.

For dss-index-integration, the event chain is s3->SNS->Lambda. For dss-sync-sfn-integration, the event chain is s3->SNS->SQS->Lambda.
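To make the difference between the two chains concrete, here is a minimal sketch of how a handler could unwrap the S3 notification in each case. It is not the DSS handler code. The function names are hypothetical, and the SQS variant assumes raw message delivery is disabled on the SNS subscription, so the SQS body is the full SNS envelope.

```python
import json
from urllib.parse import unquote_plus


def s3_keys_from_s3_event(s3_event):
    """Return (bucket, key) pairs from a decoded S3 event notification."""
    pairs = []
    # An s3:TestEvent message has no "Records", hence the default.
    for record in s3_event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def s3_keys_from_sns_lambda_event(event):
    """s3->SNS->Lambda: each record carries the S3 event as a JSON string."""
    pairs = []
    for record in event["Records"]:
        s3_event = json.loads(record["Sns"]["Message"])
        pairs.extend(s3_keys_from_s3_event(s3_event))
    return pairs


def s3_keys_from_sqs_lambda_event(event):
    """s3->SNS->SQS->Lambda: the SQS body is the SNS envelope, which in turn
    carries the S3 event as a JSON string (assumes raw delivery is off)."""
    pairs = []
    for record in event["Records"]:
        envelope = json.loads(record["body"])
        s3_event = json.loads(envelope["Message"])
        pairs.extend(s3_keys_from_s3_event(s3_event))
    return pairs
```

Logging the extracted keys at the top of each handler would make it possible to tell an event that was never delivered from one that was delivered but failed later.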

Under scale testing on 15-Oct, as well as during previous scale testing, each of these lambdas has experienced outages. There is no indication of SNS or SQS delivery failure, although there is evidence of unavailability. However, it is our understanding that Lambda should retry per the AWS async Lambda startup documentation, especially in the case of SQS.

A dead letter queue has been configured for each lambda mentioned above. No messages have landed in the dlq.

It is critical that these Lambdas perform reliably, especially dss-sync-sfn-integration. Any guidance is appreciated.

To contact us again about this case, please return to the AWS Support Center using the following URL:

https://console.aws.amazon.com/support/home?#/case/?caseId=5442487371&displayId=5442487371&language=en

(If you will connect by federation, log in before following the link.)

*Please note: this e-mail was sent from an address that cannot accept incoming e-mail. Please use the link above if you need to contact us again about this same issue.

Amazon Web Services, Inc. is an affiliate of Amazon.com, Inc.

Some of the content and links in this email may have been generated by an Amazon customer. Amazon is not responsible for the contents or links within.

no-reply-aws@amazon.com no-reply-aws@amazon.com Wed, Oct 17, 4:00 AM (2 days ago) to czi-aws-admins+hca@chanzuckerberg.com, bhannafi@ucsc.edu, tsmith12@ucsc.edu, me

Hello,

Thank you for contacting AWS Premium Support. I am Abilashkumar and I will be assisting you with this case today.

I understand that you are facing issues while trying to test the Lambda functions for scaling. Also, you would like to know why the Lambda functions are not delivering messages (payloads) to the configured dead letter queue (DLQ), as the DLQ is empty. Please correct me if my understanding of the issue is wrong.

It was mentioned in your correspondence that the Lambda functions were experiencing "outages". Could you please clarify what you mean by "outage"?

I checked the configurations for the Lambda functions "dss-index-integration" and "dss-sync-sfn-integration" at my end using our internal tools and found that the dead letter queue "arn:aws:sqs:us-east-1:861229788715:bhannafi-scale-test-dlq" was configured for both Lambda functions.

To investigate the issue I checked the "Errors" graphs of both the Lambda functions. But, the graphs displayed only one error on the 15th of October for both the Lambda functions "dss-index-integration" [1] and "dss-sync-sfn-integration" [2]. This error seemed to have succeeded during the retries and hence no other errors were observed during the Lambda function execution. As a result, the function does not have a payload which it needs to deliver to the DLQ.

The Throttles and the Concurrency graphs for both the Lambda functions did not display any data points. The CloudWatch "Throttles" graph from 2018-10-14T00:00:00.000Z to 2018-10-17T00:00:00.000Z can be viewed using the links [3] and [4].

Regarding retries: for asynchronous invocations, the Lambda function retries twice when it encounters errors and then delivers the failed event to the dead letter queue (DLQ) that is set up. The DLQ has its own CloudWatch metric, "DeadLetterErrors". Whenever the Lambda function encounters an error while delivering a message to the DLQ, this metric is incremented. Please refer to [5] for additional information.

To verify whether the Lambda functions encountered errors while trying to deliver the payload to the DLQ, I checked the DeadLetterErrors graph for both functions. No errors were displayed in the graph, and no data points were observed in the last two weeks. If errors had persisted after the retries, the Lambda function would have delivered the payload to the DLQ. Since no payload was delivered, the Lambda function did not encounter any such errors.

Hence, it would be more helpful if you could provide us information on what you mean by "outage" (experienced by the lambda function) as the errors graph and the throttles metric do not seem to convey much information.

If the issue you encounter is an error, please do get back to us with the detailed CloudWatch logs of both the Lambda functions.

Additionally, I could notice from our internal tools, that both the Lambda functions have Push Event Sources (API Gateway, S3 event source and the CW Logs event source). The permissions to invoke the Lambda function have been configured such that any account with the above mentioned Push event source will be able to invoke the Lambda functions.

This might cause serious security issues as anyone with Push event sources - API gateway, S3 and CW Logs will be able to invoke the Lambda functions. Also, it might lead the AWS account to incur very high billing charges. Hence, it is recommended to provide restricted permissions to invoke the Lambda functions unless your use-case requires an unrestricted access.

In case of any other queries or concerns, please feel free to reach out to us and we will be glad to assist further.

References:

[1] Errors: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#metricsV2:graph=~%28region~%27us-east-1~metrics~%28~%28~%27AWS*2fLambda~%27Errors~%27FunctionName~%27dss-index-integration~%27Resource~%27dss-index-integration%29%29~period~300~stat~%27Sum~start~%272018-10-14T02*3a44*3a14Z~end~%272018-10-17T02*3a44*3a14Z%29

[2] Errors: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#metricsV2:graph=~%28region~%27us-east-1~metrics~%28~%28~%27AWS*2fLambda~%27Errors~%27FunctionName~%27dss-sync-sfn-integration~%27Resource~%27dss-sync-sfn-integration%29%29~period~300~stat~%27Sum~start~%272018-10-14T02*3a44*3a27Z~end~%272018-10-17T02*3a44*3a27Z%29

[3] Throttles: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#metricsV2:graph=~%28region~%27us-east-1~metrics~%28~%28~%27AWS*2fLambda~%27Throttles~%27FunctionName~%27dss-index-integration%29%29~period~300~stat~%27Sum~start~%272018-10-14T00*3a00*3a00Z~end~%272018-10-17T00*3a00*3a00Z%29

[4] Throttles: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#metricsV2:graph=~%28region~%27us-east-1~metrics~%28~%28~%27AWS*2fLambda~%27Throttles~%27FunctionName~%27dss-sync-sfn-integration~%27Resource~%27dss-sync-sfn-integration%29%29~period~300~stat~%27Sum~start~%272018-10-14T00*3a00*3a00Z~end~%272018-10-17T00*3a00*3a00Z%29

[5] https://aws.amazon.com/blogs/compute/robust-serverless-application-design-with-aws-lambda-dlq

Best regards,

Abilashkumar P. Amazon Web Services


Amazon Web Services no-reply-aws@amazon.com Wed, Oct 17, 10:13 AM (2 days ago) to czi-aws-admins+hca@chanzuckerberg.com, bhannafi@ucsc.edu, tsmith12@ucsc.edu, me

Thank you very much for the detailed investigation!

Our main issue is that "dss-index-integration" and "dss-sync-sfn-integration" are not triggered for all ObjectCreate events on the bucket "org-hca-dss-integration".

The event chain for "dss-index-integration" is ObjectCreate->SNS->Lambda.
The event chain for "dss-sync-sfn-integration" is ObjectCreate->SNS->SQS->Lambda.

For some ObjectCreate events these lambdas are not invoked, and we cannot find any errors indicating why. The dlq was configured to diagnose this issue.


no-reply-aws@amazon.com no-reply-aws@amazon.com Thu, Oct 18, 2:35 AM (1 day ago) to czi-aws-admins+hca@chanzuckerberg.com, bhannafi@ucsc.edu, tsmith12@ucsc.edu, me

Hi,

Thank you for the reply. My name is Panashe and I will be the Support Engineer taking over your case from my colleague Abilashkumar today.

I understand that your functions "dss-index-integration" and "dss-sync-sfn-integration" failed to trigger for some S3 ObjectCreate events from the bucket "org-hca-dss-integration" while you were scale testing your architecture on the 15th of October.

Looking into your environment I can see that you have the following resources configured :

  1. S3 bucket : "org-hca-dss-integration"
  2. Notification configuration ID : ZTczMjZjNzItZThiNi00ZGVkLWJiMTYtZGRlM2MwODM3MTM4
  3. SNS topic : arn:aws:sns:us-east-1:861229788715:domovoi-s3-bucket-events-org-hca-dss-integration
  4. Function : dss-index-integration subscribed to topic domovoi-s3-bucket-events-org-hca-dss-integration
  5. SQS queue : arn:aws:sqs:us-east-1:861229788715:domovoi-s3-bucket-events-org-hca-dss-integration-dss-sync-sfn-integration
  6. Function : dss-sync-sfn-integration

With the following event chains :

  1. S3 ======> SNS =====> Lambda
  2. S3 =======> SNS ======> SQS ======> Lambda

After closely investigating your environment, I can see that your current configuration is set up correctly, and looking at your Lambda metrics I cannot see any evidence of failed invocations that can be directly related to the published messages. Looking further into your environment using my internal tools, I can see that your SNS topic domovoi-s3-bucket-events-org-hca-dss-integration published 739 messages at 2018-10-15T04:35:00:00z UTC [1]. However, there were 18 696 Lambda invocations of function "dss-index-integration" [2] and 2 912 invocations of function "dss-sync-sfn-integration" [3] at 2018-10-15T04:36:00:00z. From these invocation counts alone I cannot accurately determine the number of missing invocations associated with the ObjectCreate events specifically, so I will need more information.

To aid me in troubleshooting this issue further, could you please provide me with the request IDs and extended request IDs for the CreateObject events that failed to trigger the functions during your scale test period on the 15th of October?

This will enable us to determine how many CreateObject events were successful and unsuccessful, and to work backwards to determine the cause of the failed invocations. If you used the AWS console to create the objects, you can provide me with a HAR file capture, which will contain the request IDs needed for us to investigate further [4].
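For uploads made through boto3 rather than the console, the two IDs requested above are available on each S3 response. The helper below is a minimal sketch with a hypothetical name, `s3_request_ids`. It assumes the caller passes in the dict returned by a boto3 S3 call such as `put_object`, where `ResponseMetadata` carries `RequestId` and `HostId` (the extended request ID, also sent as the `x-amz-id-2` header).

```python
def s3_request_ids(response):
    """Pull the request ID and extended request ID out of a boto3 S3 response.

    `response` is the dict returned by an S3 client call. Falls back to the
    raw HTTP headers if the top-level fields are absent; returns None for
    anything that cannot be found.
    """
    meta = response.get("ResponseMetadata", {})
    headers = meta.get("HTTPHeaders", {})
    return {
        "request_id": meta.get("RequestId") or headers.get("x-amz-request-id"),
        "extended_request_id": meta.get("HostId") or headers.get("x-amz-id-2"),
    }


if __name__ == "__main__":
    # Placeholder response shaped like boto3 output; the IDs are made up.
    sample = {
        "ResponseMetadata": {
            "RequestId": "EXAMPLEREQUESTID",
            "HostId": "EXAMPLEEXTENDEDREQUESTID",
            "HTTPStatusCode": 200,
        }
    }
    print(s3_request_ids(sample))
```

Logging these values next to each uploaded key during a scale test would let the IDs for any object that failed to trigger a function be looked up afterwards.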

Given the requested information, I hope that we will be able to resolve this issue as soon as possible. If there is any further assistance or information that you require, please do not hesitate to reach out. I will be more than happy to help.

Please see the links below for more information:

[1] SNS Topic Number of Published messages : https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#metrics:graph=~%28metrics~%28~%28~'AWS*2fSNS~'NumberOfMessagesPublished~'TopicName~'domovoi-s3-bucket-events-org-hca-dss-integration%29%29~period~60~stat~'Sum~start~'2018-10-14T05*3a59*3a13Z~end~'2018-10-15T05*3a59*3a13Z~yAxis~%28left~null~right~null%29~region~'us-east-1%29

[2] Function : dss-index-integration: Invocations at 04:36 UTC : https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#metrics:graph=~%28metrics~%28~%28~'AWS*2fLambda~'Invocations~'FunctionName~'dss-index-integration%29~%28~'AWS*2fLambda~'Invocations~'FunctionName~'dss-index-integration~'Resource~'dss-index-integration%29%29~period~900~stat~'Sum~start~'2018-10-11T05*3a21*3a39Z~end~'2018-10-18T05*3a21*3a39Z~yAxis~%28left~null~right~null%29~region~'us-east-1%29

[3] Function : dss-sync-sfn-integration: Invocations at 04:36 UTC : https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#metrics:graph=~%28metrics~%28~%28~'AWS*2fLambda~'Invocations~'FunctionName~'dss-sync-sfn-integration%29%29~period~300~stat~'Sum~start~'2018-10-15T00*3a21*3a51Z~end~'2018-10-16T05*3a21*3a51Z~yAxis~%28left~null~right~null%29~region~'us-east-1%29

[4] How to generate a HAR file : https://community.box.com/t5/Managing-Content-Troubleshooting/How-to-Generate-a-HAR-File-in-Chrome-IE-Firefox-and-Safari/ta-p/366

I look forward to your reply,

Best regards,

Panashe M. Amazon Web Services

Bento007 commented 5 years ago

The issue has not occurred in the last 2 weeks. Closing.