awslabs / aws-saas-boost

AWS SaaS Boost is a ready-to-use toolset that removes the complexity of successfully running SaaS workloads in the AWS cloud.
Apache License 2.0
961 stars 189 forks

All of a sudden, some of my active tenants appear to be `Failed`, even though they are active and working properly. #384

Closed cfregly closed 1 year ago

cfregly commented 2 years ago

All of a sudden, some of my active tenants appear to be Failed,

image

even though they are active and working properly.

image image image

CloudFormation shows no issues - and the tenant has been working for 1-2 months.

image

This happened to me a few months ago, but I ignored it and rebuilt the environment.

I am in this state currently. I will preserve the environment to help gather more info. Just let me know what you need and I'll copy/paste here.

The state of the system seems to be messed up. Not sure how things got this way.

Thanks!

muylucir commented 2 years ago

Have you checked this log?

CloudWatch > Log groups > /aws/lambda/sb-\-tenants-events

2022-10-21 05:41:37.466 7f66cf87-2ee9-4477-986f-be1892eb906a INFO  TenantService - Handling Tenant Onboarding Status Changed
2022-10-21 05:41:37.466 7f66cf87-2ee9-4477-986f-be1892eb906a INFO  TenantServiceDAL - TenantServiceDAL::getTenant ab59241d-c212-4543-ae6a-791f84e80d04
2022-10-21 05:41:37.549 7f66cf87-2ee9-4477-986f-be1892eb906a INFO  TenantServiceDAL - TenantServiceDAL::getTenant exec 83
2022-10-21 05:41:37.549 7f66cf87-2ee9-4477-986f-be1892eb906a INFO  TenantService - Updating tenant ab59241d-c212-4543-ae6a-791f84e80d04 onboarding status from deploying to deployed
2022-10-21 05:41:37.549 7f66cf87-2ee9-4477-986f-be1892eb906a INFO  TenantServiceDAL - TenantServiceDAL::updateTenantOnboarding ab59241d-c212-4543-ae6a-791f84e80d04 deployed
2022-10-21 05:41:37.570 7f66cf87-2ee9-4477-986f-be1892eb906a INFO  TenantServiceDAL - TenantServiceDAL::updateTenantOnboarding exec 21
END RequestId: 7f66cf87-2ee9-4477-986f-be1892eb906a
REPORT RequestId: 7f66cf87-2ee9-4477-986f-be1892eb906a  Duration: 125.05 ms Billed Duration: 126 ms Memory Size: 512 MB Max Memory Used: 171 MB

Whenever a tenant's onboarding status changes, it is recorded in this log. If you haven't checked it yet, it's worth a look.
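If it helps, here is a rough Python sketch for scanning an exported copy of that log for onboarding status transitions. It only assumes the line format shown in the excerpt above. The names `find_status_changes` and `transitions_to` are made up for this example, and the exact status string written when a tenant fails (`failed` here) is an assumption, not something confirmed in this thread.

```python
import re
from typing import Iterable, List, NamedTuple

# Matches the "Updating tenant <id> onboarding status from X to Y" line
# from the log excerpt above. Other TenantService lines are ignored.
_STATUS_CHANGE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<request_id>[0-9a-f-]{36})\s+INFO\s+TenantService - "
    r"Updating tenant (?P<tenant_id>[0-9a-f-]{36}) onboarding status "
    r"from (?P<old>\S+) to (?P<new>\S+)\s*$"
)


class StatusChange(NamedTuple):
    timestamp: str
    tenant_id: str
    old: str
    new: str


def find_status_changes(lines: Iterable[str]) -> List[StatusChange]:
    """Return every onboarding status transition found in the given log lines."""
    changes = []
    for line in lines:
        m = _STATUS_CHANGE.match(line.strip())
        if m:
            changes.append(
                StatusChange(m["timestamp"], m["tenant_id"], m["old"], m["new"])
            )
    return changes


def transitions_to(changes: List[StatusChange], status: str) -> List[StatusChange]:
    """Keep only transitions whose new status equals `status` (case-insensitive)."""
    return [c for c in changes if c.new.lower() == status.lower()]


# Two lines copied from the excerpt above, used as sample input.
SAMPLE_LOG = """
2022-10-21 05:41:37.466 7f66cf87-2ee9-4477-986f-be1892eb906a INFO  TenantService - Handling Tenant Onboarding Status Changed
2022-10-21 05:41:37.549 7f66cf87-2ee9-4477-986f-be1892eb906a INFO  TenantService - Updating tenant ab59241d-c212-4543-ae6a-791f84e80d04 onboarding status from deploying to deployed
"""

if __name__ == "__main__":
    for change in find_status_changes(SAMPLE_LOG.splitlines()):
        print(change.timestamp, change.tenant_id, change.old, "->", change.new)
```

Running `transitions_to(changes, "failed")` over the full exported log should show when, and from which previous status, each affected tenant was moved to a failed state, assuming that is the string the service logs.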

PoeppingT commented 1 year ago

Hey @cfregly, did you get the chance to investigate the reason why your tenants moved to failed?

cfregly commented 1 year ago

I haven't found anything specific, no. I did notice that ECS is showing "In Progress..." even though it's stable. I'm kicking ECS again with a fresh Docker build. Maybe that will fix the status in the SaaS Boost UI.

brtrvn commented 1 year ago

When you say ECS is showing "In Progress...", what do you mean? That the Service(s) status has changed, or that the Task status under the service has changed? This may be due to your tasks flapping. Are you sure tasks aren't shutting down and being relaunched by ECS? If you look at the tasks for a service, do you only see 2 (the original task def that CloudFormation created and the task def that replaced it when initial workload deployment happened) or do you see many? Are you pushing new images to the application's ECR repo? Does every CodePipeline succeed? If you look at the pipeline history for your tenants, do you see any failures?
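A couple of these checks can be scripted. The sketch below is only a starting point: it takes already-constructed boto3 clients (for example `boto3.client("ecs")` and `boto3.client("codepipeline")`) and calls the standard `list_tasks` and `list_pipeline_executions` operations. The function names are made up for this example, the cluster, service, and pipeline names are placeholders you would need to look up for your tenant, and counting recently stopped tasks is just one rough proxy for flapping, not how SaaS Boost itself decides tenant status.

```python
from typing import Any, Dict, List


def count_stopped_tasks(ecs_client: Any, cluster: str, service: str) -> int:
    """Count recently stopped tasks for a service.

    ECS only keeps stopped tasks visible for a limited time, so a large
    number here suggests tasks are being stopped and relaunched repeatedly.
    Only the first page of results is read.
    """
    resp = ecs_client.list_tasks(
        cluster=cluster, serviceName=service, desiredStatus="STOPPED"
    )
    return len(resp.get("taskArns", []))


def failed_pipeline_executions(
    codepipeline_client: Any, pipeline_name: str
) -> List[Dict[str, Any]]:
    """Return execution summaries with status 'Failed' from the first page of history."""
    resp = codepipeline_client.list_pipeline_executions(pipelineName=pipeline_name)
    return [
        e
        for e in resp.get("pipelineExecutionSummaries", [])
        if e.get("status") == "Failed"
    ]
```

A non-zero result from either function would be worth posting here along with the matching timestamps.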

brtrvn commented 1 year ago

Closing due to inactivity. Please reopen if the problem is reproducible.