Azure / Azure-Spring-Apps

Azure Spring Cloud
MIT License

Autoscale actions cause application downtime #51

Open bryandx opened 11 months ago

bryandx commented 11 months ago

Describe the bug

After configuring autoscaling for a Spring App, a scale in action makes the application unavailable: all running instances are terminated, and new instance(s) meeting the effective scale in configuration are created. Because all existing instances are terminated, the application is not available to users for a short period of time while the first new instance is starting.

A scale out action also results in all existing instances of the application being terminated, with new instance(s) meeting the effective scale out configuration created in their place. A scale out doesn't result in application downtime because a new instance is started before the existing instance(s) are terminated. However, for applications that store user HttpSession state in memory, end users will need to log in again because their session state is lost when all existing application instances are terminated.

To Reproduce

Steps to reproduce the behavior: create autoscaling for a Spring App deployment using commands similar to these:

az monitor autoscale create --resource /subscriptions/xxxx/resourcegroups/yyyyy/providers/Microsoft.AppPlatform/Spring/zzzzz/apps/instashare-web/deployments/v1-3-0-105 --name instashare-web-autoscale-v1-3-0-105 --min-count 1 --max-count 2 --count 1   

az monitor autoscale profile create -g rg-lzcorpspring-prod-cus-01 --autoscale-name instashare-web-autoscale-v1-3-0-105 -n instashare-web-autoscale-business-hours --count 2 --min-count 2 --max-count 2 --end 20:00 --recurrence week mon tue wed thu fri sat --start 8:00 --timezone "Eastern Standard Time"   

When the start or end time of the instashare-web-autoscale-business-hours profile is reached, observe how the existing application instances are terminated and new ones are started.
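For reference, the schedule encoded in the two commands above can be modeled as a small function. This is a minimal sketch in Python, assuming the recurrence window is Monday to Saturday, 8:00 to 20:00 local Eastern time, and that outside the window the default settings (`--count 1`) apply. The name `expected_instance_count` is hypothetical, and the code does not call any Azure API.

```python
from datetime import datetime, time

# Monday=0 ... Saturday=5, mirroring "--recurrence week mon tue wed thu fri sat".
BUSINESS_DAYS = {0, 1, 2, 3, 4, 5}
# Mirrors "--start 8:00" and "--end 20:00"; times are local Eastern time.
START, END = time(8, 0), time(20, 0)


def expected_instance_count(local_now: datetime) -> int:
    """Return the instance count the posted schedule is expected to produce.

    Assumption: the business-hours profile pins the count to 2, and the
    default settings bring it back to 1 outside the window.
    """
    in_window = (
        local_now.weekday() in BUSINESS_DAYS
        and START <= local_now.time() < END
    )
    return 2 if in_window else 1


# A Thursday at noon falls inside the business-hours profile.
print(expected_instance_count(datetime(2023, 7, 20, 12, 0)))
# The same day at 20:00 is the point where a scale in would be expected.
print(expected_instance_count(datetime(2023, 7, 20, 20, 0)))
```

Each transition between the two return values corresponds to one scale out or scale in action of the kind described in this report.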

Expected behavior

When a scale in action occurs, at least one of the existing running application instances should be kept running to avoid application downtime. The desired behavior is to stop only as many existing instances as needed to meet the autoscale settings, rather than terminating all existing instances and then starting new ones that meet those settings.

When a scale out action occurs, all existing running application instances should be left alone, and only the additional instance(s) needed to meet the autoscale settings should be started.
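The expected behavior can be stated precisely as a small planning function. This is a sketch of the behavior requested in this report, not the platform's actual reconciliation logic. The function name `plan_scale` and the instance names are hypothetical.

```python
from typing import List, Tuple


def plan_scale(running: List[str], target: int) -> Tuple[List[str], List[str], int]:
    """Return (keep, stop, start_count) for an in-place scale action.

    Existing instances are kept wherever possible; only the surplus is
    stopped on scale in, and only the shortfall is started on scale out.
    """
    if target < 0:
        raise ValueError("target must be non-negative")
    keep = running[:target]                       # instances that stay up
    stop = running[target:]                       # surplus on scale in
    start_count = max(0, target - len(running))   # shortfall on scale out
    return keep, stop, start_count


# Scale in from 2 to 1: one existing instance keeps serving traffic.
print(plan_scale(["instance-a", "instance-b"], 1))
# Scale out from 1 to 2: the existing instance is untouched, one is added.
print(plan_scale(["instance-a"], 2))
```

Under this model a scale in never leaves zero running instances as long as the target is at least 1, and a scale out never terminates an existing instance.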

Can we contact you for additional details? Y/N Y

Sneezry commented 11 months ago

Hi @bryandx , I cannot reproduce this issue on my side:

[Screenshot: scale-out]

In my test resource, we successfully created a new instance while keeping the existing instance running, resulting in zero downtime for my test app. Were there any other operations performed, such as scaling up CPUs or memory?

bryandx commented 11 months ago

Our autoscale config didn't have any other rules using metrics. The 2 az monitor autoscale commands I provided were the ones I ran to configure autoscaling for the app (with proper subscriptionId and resourceGroups).

The one thing I've wondered about as a possible cause of the termination behavior we're seeing is that we're using these liveness and readiness probes:

liveness probe

{
    "probe": {
        "initialDelaySeconds": 5,
        "periodSeconds": 10,
        "timeoutSeconds": 3,
        "failureThreshold": 3,
        "successThreshold": 1,
        "probeAction": {
            "path": "/Instashare/actuator/health/liveness",
            "scheme": "HTTPS",
            "type": "HTTPGetAction"
        }
    }
}

readiness probe

{
    "probe": {
        "initialDelaySeconds": 5,
        "periodSeconds": 10,
        "timeoutSeconds": 3,
        "failureThreshold": 3,
        "successThreshold": 1,
        "probeAction": {
            "path": "/Instashare/actuator/health/readiness",
            "scheme": "HTTPS",
            "type": "HTTPGetAction"
        }
    }
}

I would hope that's not the issue but wanted to provide this info in case it is related.
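To make the probe settings above concrete, the arithmetic they imply can be worked out. This is a rough sketch that assumes Kubernetes-style probe semantics: the first probe fires after `initialDelaySeconds`, later probes fire every `periodSeconds`, and `failureThreshold` consecutive failures trigger a restart. Real schedulers add some jitter, so treat the numbers as approximate. The function name `liveness_restart_window` is hypothetical.

```python
from typing import Tuple


def liveness_restart_window(initial_delay: int, period: int,
                            timeout: int, failure_threshold: int) -> Tuple[int, int]:
    """Return (earliest, latest) seconds after start at which the last
    failing probe completes, if every probe fails from the beginning.

    Assumption: Kubernetes-style semantics, ignoring scheduling jitter.
    """
    last_probe_start = initial_delay + (failure_threshold - 1) * period
    earliest = last_probe_start           # probe fails at once (e.g. connection refused)
    latest = last_probe_start + timeout   # probe fails by timing out
    return earliest, latest


# Values taken from the liveness probe posted above.
print(liveness_restart_window(initial_delay=5, period=10,
                              timeout=3, failure_threshold=3))
# → (25, 28)
```

Under these assumptions, an instance that does not start answering the liveness path within roughly 25 to 28 seconds of starting would be restarted. Whether that applies here depends on how long the application takes to start, which this thread does not state.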

bryandx commented 11 months ago

@Sneezry , application downtime only occurs on a scale in. The scale out doesn't result in downtime. Was your test for scale in or out?

In my next 2 posts I have screenshots of the sequence of events that we experience during scale in and out.

bryandx commented 11 months ago

@Sneezry I've attached screenshots of the sequence of events during a scale in that was just executed:

  1. Shows 2 instances of the application running before the scale in occurs.
  2. Shows both of those previously running instances being terminated while a new instance is being started. This is where the downtime starts, because there are no running instances.
  3. Shows the new instance still being started while the previously running instances are gone (terminated).
  4. Shows the new instance finally started; the downtime is over.

[Screenshots: step1-appInstances-before-scalein, step2-appInstances-scalein-started, step3-appInstances-scalein-continuing, step4-appInstances-scalein-completed]
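The four steps above amount to a timeline in which the running-instance count drops to zero before the replacement is up. The sketch below computes the resulting downtime window from such a timeline. The timestamps in the example are hypothetical, since the thread gives no durations, and `downtime_windows` is an illustrative name.

```python
from typing import List, Optional, Tuple

# (seconds since start of observation, change in number of running instances)
Event = Tuple[float, int]


def downtime_windows(initial_running: int,
                     events: List[Event]) -> List[Tuple[float, float]]:
    """Return closed intervals during which no instance was running.

    A window that is still open after the last event is not reported.
    """
    windows: List[Tuple[float, float]] = []
    running = initial_running
    down_since: Optional[float] = None
    for t, delta in sorted(events):
        running += delta
        if running <= 0 and down_since is None:
            down_since = t                    # last running instance went away
        elif running > 0 and down_since is not None:
            windows.append((down_since, t))   # first replacement became ready
            down_since = None
    return windows


# Hypothetical timings: both old instances terminate at t=0 and the
# replacement only reaches the running state at t=90.
print(downtime_windows(2, [(0.0, -1), (0.0, -1), (90.0, +1)]))
# → [(0.0, 90.0)]
```

If one of the two original instances were kept until the scale in finished, the count would never reach zero and the function would return an empty list.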

bryandx commented 11 months ago

@Sneezry I've attached screenshots of the sequence of events during a scale out that was just executed:

  1. Shows 1 instance of the application running before the scale out occurs.
  2. Shows a new instance being started.
  3. Shows the new instance from step 2 running, but the existing instance that was running prior to the scale out is being terminated (why terminate it?).
  4. Shows another new instance being started; the new instance from step 2 is now running, and the original running instance from step 1 is gone (termination complete).
  5. Shows the new instance created in step 4 now running.

The end result is 2 running instances and no downtime. But why was the original running instance terminated?

[Screenshots: step4-appInstances-scalein-completed, step2-appinstances-scaleout-started, step3-appinstances-scaleout-continuing, step4-appinstances-scaleout-continuing2, step5-appinstances-scaleout-completed]

Sneezry commented 11 months ago

@bryandx thanks for the detailed information. I understand that your app is terminated unexpectedly during scale in, and that during scale out the existing instance is also terminated even though there is no downtime, which does not make sense. I will try to reproduce this issue on my side based on the information provided and will update you once I have any findings.

Sneezry commented 11 months ago

@bryandx could you check whether you or your team own an AAD app whose ID starts with 993e0e4b and ends with 2732bba36fa6? I see some Deployment Stop operations performed on behalf of that AAD app.

bryandx commented 11 months ago

@Sneezry Yes, that is a Service Principal we use to automate our blue/green deployments, and part of our automated deployment process is to stop the previous deployment. We had a deployment at 2:36pm yesterday (7/20). All times in this post, and on all the screenshots I provided on 7/20, are US Eastern. That deployment was 30 minutes before the scale in and scale out events occurred, so I don't think it's related. I've attached a screenshot of the activity log for the sequence of events related to this issue, where I highlighted the Stop an App Operation that occurred at 2:37pm. You can also see that the Autoscale scale down (in) completed at 3:08pm and the Autoscale scale up (out) occurred at 3:20pm. There wasn't any Stop an App Operation during the scale in or scale out operations.

Are you seeing Deployment Stop operations other than the 2:36pm one on 7/20?

[Screenshot: prodautoscaleactivitylog]

Sneezry commented 11 months ago

Hi @bryandx, you are correct: these are different operations, and the issue is not caused by the stop operation. I need to confirm with the feature owner whether this behavior is expected. Could you create a support ticket in the Azure Portal? I will follow up on the ticket. Please ask the support engineer to create an IcM and assign it to me (my alias is zlhe). Thanks.

Sneezry commented 11 months ago

Hi @bryandx, just to provide you with an update, I wanted to let you know that I am currently working alongside my team on further investigation regarding this matter. Rest assured that I'll keep you posted as soon as we have any significant findings. However, since we haven't been able to reproduce this issue in our test environment, it might take us a bit longer to troubleshoot. Thank you for your understanding and patience.

zhiszhan commented 11 months ago

[like] Zhishou Zhang reacted to your message.

bryandx commented 11 months ago

@Sneezry, the Microsoft support ID you asked for is 2307210040009716

Sneezry commented 11 months ago

@bryandx the support engineer has created an IcM and assigned it to me in the internal system.

Sneezry commented 11 months ago

Hi @bryandx, we currently suspect that this issue is related to the probe. We are investigating in depth, and I will update you promptly once we have further suggestions or follow-up.

Sneezry commented 11 months ago

Hi @bryandx We are currently in the process of implementing a solution to resolve this matter. The fix is expected to be completed and made available by approximately mid-August.

cc @zhiszhan

bryandx commented 11 months ago

@Sneezry Will the solution also address how scale out terminates existing running instance(s) instead of leaving those instance(s) running and simply adding more to meet the scale out requirements?

Sneezry commented 11 months ago

Yes, both scale in and scale out issues will be fixed.