Rolling upgrades not working as expected

Azure / service-fabric-mesh-preview

Service Fabric Mesh is Service Fabric's serverless offering that lets developers deploy containerized applications without managing infrastructure. Service Fabric Mesh, aka project “SeaBreeze”, is currently available in private preview. This repository will be used for tracking bugs/feature requests as GitHub issues and for maintaining the latest documentation.

MIT License


Rolling upgrades not working as expected #304

Open SamirFarhat opened 5 years ago

SamirFarhat commented 5 years ago

Hi all, I deployed a simple Mesh application with 1 service and two replicas.

1- I changed the replica count in my template and deployed.
Expected behavior: no downtime.
What really happened: downtime; my site was down.

2- I have 2 replicas. I changed the image tag I'm using and deployed my template.
Expected behavior: no downtime.
What really happened: downtime; my site was down.

Is this expected?

Thanks

belmaiastar commented 5 years ago

Did you leave everything else as is and only change the replica count and image name? If so, please send us the application resource ID, the region, and the time when you issued the upgrade.

SamirFarhat commented 5 years ago

When I said downtime, I meant downtime for 30 seconds, which is also unexpected. Yes, I changed nothing except the image tag or the replica count. I also noticed that the Gateway was taking a long time during the deployment, as if it were being redeployed.

SamirFarhat commented 5 years ago

ResourceID : /subscriptions/a7321ec3-3919-442b-8a85-3c8580527c41/resourcegroups/test-sfm/providers/Microsoft.ServiceFabricMesh/applications/helloapp

Region: eastasia

Time: between 7 and 8 pm UTC+1 (last deployment at 7:43 pm)

belmaiastar commented 5 years ago

I checked the log: at 18:26:25 UTC the upgrade was completed as a rollout upgrade.

When you mentioned downtime, did the website return notReachable, or was it just delayed?

SamirFarhat commented 5 years ago

I was demoing something to my customer. During the deployment, we hit the browser refresh button and received the browser page that says the site is not accessible, or something like that. I can demo it. So it was not reachable.
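The "not reachable versus just delayed" question can also be answered with a small polling probe run while the upgrade is in progress. The following is a minimal sketch, not part of the original report: the function names (`check_once`, `probe`, `outage_windows`), the 2-second timeout, and the 1-second polling interval are all assumptions chosen for illustration, and the URL passed in would be the public address of your own gateway.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, request succeeded)


def check_once(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, so it is reachable; treat 5xx as a failure.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: not reachable.
        return False


def probe(url: str, duration_s: float = 120.0, interval_s: float = 1.0,
          check: Callable[[str], bool] = check_once) -> List[Sample]:
    """Poll the URL at a fixed interval and record each result."""
    samples: List[Sample] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.time(), check(url)))
        time.sleep(interval_s)
    return samples


def outage_windows(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Collapse consecutive failed samples into (start, end) windows.

    `end` is the timestamp of the first successful sample after the
    failures, or of the last failed sample if the site never recovered.
    """
    windows: List[Tuple[float, float]] = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows
```

Starting `probe(...)` against the site just before issuing the deployment and passing the result to `outage_windows` gives the start and end of each unreachable period, which is easier to share in an issue than a browser error page.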

SamirFarhat commented 5 years ago

I uploaded a video, please see the behavior. https://www.youtube.com/watch?v=nSkuuhl89ws

arturenault commented 5 years ago

Hi Samir, it looks like we have a bug in how we allocate ports for Gateways. I'll fix it and deploy in the next few days.

Thanks!

guibirow commented 5 years ago

I was testing exposing multiple ports via the Gateway for this issue https://github.com/Azure/service-fabric-mesh-preview/issues/315 and noticed the same problem.

When we apply an upgrade to the gateway, the existing routes stop working for a while and requests are not completed.

arturenault commented 5 years ago

What changes did you make to the gateway in this scenario? Are you just adding ports?

This issue was initially raised for application upgrades (which should work without a problem now) but there is some expected downtime when a gateway is changed.

guibirow commented 5 years ago

I've noticed that any change to the services slows down or breaks the gateway, for example scaling a service or upgrading it as mentioned above.

Also, adding new routes to the gateway takes down every service behind it until the update is complete.

I understand that all these events have an impact on the gateway routes, as the services might move around, but I would expect the gateway to be more reliable and to hot reload the routing configuration.

These kinds of events will be very common, and in the worst case only the related service/route should be affected. In scenarios where application updates happen multiple times a day, a gateway that fails on every release would be unacceptable.
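To make the "hot reload" expectation concrete, here is a minimal sketch of one common way to do it: build a new routing table on the side and swap it in with a single reference assignment, so lookups for untouched routes keep working during an update. This is only an illustration of the idea; it does not describe how the Mesh gateway is implemented, and the `RouteTable` class, its methods, and the example addresses are all hypothetical.

```python
import threading
from types import MappingProxyType
from typing import Dict, Mapping, Optional


class RouteTable:
    """Copy-on-write routing table (hypothetical, for illustration only)."""

    def __init__(self, routes: Dict[str, str]) -> None:
        # Readers only ever see an immutable snapshot of the routes.
        self._routes: Mapping[str, str] = MappingProxyType(dict(routes))
        # Serializes writers so concurrent updates do not lose changes.
        self._write_lock = threading.Lock()

    def resolve(self, path: str) -> Optional[str]:
        """Return the backend for the longest matching path prefix."""
        routes = self._routes  # take one snapshot for the whole lookup
        best_prefix = None
        for prefix in routes:
            if path.startswith(prefix):
                if best_prefix is None or len(prefix) > len(best_prefix):
                    best_prefix = prefix
        return routes[best_prefix] if best_prefix is not None else None

    def update_route(self, prefix: str, backend: str) -> None:
        """Add or change one route without disturbing the others."""
        with self._write_lock:
            new_routes = dict(self._routes)  # copy the current table
            new_routes[prefix] = backend     # modify only the copy
            # Single reference assignment: readers see old or new, never a mix.
            self._routes = MappingProxyType(new_routes)
```

With this design, adding a route such as `table.update_route("/api", "10.0.0.2:8080")` never interrupts `table.resolve("/app/index")`, which matches the expectation above that only the related route should be affected.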

SamirFarhat commented 5 years ago

I noticed this on every app update, such as scaling the replica count or changing the code package container image version, as I showed in the video. I will retry and report back.

SamirFarhat commented 5 years ago

I have retested. Scaling from 1 to 2 replicas, for example, causes a downtime of a few minutes. This is unusable in real life.

aloneguid commented 5 years ago

Hence the version "preview". It's a serious issue though.

mattrowmsft commented 5 years ago

We are focusing on getting this fixed.

mattrowmsft commented 5 years ago

Just an update: the fix for this is being worked on, but the holidays are slowing things down a bit. We have testing in place now, so hopefully we can turn something around quickly in January. Just FYI, the expectation is that a single replica would maintain availability even during upgrades (upgrading the container image, for example).

julipur commented 5 years ago

@mattrowmsft This issue is still open and I am still experiencing similar issues. Is there a fix coming? (Your comment mentions January.)