The Orleans/ShoppingCart setup fails when try to deploy on Web App on Both Production and Staging slot

ArunBee84 commented 1 year ago

Issue description: Using the Orleans/ShoppingCart project, I was able to build and deploy successfully to an Azure Web App. To build a new environment on Azure, I used the Bicep templates mentioned here. Everything worked fine up to this point. Then I created a staging slot, copying the production slot configuration, and deployed the same ShoppingCart build. The staging slot silo is stuck in the Joining state and gives the error below.

Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S172.17.0.254:20106:31922292. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 172.17.0.254:20106. Error: AccessDenied
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 54
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 193
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 221
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224

If I open Kudu on the staging slot and try to ping the production silo address, I get the following error:

PS C:\home> tcpping 172.17.0.254:20106
tcpping 172.17.0.254:20106
Connection attempt failed: An attempt was made to access a socket in a way forbidden by its access permissions 172.17.0.254:20106

Anyone got any ideas as to what could be wrong?

IEvangelist commented 1 year ago

I'm not entirely sure, let's ask @bradygaster and @ReubenBond for their thoughts on this.

IEvangelist commented 1 year ago

Seems to have been originally reported here: https://github.com/Azure-Samples/Orleans-Cluster-on-Azure-App-Service/issues/3

IEvangelist commented 1 year ago

Hi @ArunBee84 - thank you for posting this issue, I see that you also posted it on the Azure-Sample repo as well. Sorry for not seeing this sooner. In talking this over with the team, they're suggesting that you try configuring with two different silo names. That way they aren't trying to dial up the same cluster. Does that make sense?

ReubenBond commented 1 year ago

I would guess that your staging slot and production slot have the same ClusterId set, so they are trying to form a single cluster... but they cannot because they have no network connectivity between them.

windperson commented 1 year ago

You also need to set up VNet integration on the staging slot. Creating a deployment slot and copying the app configuration does not set up VNet integration on the staging slot. I've made an example project in which both the production slot and the staging slot have two instances running simultaneously. The Bicep code is here.
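
To illustrate the point above: a slot is a separate resource, so VNet integration has to be declared on it as well. The following is a minimal Bicep sketch, not the code from the linked project. The parameter names (`appName`, `integrationSubnetId`), the slot name `staging`, and the API version are assumptions.

```bicep
// Hypothetical parameters; none of these names come from the thread.
param appName string
param location string = resourceGroup().location

// Resource ID of the subnet used for regional VNet integration (assumed to exist already).
param integrationSubnetId string

// The existing production app that the slot belongs to.
resource app 'Microsoft.Web/sites@2022-03-01' existing = {
  name: appName
}

// The staging slot with its own VNet integration declared explicitly.
resource stagingSlot 'Microsoft.Web/sites/slots@2022-03-01' = {
  parent: app
  name: 'staging'
  location: location
  properties: {
    serverFarmId: app.properties.serverFarmId
    virtualNetworkSubnetId: integrationSubnetId
    siteConfig: {
      // Assumption: two private ports, one for the silo and one for the gateway.
      vnetPrivatePortsCount: 2
    }
  }
}
```

The `vnetPrivatePortsCount` value is an assumption based on a silo needing a silo port and a gateway port; it should match whatever the production app uses.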

ArunBee84 commented 1 year ago

> You also need to set up VNet integration on the staging slot. Creating a deployment slot and copying the app configuration does not set up VNet integration on the staging slot. I've made an example project in which both the production slot and the staging slot have two instances running simultaneously. The Bicep code is here.

Hi @windperson, sorry I forgot to mention this before, but I did verify that the staging silo is connected to the same subnet as the production silo. Here is the screenshot from Azure Storage Table showing the silo instances created.

The silo with status Joining is the staging instance. As you can see, its IP address is the same as the production silo's, but its status remains Joining.

andrethy commented 1 year ago

> Hi @ArunBee84 - thank you for posting this issue, I see that you also posted it on the Azure-Sample repo as well. Sorry for not seeing this sooner. In talking this over with the team, they're suggesting that you try configuring with two different silo names. That way they aren't trying to dial up the same cluster. Does that make sense?

Colleague of @ArunBee84 here

The issue with that is that the staging slot connects to the same Service Bus as the production slot and would therefore activate grains. Because those grains would be in a new cluster, they would be spawned as new grains and would not continue from the same state. This can be circumvented by ensuring that the staging slot doesn't connect to our Service Bus, so that it doesn't activate any grains, but we would like to figure out why the staging silo has no connectivity to the production cluster, as @ReubenBond also mentions.

Do we know if this is expected behavior?

ArunBee84 commented 1 year ago

> I would guess that your staging slot and production slot have the same ClusterId set, so they are trying to form a single cluster... but they cannot because they have no network connectivity between them.

Hi @ReubenBond, both the production and staging slots are connected to the same VNet/subnet, with no NSG rules applied to them.

ArunBee84 commented 1 year ago

Hi guys,

Any update on this issue?

bradygaster commented 1 year ago

Adding @btardif to this discussion to get additional eyes on this issue from the App Service team side. I'm going to deploy a slotted instance of this app to see if I can emulate this environmental setup and replicate the issue. @windperson - your Bicep code - does that represent the entire topology deployed in a slotted instance? If not, I think I'll update this sample to reflect just that, so I'd appreciate any pull requests if you have that Bicep available.

windperson commented 1 year ago

Hi @bradygaster, the Bicep sample is from my Chinese article, which should be re-authored as a printed book in the next few months. It produces a lab resource group containing the Azure resources shown in the following picture: https://github.com/windperson/2022ithome_30days/blob/main/articles/day37/OrleansUrlShortener.svg The Azure App Service contains a production slot and a staging slot, each with two instances running. You can take that as part of the official sample if you like 😄👌

bradygaster commented 1 year ago

In my fork of this repo, I've created a slots branch in which I create a slotted version of the site. I've tweaked the code so that this is the setup:

There exists a default subnet and a staging subnet
Each slot is in its own subnet
Each slot is in its own cluster
The cluster ID is a slot configuration setting, so the production slot always hits production data, and the staging slot always hits staging data
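
The slot-specific cluster ID in the last item can be sketched in Bicep as a "sticky" app setting. This is an illustration under assumptions, not the code from the slots branch: the setting name `ORLEANS_CLUSTER_ID`, the parameter name, the slot name, and the value are hypothetical, and the application would need to read that setting into `ClusterOptions.ClusterId`.

```bicep
// Hypothetical parameter; not taken from the thread.
param appName string

// The existing production app and its staging slot.
resource app 'Microsoft.Web/sites@2022-03-01' existing = {
  name: appName
}

resource stagingSlot 'Microsoft.Web/sites/slots@2022-03-01' existing = {
  parent: app
  name: 'staging'
}

// Mark the (hypothetical) setting as a slot setting so it stays with its slot during a swap.
resource stickySettings 'Microsoft.Web/sites/config@2022-03-01' = {
  parent: app
  name: 'slotConfigNames'
  properties: {
    appSettingNames: [
      'ORLEANS_CLUSTER_ID'
    ]
  }
}

// Give the staging slot its own cluster ID.
// Note: deploying 'appsettings' replaces the slot's whole settings collection,
// so a real template would list the slot's other settings here as well.
resource stagingAppSettings 'Microsoft.Web/sites/slots/config@2022-03-01' = {
  parent: stagingSlot
  name: 'appsettings'
  properties: {
    ORLEANS_CLUSTER_ID: 'staging'
  }
}
```

Because the setting is sticky, a swap moves the build between slots but leaves each slot's cluster ID in place, which matches the behavior described in the list above.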

I know this is somewhat different from the topology @ArunBee84 proposed, but I wanted to see if this would mitigate some of the issues folks have run into in this thread and in the original issue. cc @btardif to see if he has any recommendations on this front, and @IEvangelist as I think it'd be good to create a PR to the main fork of this sample, but that would require some doc updates, so I'd prefer to coordinate those together.

dotnet / samples

The Orleans/ShoppingCart setup fails when try to deploy on Web App on Both Production and Staging slot #5625