dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

UseStaticClustering() hangs when connecting to remote cluster on AKS #6420

Closed panyang1217 closed 3 years ago

panyang1217 commented 4 years ago

I deployed my silo on AKS and exposed a public endpoint (ex: 20.X.X.X:33001) to the silo's gateway port (33001) with a k8s service.

Then my client tries to connect to that cluster with that public endpoint, but the client hangs when calling the client.Connect() method.
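A minimal client-side sketch of this setup (assuming Orleans 3.x and static clustering against the public endpoint; the cluster/service IDs and the address are placeholders, since the original client code was not posted):

// Sketch only: connects a client to a single public gateway endpoint via static clustering.
using System.Net;
using System.Threading.Tasks;
using Orleans;
using Orleans.Configuration;
using Orleans.Hosting;

public static class ClientSketch
{
    public static async Task<IClusterClient> ConnectAsync()
    {
        var client = new ClientBuilder()
            .Configure<ClusterOptions>(options =>
            {
                options.ClusterId = "dev";        // placeholder
                options.ServiceId = "MyService";  // placeholder
            })
            // Public endpoint exposed by the k8s LoadBalancer service (20.X.X.X:33001 in the report).
            .UseStaticClustering(new IPEndPoint(IPAddress.Parse("20.0.0.1"), 33001))
            .Build();

        await client.Connect(); // this is the call that hangs
        return client;
    }
}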

On the silo I found the logs below:

info: Microsoft.Orleans.Networking[0]
      Connection Local: 10.244.0.237:57994, Remote: 20.X.X.X:33001, ConnectionId: 0HLUCRD4PJMLU established with S20.X.X.X:33001:0
[18:39:07 INF] Starting to process messages from remote endpoint 20.X.X.X:33001 to local endpoint 10.244.0.237:57994
[18:39:07 INF] Starting to process messages from local endpoint 10.244.0.237:57994 to remote endpoint 20.X.X.X:33001
[18:39:07 INF] Closing connection with remote endpoint 10.244.0.1:57994. Exception: System.InvalidOperationException: Unexpected direct silo connection on proxy endpoint from S10.244.0.237:33000:322410647
   at Orleans.Runtime.Messaging.GatewayInboundConnection.RunInternal()
   at Orleans.Runtime.Messaging.Connection.Run()
System.InvalidOperationException: Unexpected direct silo connection on proxy endpoint from S10.244.0.237:33000:322410647
   at Orleans.Runtime.Messaging.GatewayInboundConnection.RunInternal()
   at Orleans.Runtime.Messaging.Connection.Run()

Does anyone have an idea about this exception?

ReubenBond commented 4 years ago

Could you provide more details about your setup? E.g., silo and gateway ports for each silo in your cluster.

When you say "exposed a public endpoint", what does that mean? You should not expose your silo to the Internet directly - those ports are for internal access only.

13346437240 commented 4 years ago

@ReubenBond Thanks for your reply.

For some reasons, I want to use a client on a local site to connect to the cluster, so I created a Kubernetes service of type LoadBalancer to expose a gateway endpoint with a public IP, as I said above (20.X.X.X:33001).

This is my service config in AKS:

Name:                     silopddtestcafb
Namespace:                silopddtestcafb
Labels:                   app=silopddtestcafb
Annotations:              azure-pipelines/jobName: "Agent phase"
                          azure-pipelines/org: https://dev.azure.com/n2ib/
                          azure-pipelines/pipeline: "n2ib-pddtest-aks - CD"
                          azure-pipelines/pipelineId: "1"
                          azure-pipelines/project: silo-pddtest
                          azure-pipelines/run: 85
                          azure-pipelines/runuri: https://dev.azure.com/n2ib/silo-pddtest/_releaseProgress?releaseId=85
                          kubectl.kubernetes.io/last-applied-configuration:
                            {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"silopddtestcafb"},"name":"silopddtestcafb","namespace":"...
Selector:                 app=silopddtestcafb
Type:                     LoadBalancer
IP:                       10.0.98.140
LoadBalancer Ingress:     20.X.X.X
Port:                     gateway  33001/TCP
TargetPort:               33001/TCP
NodePort:                 gateway  31195/TCP
Endpoints:                10.244.0.252:33001
Session Affinity:         None
External Traffic Policy:  Cluster

my silo pod config:

Name:           silopddtestcafb-7d74b48887-prwqj
Namespace:      silopddtestcafb
Priority:       0
Node:           aks-agentpool-30862896-vmss000000/10.240.0.4
Start Time:     Sat, 21 Mar 2020 11:33:25 +0800
Labels:         app=silopddtestcafb
                pod-template-hash=7d74b48887
Annotations:    azure-pipelines/jobName: "Agent phase"
Status:         Running
IP:             10.244.0.252
IPs:            <none>
Controlled By:  ReplicaSet/silopddtestcafb-7d74b48887
Containers:
  silopddtestcafb:
    Container ID:   docker://5432f855c5c6d8b6cd1b5eb52128657a6774ded50ea60fdd186239ed4a1bbb0f
    Image:          xxxxxxx.azurecr.io/xxxxxxxxx:88
    Image ID:       docker-pullable://xxxxxxx.azurecr.io/xxxxxxxxxxx@sha256:d6fafa77b6b6dad39465f0e587ca9a9de325045a1b9fe79997bc175211a484d7
    Ports:          33000/TCP, 33001/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Sat, 21 Mar 2020 11:33:29 +0800
    Ready:          True
    Restart Count:  0

KevinCathcart commented 4 years ago

Ok. Exposing a port from kubernetes for access from a specific network is fine. You should make sure you use the other azure network features to make sure only the specific network can access that port.

However, it is not fully clear what your cluster setup looks like. I can tell from the logs, though, that it is somehow messed up. What I can see from those logs is that the silo at 10.244.0.237:33000 tried to connect to 20.X.X.X:33001 as though that were the silo port of another silo in the cluster. The load balancer directed that back to the same silo, but 33001 is the gateway port, not the silo-to-silo communication port, so the silo's connection to itself was rejected because it was connecting on the wrong port.

This means something is really messed up with the silo-side cluster configuration, as a silo should never try to connect to a gateway port as if it were the normal silo-to-silo port, nor should it try to connect to itself. Either your cluster configuration is seriously wrong, or bad records left in the membership storage by a previous misconfiguration are causing the problem.

Can you tell us which membership provider you are using on the silo, and what options you have configured?

13346437240 commented 4 years ago

I use Azure Storage membership; this is my config for the silo membership and endpoints:

// SiloPort: 33000
// GatewayPort: 33001
builder.ConfigureEndpoints(int.Parse(cfg["EndpointOptions:SiloPort"]), int.Parse(cfg["EndpointOptions:GatewayPort"]), listenOnAnyHostAddress: true);
builder.UseAzureStorageClustering(options => options.ConnectionString = cfg.GetConnectionString(clusterConStr));

I cleared the membership table, stopped all frontends, and started just 1 silo in the cluster. Then I used only the client to connect to the cluster, and that exception was still there. Below is the log since silo startup; I hope this is clear for you.

In the log, I replaced the public IP address with "20.X.X.X".

In this log, I also found that the silo's gateway endpoint always tried to connect to the network gateway 10.244.0.1, and got an exception there too.

silo.log

KevinCathcart commented 4 years ago

That configuration looks right.

Ah, I see what is going on. There is one case in which the exact address that a client used to make the connection actually matters (when the client is sending a message to a SystemTarget on the gateway it is connected to).

(Note: I know the following will probably not make much sense to you. I am describing it for the benefit of the project contributors.)

If the client does not otherwise specify a TargetSilo, the ClientOutboundConnection will set it to the gateway's address, including the gateway port.

The gateway itself will ignore the value of TargetSilo that the client set, in all cases except one. If the message is for a SystemTarget, then the gateway needs to decide whether the client set the address explicitly (in which case it is an internal silo address with silo port, etc.), or whether it was the auto-set address [0]. To detect an auto-set address, the gateway compares it to what it thinks its own address is.

If the client and silo disagree about the IP address of the gateway, the gateway will not detect an auto-set TargetSilo as auto-set, and will try to open a connection to it. Since the auto-set value uses the gateway IP address, the result is the silo connecting back to itself on its own gateway port. That is exactly what we are seeing here.


The question now is how to avoid this for the case of a cluster inside Kubernetes and a client outside of it. By nature, this means the IP address of the gateway as used by the client will not match the IP address used by the silos to connect to each other.

It is possible to configure the client to use the internal Kubernetes address as the gateway address while using a different one as the actual address for creating a TCP connection (feasible via connection middleware), but that is really the wrong way around. The client should not need to know the internal Kubernetes address of the gateway.

Unfortunately, on the silo side, the code assumes that the same advertised IP address should be used for both the silo and gateway connections.

I think the simplest Orleans-side fix would be to just not auto-set TargetSilo in ClientOutboundConnection, since the only time the gateway even looks at that value is for SystemTargets, and it goes out of its way to try to ignore the auto-set value. For non-SystemTargets (and non-client-to-client messages), it nulls the TargetSilo and passes the message to the dispatcher to let it sort things out.

With that change (unless I am missing something), the silo never even learns what address the client thinks its gateway has, and we avoid the whole mess.

benjaminpetit commented 4 years ago

Very good analysis @KevinCathcart. Unfortunately, the client has to be able to contact a SystemTarget that lives on a specific silo for some scenarios (like streaming), so there is no easy fix.

IMO, having a load balancer between Orleans clients and gateways is not a good idea. Why do you need one in this scenario?

The clustering providers (Azure, etc.) already have some fallback and round-robin mechanisms. The client is able to see if a silo is down and will try to reroute pending messages to other live gateways. Depending on the load balancer you use, it can cause more harm here, since the client will only see one connection.
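For illustration, a client that uses the same clustering provider as the silos discovers the gateway list itself and can fail over between gateways without any load balancer. A minimal sketch, assuming the Microsoft.Orleans.Clustering.AzureStorage package and that the client can reach the silos' advertised addresses directly (exact option names may vary by version):

// Sketch only: the client reads the gateway list from the same Azure Table used for silo membership.
using System.Threading.Tasks;
using Orleans;
using Orleans.Configuration;
using Orleans.Hosting;

public static class AzureClusteringClientSketch
{
    public static async Task<IClusterClient> ConnectAsync(string connectionString)
    {
        var client = new ClientBuilder()
            .Configure<ClusterOptions>(options =>
            {
                options.ClusterId = "dev";        // must match the silo configuration
                options.ServiceId = "MyService";  // must match the silo configuration
            })
            // The client discovers all live gateways from the membership table and
            // reroutes pending messages if one of them goes down.
            .UseAzureStorageClustering(options => options.ConnectionString = connectionString)
            .Build();

        await client.Connect();
        return client;
    }
}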

Can clients and silo live in the same subnetwork? If not, can you at least have some peering between their subnets? I know this is possible in Azure.

KevinCathcart commented 4 years ago

The original Kubernetes design intent is that pod IP addresses are not routable outside of the Kubernetes cluster. All of the built-in ways of exposing a service in Kubernetes to the outside world will result in the service externally appearing with a different IP address than internally. While I know at least one of the third-party Kubernetes network implementations does support exposing the internal subnets by BGP peering with your external router, and some clouds support using their native SDN stacks as Kubernetes networking providers, I do not believe that sort of functionality is consistently available in all hosted Kubernetes offerings. (Azure does support using its own SDN stack inside Kubernetes, but it is more complicated than the default kubenet networking, and the original poster would still need to set up a point-to-site VPN on the client machines, or a site-to-site VPN connection to Azure and peer with his corporate network, or set up ExpressRoute and peer with his corporate network.)


The “hard” solution is to update the clustering system to allow advertising a separate silo IP and gateway IP. This would also require updating the configuration system to allow specifying both. Amusingly, it is already possible to specify different listen endpoints for the silo and gateway, just not separate advertised addresses.
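For context, a sketch of what EndpointOptions can already express today (the addresses and ports below are illustrative, not taken from this issue): listening endpoints can differ from the advertised address, but there is only one advertised address covering both silo-to-silo and gateway traffic.

// Illustrative only: current EndpointOptions surface, with placeholder values.
using System.Net;
using Orleans.Configuration;
using Orleans.Hosting;

public static class EndpointOptionsSketch
{
    public static void Configure(ISiloBuilder siloBuilder)
    {
        siloBuilder.Configure<EndpointOptions>(options =>
        {
            // One advertised address, used for BOTH silo-to-silo and gateway connections.
            options.AdvertisedIPAddress = IPAddress.Parse("10.244.0.252");
            options.SiloPort = 33000;
            options.GatewayPort = 33001;

            // Listening endpoints can already differ from the advertised address...
            options.SiloListeningEndpoint = new IPEndPoint(IPAddress.Any, 33000);
            options.GatewayListeningEndpoint = new IPEndPoint(IPAddress.Any, 33001);

            // ...but there is no separate "advertised gateway address" property;
            // adding one is the "hard" solution described above.
        });
    }
}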


But I’m not convinced that there is no easy solution. Sure, we need to maintain the ability to send messages from the client to a specific SystemTarget, but as long as the client-to-gateway message uses the “internal” silo endpoint, that will work fine. The gateway’s special SystemTarget routing rule comment even calls that out as being needed for SystemTargets that are not on a gateway. And despite looking, I could not find any code that creates a SystemTarget grain reference on the client using the gateway address. If these streaming cases use grain references that originated inside a silo, then those would use the internal silo endpoint and will work no matter what gateway they are sent to. If they really are creating SystemTarget grain references using the gateway endpoint, then yeah, my proposed easy fix would not work.

The only two cases I could find in the code where a gateway endpoint (rather than an internal silo endpoint) would get set as the TargetSilo for the client-to-gateway message are:

  1. Response messages created on the client. This appears to be designed to route responses back to the same gateway as the request. I think this is just an optimization. In any event, if the client does not think this address is actually a gateway endpoint, it will pick a gateway it is connected to at random.
  2. At pretty much the last possible second, just before serializing and writing the message to the socket, ClientOutboundConnection will check if the TargetSilo is still null and force it to be the gateway’s endpoint. This is always useless. If the message is not destined for a SystemTarget, the gateway will always ignore this value and let the dispatcher calculate the silo (or grab the correct target for those rare client-to-client messages). If the message is destined for a SystemTarget, the code explicitly tries to check for exactly this default value (or localhost with the gateway port) and will treat it the same as if it were null (at least once one little check is added to avoid a null reference exception).

benjaminpetit commented 4 years ago

Azure does support using its own SDN stack inside Kubernetes, but it is more complicated than the default kubenet networking, and the original poster would still need to set up a point-to-site VPN on the client machines, or a site-to-site VPN connection to Azure and peer with his corporate network, or set up ExpressRoute and peer with his corporate network.

Even if we make the change needed for this scenario to work, I would strongly suggest setting up such infrastructure to hit an internal LB. Gateway/silo ports are NOT supposed to be exposed to the world, even with TLS and co.

The “hard” solution is to update the clustering system to allow advertising a separate silo IP and gateway IP. This would also require updating the configuration system to allow specifying both. Amusingly, it is already possible to specify different listen endpoints for the silo and gateway, just not separate advertised addresses.

This change would break backward compatibility and complicate clustering even more, but it would be the cleaner solution.

ClientOutboundConnection will check if the TargetSilo is still null and force it to be the gateway’s endpoint. This is always useless. If the message is not destined for a SystemTarget,

Yes, I suspect it changed when we switched to the new networking stack; it doesn't seem to be really used anymore.


But still, my point stands: you should not expose Orleans ports to the internet.

I suggest changing the architecture so the Orleans clients can be hosted in the same cluster, and expose some sort of API (HTTP, ...) that is published via the public LB. On this API you can set up authentication, encryption, filtering, etc. You don't have to deploy another container for that; you can use the hosted client: see the sample https://github.com/dotnet/orleans/tree/master/Samples/3.0/AspNetCoreCohosting

Separation between services will be better too.
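A minimal co-hosting sketch along the lines of that sample (assumptions: Orleans 3.x, localhost clustering for brevity, and a hypothetical IHelloGrain interface). Only the HTTP endpoint would be published through the public LB; the Orleans silo and gateway ports stay internal:

// Sketch only: silo and web server in one process; grains are reached via the hosted client.
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Orleans;
using Orleans.Hosting;

public interface IHelloGrain : IGrainWithIntegerKey   // hypothetical grain interface
{
    Task<string> SayHello(string name);
}

public static class CohostedProgram
{
    public static Task Main(string[] args) =>
        Host.CreateDefaultBuilder(args)
            // The silo runs in the same process as the web server, so no Orleans
            // port needs to be reachable from outside the cluster.
            .UseOrleans(silo => silo.UseLocalhostClustering())
            .ConfigureWebHostDefaults(web =>
            {
                web.ConfigureServices(services => services.AddRouting());
                web.Configure(app =>
                {
                    app.UseRouting();
                    app.UseEndpoints(endpoints =>
                    {
                        // Only this HTTP endpoint is exposed via the public LB.
                        endpoints.MapGet("/hello/{name}", async context =>
                        {
                            var grains = context.RequestServices.GetRequiredService<IGrainFactory>();
                            var grain = grains.GetGrain<IHelloGrain>(0);
                            var name = (string)context.Request.RouteValues["name"];
                            await context.Response.WriteAsync(await grain.SayHello(name));
                        });
                    });
                });
            })
            .Build()
            .RunAsync();
}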

Static clustering was not developed for usage with an LB. I am not against modifying it to make more scenarios work, but I don't want to encourage bad usage either. And it's very likely that you will hit several other issues down the road.

KevinCathcart commented 4 years ago

I certainly am not advocating for using a load balancer at all, but being able to have clients connect through what amounts to a static NAT setup is certainly reasonable; that is what the Kubernetes defaults for exposing services to the outside world really amount to, and it is what many people would be stuck with in an on-premise Kubernetes cluster, at least if they want to use a client outside of k8s. (I won’t defend the original poster’s proposed setup of exposing the gateway to the public internet through a load balancer.)

I further don’t advocate exposing Orleans over the public internet. I did clearly state earlier: “You should make sure you use the other azure network features to make sure only the specific network can access that port.” By that I meant using the VPN features, ExpressRoute peering, or whatever is applicable. At the time I did not want to stop and verify the specific Azure features available/applicable.


And yeah, obviously adding a new property to the clustering provider is a breaking change, although it need not necessarily be breaking for all clustering providers. Ones that use a rigid storage schema would inevitably break, but those that use a schemaless store might not. A new property/column for gateway advertised address could be optional, falling back to the existing advertised address property if not set. Of course breaking only some providers is still a breaking change, so this would probably need to wait for 4.0 if implemented at all.

While it would technically make clustering more complicated, I don’t think it would be by terribly much. Obviously the client would look at the new property when determining the gateway address. However, almost everything silo-side would ignore the new property, since silos usually don’t care about other silos’ gateway addresses. One of the main exceptions would be the geo-partitioned multi-cluster support, which does care. And the code that creates the membership record would care too, I suppose.

Now, is it worth making such a change, or any change? It might not really be worth it. Using connection middleware to remap from the cluster-internal address to a statically NATed external address is easy enough to do. It won’t magically make a load balancer work unless it is balancing only one silo, but I also don’t feel supporting that case is important.

panyang1217 commented 4 years ago

Thanks @KevinCathcart @benjaminpetit for your analysis and solutions. Now I'm trying to use a site-to-site VPN to resolve this problem.

In my case, my Orleans cluster is hosted on AKS in global Azure. At the beginning my front web server was hosted on the same AKS too, which is the normal Orleans deployment model, and it worked fine. But then I needed to deploy the web server in an Azure Web App on another Azure cloud to speed up user access, so I tried to find a low-cost solution for this; unfortunately it's not easy.

  1. I tried to expose Orleans with an HTTP API, but that requires changing a fair amount of code in my web server, and it adds calling overhead for the internal web server (in the same AKS cluster).
  2. Then I tried using a site-to-site VPN to get internal network connectivity, but that involves VPN gateway configuration and virtual network integration on the Web App side; in the end it makes the network configuration more complicated and more costly.

So, as @KevinCathcart said, NAT is enough to expose a limited and expected set of cluster endpoints to the outside, and that can make the client configuration easier and clearer for some cross-cluster and cross-cloud deployments.

Of course, my case is not a common one, and exposing Orleans to the public is also not a good idea.

KevinCathcart commented 4 years ago

Don’t worry too much about the NAT-related talk. That has more to do with other Kubernetes hosts that are not Azure. With Azure, AKS with Azure CNI is available, which is a much better option.

But in order to understand your situation better, can you describe what you mean by "on another azure cloud"? Are you talking about a different region, or do you mean not in the Azure global cloud? If not in the global cloud, are we talking about Azure Stack, or one of the isolated clouds like Azure Government or Azure China?

In any case, if you are running the web server in some form of Azure App Service, then what you really want requires two parts. First, you need to use the AKS Azure CNI feature to make your pods (including the silos) part of an Azure VNet in the global cloud. Second, you need to use App Service VNet integration to let your web server access that VNet. If both of those are done, then you can configure the web server just as if it were still inside the Kubernetes cluster, since your web server would be able to see the silos directly. If you have direct connectivity from client to silo (even if only one-way), you avoid the Orleans limitations in question.

If you are not in the same region, or are in an isolated cloud, you will probably need to use “gateway-required VNet integration”. Because that uses a VPN-style connection over the public internet, I expect it should work from an isolated cloud to the global one. If it does not, then I would contact Azure support, as they should be able to best assist in determining options for VNet integration in unusual scenarios.

Of course even if you get that set up, there is no guarantee that performance would be as good as you want if the silo and web server are not in the same azure region.

benjaminpetit commented 4 years ago

Note that you can enable VNet peering even on VNets that are not in the same region: https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-peering-overview (not sure if it's applicable to Azure Web Apps).

power911 commented 4 years ago

Hi, I have the same issue, with this error:

Closing connection with remote endpoint 10.5.114.69:56092. Exception: System.InvalidOperationException: Unexpected direct silo connection on proxy endpoint from S127.0.0.1:11111:338544871
         at Orleans.Runtime.Messaging.GatewayInboundConnection.RunInternal()
         at Orleans.Runtime.Messaging.Connection.Run()
System.InvalidOperationException: Unexpected direct silo connection on proxy endpoint from S127.0.0.1:11111:338544871
   at Orleans.Runtime.Messaging.GatewayInboundConnection.RunInternal()
   at Orleans.Runtime.Messaging.Connection.Run()

We are trying to connect from Unity3D to a remote silo.

Sile has been started.txt

var builder = new SiloHostBuilder()
                .ConfigureClustering(
                    _serviceProvider.GetService<IOptions<OrleansConfig>>(),
                    Environment.GetEnvironmentVariable("ENVIRONMENT").ToLower()
                )
                .Configure<ClusterOptions>(options =>
                {
                    options.ClusterId = "dev";
                    options.ServiceId = "OrleansBasics";
                })
                .Configure<EndpointOptions>(options =>
                {
                    options.SiloPort = 11111;
                    options.GatewayPort = 30000;
                    options.GatewayListeningEndpoint = new IPEndPoint(IPAddress.Any, 40000);
                    options.SiloListeningEndpoint = new IPEndPoint(IPAddress.Any, 50000);
                })
                .UseDashboard(options => {
                    options.Username = "admin";
                    options.Password = "admin";
                    options.CounterUpdateIntervalMs = 1000;
                })
                .UsePerfCounterEnvironmentStatistics()
                .ConfigureApplicationParts(parts => parts.AddFromAppDomain())
                .ConfigureLogging(logging => logging.AddConsole());

            var host = builder.Build();
            return host;

SebastianStehle commented 4 years ago

Btw: I would always put an API gateway between your clients and your cluster. Perhaps with gRPC or so.

ReubenBond commented 4 years ago

@SebastianStehle is right. You should never expose your cluster directly to the Internet or untrusted clients. Always use an API gateway

zahirtezcan-bugs commented 3 years ago

I have a similar problem with a silo/database cluster within a Docker Swarm and a client outside, for connecting and testing feasibility, performance, etc.

@ReubenBond Are you suggesting an architecture like this?:

             ============================================
            |                                            |
(MyApp)->:80|[WebApi (OrleansClient)] -> (Orleans-Silo)  |
            |                                            |
            |               the cluster                  |
             ============================================

If so, what if I need to squeeze out maximum performance and don't need the web API as an extra level of indirection? I can do that because, within an on-premise setup, I don't need security, as @KevinCathcart suggested.

youlxs commented 3 years ago

@panyang1217 Did you resolve this issue? I have the same problem as you.

panyang1217 commented 3 years ago

As ReubenBond mentioned above:

You should never expose your cluster directly to the Internet or untrusted clients. Always use an API gateway

If you set a public IP for the Orleans cluster and connect your client with this IP over the internet, this is not a proper architecture. Maybe you could use an API gateway such as Orleans.HttpGateway.AspNetCore, or implement an API server yourself and then expose that API server to the internet.

cqgis commented 3 years ago

I have the same issue, using Docker & k8s.

ReubenBond commented 3 years ago

As with #6895, please open a new issue if you are seeing a similar issue, @cqgis