dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Reconnect after silo restart #1769

Closed FenrisWolfAtMiddleEarth closed 8 years ago

FenrisWolfAtMiddleEarth commented 8 years ago

It appears that our Orleans Clients do not automatically reconnect after silo(s) have been restarted. Is this as expected? We configure the client programmatically with very simple code like this:

    private static ClientConfiguration ProdConfiguration()
    {
        return new ClientConfiguration
        {
            Gateways = new List<IPEndPoint>
            {
                new IPEndPoint(FindPreferredEndPoint("Server1"), 40000),
                new IPEndPoint(FindPreferredEndPoint("Server2"), 40000)
            },
        };
    }

After the silo(s) have been restarted, our client just keeps getting a "No gateways available" exception. It never recovers from this error. Is there a good pattern for solving this? Should I catch the exception, and then Uninitialize and Initialize the GrainClient again?
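The catch-and-reinitialize pattern the question describes might look roughly like this — a hedged sketch assuming the Orleans 1.x `GrainClient` API (`IsInitialized`, `Uninitialize`, `Initialize`); the retry delay and exception handling are illustrative, not from this thread:

    private static void ReconnectWithRetry(ClientConfiguration config)
    {
        while (true)
        {
            try
            {
                // Tear down any half-dead client state before reconnecting.
                if (GrainClient.IsInitialized)
                {
                    GrainClient.Uninitialize();
                }
                GrainClient.Initialize(config);
                return; // connected again
            }
            catch (Exception ex) // e.g. the "No gateways available" exception
            {
                Console.WriteLine($"Reconnect failed: {ex.Message}; retrying...");
                Thread.Sleep(TimeSpan.FromSeconds(1)); // back off before retrying
            }
        }
    }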

gabikliot commented 8 years ago

If I remember correctly, we do indeed have this bug: the static gateway provider does not automatically reconnect. It's basically because we never use the static gateway in prod, only in tests. All the other gateway provider types, which use a system store, are auto-reconnecting. We should indeed fix that.

FenrisWolfAtMiddleEarth commented 8 years ago

Wow. You fixed it already? I am very much looking forward to your fix being available from NuGet.

FenrisWolfAtMiddleEarth commented 8 years ago

I can see / guess from your code changes that a reconnect requires a poll for silos, and that there is a default poll interval of one minute. Is that correct?

A minute is a very, very long time in a heavy-transaction system. If you have hundreds of transactions per second, then one minute of downtime will cause a LOT of failed transactions.

Do you see any problems in lowering the poll interval to maybe 1 second? Or even lower? If we use a static list of gateways, then there is no access to persistent storage, like SQL Server, so a poll for silos should be cheap.

Also if the silos are shut down gracefully, the socket on the client will be notified that the counterpart has closed the connection. Will that trigger a poll and / or reconnect?

Thanks

gabikliot commented 8 years ago

It indeed polls at that frequency. You will have up to 1 minute of full disconnect ONLY if all silos are restarted at once, which almost never happens in the Cloud. Azure as well as Google Cloud, for example, restart VMs in parts (fault domains in Azure, zones in Google Cloud), never all together (up to some SLA guarantee, of course). So basically "service availability" of silos is provided by the hosting framework.

In your case you are hosting yourself and updating/restarting yourself, so it's up to you to manage that. If you restart all at once, then indeed you will have up to 1 minute of disconnect. But it also depends on how long it takes for silos to restart/reboot. If it takes, let's say, 20 seconds, then you will only have an additional 40 seconds.

You can decrease it to 1 second. Basically, this is why it is configurable. The default setting of 1 minute works well for Azure or other auto-managed hosting infrastructure. If you are building your own, you can and should definitely play with that value.
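Lowering the interval might look like the sketch below, assuming the Orleans 1.x `ClientConfiguration.GatewayListRefreshPeriod` property (verify the property name against your Orleans version; the endpoints are placeholders):

    var config = new ClientConfiguration
    {
        // Poll for gateways every second instead of the 1-minute default.
        GatewayListRefreshPeriod = TimeSpan.FromSeconds(1),
        Gateways = new List<IPEndPoint>
        {
            new IPEndPoint(IPAddress.Loopback, 40000), // your gateway endpoints here
        },
    };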

FenrisWolfAtMiddleEarth commented 8 years ago

So if silos are restarted one silo at a time, there will never be any downtime. Does it need any delay between restarts of silos? I.e., should I wait the poll interval between silo restarts?

E.g., if silo 1 restarts in 5 seconds, and I then immediately restart silo 2, which also takes 5 seconds, will the clients then have a connection to either or both of the silos? Or should I wait the poll interval between restarts?

gabikliot commented 8 years ago

BTW, you are not by any chance using GrainMembershipTable? That option is of course not to be used in prod, only for testing. That is mentioned multiple times in our docs. And if you are not using it, and are instead using any of the other persistent system stores, it's not clear why you need StaticGatewayListProvider. That is also the reason it never needed to be made updatable: there is simply no scenario where it needs to be.

FenrisWolfAtMiddleEarth commented 8 years ago

We are using SQL Server for Orleans persistence in the silo. We are just not accessing it from the client.

gabikliot commented 8 years ago

I am talking about system store for cluster management, not grain state persistence.

FenrisWolfAtMiddleEarth commented 8 years ago

Thanks for your advice, Gabriel. It is much appreciated. Yes, we are using SQL Server for cluster management. I have read the documentation (many times), and I think I have at least some understanding of how your cluster / quorum works. :-)

This is our programmatic setup of our configuration:

    private static ClusterConfiguration ProdConfiguration()
    {
        var config = new ClusterConfiguration
        {
            Defaults =
            {
                HostNameOrIPAddress = "localhost",
                Port = SiloPortno,
                ProxyGatewayEndpoint = new IPEndPoint(IPAddress.Any, ClientPortno),
            },
            Globals =
            {
                LivenessType = GlobalConfiguration.LivenessProviderType.SqlServer,
                ReminderServiceType = GlobalConfiguration.ReminderServiceProviderType.SqlServer,
                DeploymentId = DeploymentId,
                DataConnectionString = "Our SQL connection string",
            }
        };

        return config;
    }

But now, that you have been kind enough to give me good advice, perhaps you can help clear up one thing, which I am still uncertain about.

The DeploymentId I add in ClusterConfiguration.Globals is a node-id and not a cluster-id. Is that correct?

And the name I give in the constructor of the SiloHost is cluster-id, not a node-id. Is that correct?

gabikliot commented 8 years ago

If you are using SQL Server for cluster management, then you should also be using SQL Server for the GatewayListProvider. That means you should NOT need StaticGatewayListProvider at all.
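The suggested client setup might look like this sketch, assuming the Orleans 1.x `ClientConfiguration.GatewayProviderType` enum and that the `SqlServer` member, `DataConnectionString`, and `DeploymentId` apply to your version (the connection string and id are placeholders):

    var config = new ClientConfiguration
    {
        // Discover gateways from the same SQL Server system store the silos
        // use for cluster management, instead of a static gateway list.
        GatewayProvider = ClientConfiguration.GatewayProviderType.SqlServer,
        DataConnectionString = "Your SQL connection string",
        DeploymentId = "SameDeploymentIdAsTheSilos",
    };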

The DeploymentId I add in ClusterConfiguration.Globals is a node-id and not a cluster-id. Is that correct?

No, the other way around. There is no such concept in Orleans as a node-id. You don't give your silos any ids; we do it all automatically. You do give the SiloHost a name for that silo, like Silo1, Silo2, ..., but those can be anything and don't even have to be unique. They are just for you to be able to differentiate between them in the logs.

The DeploymentId is the id you give to this particular group of silos, so they can work together. It is what I think you called cluster-id. It has to be the same for all silos in your deployment.

And the name I give in the constructor of the SiloHost is cluster-id, not a node-id. Is that correct?

No, as explained above, it is the other way around.

FenrisWolfAtMiddleEarth commented 8 years ago

So the DeploymentId in the ClusterConfiguration.Globals should be the same across all silos in the cluster. Then I must be missing something. If I try to do that, then I get quite dramatic log messages from one of my silos:

    19-05-2016 07:32:47,415 ERROR - I have been told I am dead, so this silo will commit suicide! I should be Dead according to membership table (in TryUpdateMyStatusGlobalOnce): myEntry = SiloAddress=S127.0.0.1:22222:201331909 InstanceName= Status=Dead.
    19-05-2016 07:32:47,415 ERROR - INTERNAL FAILURE! About to crash! Fail message is: I have been told I am dead, so this silo will commit suicide! I should be Dead according to membership table (in TryUpdateMyStatusGlobalOnce): myEntry = SiloAddress=S127.0.0.1:22222:201331909 InstanceName= Status=Dead.

There are only two silos in this test setup.

FenrisWolfAtMiddleEarth commented 8 years ago

I don't understand the Orleans clustering in depth, but could this be related to the setup of my listener for other silos? I can see in Resource Monitor that my listener (port 22222) is listening on the IPv4 loopback interface. If I remember correctly, that means it will only accept local socket connections.

gabikliot commented 8 years ago

Yes, that is correct. If you give them different deployment ids, they don't even talk to each other: instead of one cluster you get multiple singleton clusters.

1) Stop all silos.
2) Clear the membership table and make sure it is empty.
3) Restart the silos.
4) Never start 2 silos on the same machine. If you do want to do that, which we don't recommend, make sure you give them different ports, both the silo-to-silo port and the proxy port.
5) Make sure all silos can talk to each other; that is, make sure all firewall ports are open.
6) When you want to restart the silos, it's better to stop them gracefully first, by calling Shutdown. If you can't Shutdown and just hard-kill them, it's OK, but you will see those messages like the above in the log. Still, as long as you follow point 4 above, the silo will restart and all will work.

Essentially, I have written above short instructions on how to run an Orleans cluster on premises. You will basically need to script/build some kind of "deployment infrastructure", so you need to be aware of, and handle yourself, all those small things that a Cloud PaaS usually provides for you.
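The graceful shutdown mentioned in step 6 might look like the following sketch, assuming the Orleans 1.x `SiloHost` API (`InitializeOrleansSilo`, `StartOrleansSilo`, `ShutdownOrleansSilo`); the silo name and config are placeholders:

    // Start a self-hosted silo.
    var siloHost = new SiloHost("Silo1", config);
    siloHost.InitializeOrleansSilo();
    siloHost.StartOrleansSilo();

    // ... later, on a planned restart, stop it gracefully so it
    // deregisters itself from the membership table:
    siloHost.ShutdownOrleansSilo();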

gabikliot commented 8 years ago

I can see in Resource Monitor that my listener (port 22222) is listening on the IPv4 loopback interface. If I remember correctly, that means it will only accept local socket connections.

Yes, of course. It should listen on a real, non-loopback interface.

gabikliot commented 8 years ago
            HostNameOrIPAddress = "localhost",
            Port = SiloPortno,
            ProxyGatewayEndpoint = new IPEndPoint(IPAddress.Any, ClientPortno),

These should not be localhost and Any, of course.
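A corrected version of the Defaults block quoted above might look like this sketch, using the machine's real hostname rather than localhost / IPAddress.Any (`Dns.GetHostName()` is from System.Net; picking the first resolved address is an illustrative simplification):

    Defaults =
    {
        // Bind to the machine's real hostname so other silos can reach it.
        HostNameOrIPAddress = Dns.GetHostName(),
        Port = SiloPortno,
        // Expose the client gateway on a real interface, not 0.0.0.0.
        ProxyGatewayEndpoint = new IPEndPoint(
            Dns.GetHostAddresses(Dns.GetHostName())[0], ClientPortno),
    },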

FenrisWolfAtMiddleEarth commented 8 years ago

Setting the silo listener to "0.0.0.0" solved the problem with the loopback address; however, one of the two silos would still commit suicide. I am guessing that you are using the HostNameOrIPAddress property as a unique name for a silo. Is that correct?

So using the hostname instead seems to have solved the problem. I should have known that just by looking at the name of the property, of course. Apologies for that.

I can see that on our hardware, Orleans will start a listener on IPv4. I am not sure if you are just using the first address returned by DNS or there is a preference for IPv4. However, isn't it good style to start a listener on both IPv4 and IPv6, so it is prepared for IPv6 usage?

Other than that, my problem is solved. :-) Thanks again for your help.

gabikliot commented 8 years ago

I am glad that the issue was solved. We indeed open only one socket, on one interface and not both. You can specify which one.

FenrisWolfAtMiddleEarth commented 8 years ago

Regarding the bug-fix referenced in this report: has it been released in the newest version on NuGet?

TulkasLaugh commented 5 years ago

It indeed polls at that frequency. You will have up to 1 minute of full disconnect ONLY if all silos are restarted at once, which almost never happens in the Cloud. Azure as well as Google Cloud, for example, restart VMs in parts (fault domains in Azure, zones in Google Cloud), never all together (up to some SLA guarantee, of course). So basically "service availability" of silos is provided by the hosting framework.

In your case you are hosting yourself and updating/restarting yourself, so it's up to you to manage that. If you restart all at once, then indeed you will have up to 1 minute of disconnect. But it also depends on how long it takes for silos to restart/reboot. If it takes, let's say, 20 seconds, then you will only have an additional 40 seconds.

You can decrease it to 1 second. Basically, this is why it is configurable. The default setting of 1 minute works well for Azure or other auto-managed hosting infrastructure. If you are building your own, you can and should definitely play with that value.

How do you decrease it to 1 second while creating a ClientBuilder?
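In the newer `ClientBuilder`-based API, this setting appears to live on `GatewayOptions`; a hedged sketch assuming Orleans 2.x+ (`UseLocalhostClustering` is a placeholder for your real clustering configuration — verify option names against your version):

    var client = new ClientBuilder()
        .UseLocalhostClustering() // replace with your clustering configuration
        .Configure<GatewayOptions>(options =>
        {
            // Refresh the gateway list every second instead of the default.
            options.GatewayListRefreshPeriod = TimeSpan.FromSeconds(1);
        })
        .Build();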