dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Docker Cluster on Multiple Hosts #3551

Open pms1969 opened 7 years ago

pms1969 commented 7 years ago

The Docker setup in the documentation is great if everything is running on one host, but if you are running a cluster across multiple hosts, as in AWS Elastic Beanstalk, then the configuration can't be used, since the hosts need to communicate over the published IP/port.

To work around this, I thought I'd first add a "bind address" to the config and pass that around to the SocketManager, but quite frankly I got lost in the implementation, and eventually settled on a TryBind on the configured address, followed by a fallback bind to 0.0.0.0 on the same port. This gets the node up and running, but then every 30 seconds I get these two messages in my logs:

[2017-10-13 13:46:20.567 GMT    32      WARNING 100157  CallbackData    10.11.3.211:10000]      Response did not arrive on time in 00:00:30 for message: Request S10.11.3.211:10000:245597809DeploymentLoadPublisherSystemTarget@S00000016->S10.11.3.211:10000:245597809DeploymentLoadPublisherSystemTarget@S00000016 #21: global::Orleans.Runtime.IDeploymentLoadPublisher:UpdateRuntimeStatistics(). Target History is: <S10.11.3.211:10000:245597809:DeploymentLoadPublisherSystemTarget:@S00000016>. About to break its promise.
Response did not arrive on time in 00:00:30 for message: Request S10.11.3.211:10000:245597809DeploymentLoadPublisherSystemTarget@S00000016->S10.11.3.211:10000:245597809DeploymentLoadPublisherSystemTarget@S00000016 #21: global::Orleans.Runtime.IDeploymentLoadPublisher:UpdateRuntimeStatistics(). Target History is: <S10.11.3.211:10000:245597809:DeploymentLoadPublisherSystemTarget:@S00000016>.
Exception = System.TimeoutException: Response did not arrive on time in 00:00:30 for message: Request S10.11.3.211:10000:245597809DeploymentLoadPublisherSystemTarget@S00000016->S10.11.3.211:10000:245597809DeploymentLoadPublisherSystemTarget@S00000016 #21: global::Orleans.Runtime.IDeploymentLoadPublisher:UpdateRuntimeStatistics(). Target History is: <S10.11.3.211:10000:245597809:DeploymentLoadPublisherSystemTarget:@S00000016>.
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.Runtime.DeploymentLoadPublisher.<PublishStatistics>d__14.MoveNext()

[2017-10-13 13:46:20.573 GMT    12      WARNING 101802  DeploymentLoadPublisher 10.11.3.211:10000]      An exception was thrown by PublishStatistics.UpdateRuntimeStatistics(). Ignoring.
Exc level 0: System.TimeoutException: Response did not arrive on time in 00:00:30 for message: Request S10.11.3.211:10000:245597809DeploymentLoadPublisherSystemTarget@S00000016->S10.11.3.211:10000:245597809DeploymentLoadPublisherSystemTarget@S00000016 #21: global::Orleans.Runtime.IDeploymentLoadPublisher:UpdateRuntimeStatistics(). Target History is: <S10.11.3.211:10000:245597809:DeploymentLoadPublisherSystemTarget:@S00000016>.
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.Runtime.DeploymentLoadPublisher.<PublishStatistics>d__14.MoveNext()

The cluster is running in a private subnet. The IP of the host is 10.11.3.211, and it's bound to port 10000.

It is worth noting that any communication with the host times out on the client:

System.TimeoutException : Response did not arrive on time in 00:00:30 for message: Request *cli/853e2989@2adc2691->S127.0.0.1:30000:0*grn/A157E9A7/00000000+3c32b035-0d34-4a85-b79b-14ed49838c3c #4:

On further examination, after turning the logging up to insane levels, the reason for those two log messages is that the messages are stuck in a forwarding loop.

I tracked this down to the MessageCenter.Initialize call, where it creates the SiloAddress with the bound port rather than the advertised one. I changed that accordingly, and everything seems to be behaving itself now that I've sorted out some other environmental issues.

Can you see anything particularly wrong with this approach? Am I storing up undefined behaviour for some other time?

The full diff, against SHA 2cddf2fba09182e5baf4c17531092a19b5a4f82d, since I couldn't get the head of master to work for me at all:

diff --git i/src/Orleans/Messaging/SocketManager.cs w/src/Orleans/Messaging/SocketManager.cs
index 64e09d88..30c45eed 100644
--- i/src/Orleans/Messaging/SocketManager.cs
+++ w/src/Orleans/Messaging/SocketManager.cs
@@ -7,6 +7,21 @@

 namespace Orleans.Runtime
 {
+    internal static class SocketExtensions
+    {
+        internal static bool TryBind(this Socket socket, IPEndPoint address)
+        {
+            try
+            {
+                socket.Bind(address);
+                return true;
+            }
+            catch
+            {
+                return false;
+            }
+        }
+    }
     internal class SocketManager
     {
         private readonly LRU<IPEndPoint, Socket> cache;
@@ -36,7 +51,11 @@ internal static Socket GetAcceptingSocketForEndpoint(IPEndPoint address)
                 s.LingerState = new LingerOption(true, 0);
                 s.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.ReuseAddress, true);
                 // And bind it to the address
-                s.Bind(address);
+                if (!s.TryBind(address))
+                {
+                    var local = new IPEndPoint(IPAddress.Parse("0.0.0.0"), address.Port);
+                    s.Bind(local);
+                }
             }
             catch (Exception)
             {
diff --git i/src/OrleansRuntime/Messaging/MessageCenter.cs w/src/OrleansRuntime/Messaging/MessageCenter.cs
index ac9eb590..6ae90592 100644
--- i/src/OrleansRuntime/Messaging/MessageCenter.cs
+++ w/src/OrleansRuntime/Messaging/MessageCenter.cs
@@ -64,7 +64,7 @@ private void Initialize(IPEndPoint here, int generation, IMessagingConfiguration

             SocketManager = new SocketManager(config);
             ima = new IncomingMessageAcceptor(this, here, SocketDirection.SiloToSilo, this.messageFactory, this.serializationManager);
-            MyAddress = SiloAddress.New((IPEndPoint)ima.AcceptingSocket.LocalEndPoint, generation);
+            MyAddress = SiloAddress.New(here, generation);
             MessagingConfiguration = config;
             InboundQueue = new InboundMessageQueue();
             OutboundQueue = new OutboundMessageQueue(this, config, this.serializationManager);

Happy to submit this as a pull request if you think it's worthy.

Cheers Paul

galvesribeiro commented 7 years ago

Hello Paul,

First of all thank you for the feedback.

I would like to understand more about your use case, so I wonder if you can tell me more about your deployment.

To start:

  1. Are you using Swarm or Kubernetes?
  2. If in Swarm, are you sure your silos are all in the same Overlay network?
  3. What version of Orleans packages are you using?
  4. Can you share your Dockerfile and/or compose file for all parties involved (Orleans client and silo)?

I have a service in production running on top of Orleans with Swarm; it has several silos and clients, and all work fine.

That error is indeed a networking issue. Most likely there is a problem with your setup.

pms1969 commented 7 years ago

Hi @galvesribeiro,

I'll answer as best I can:

  1. I'm using neither Swarm nor Kubernetes. We deploy into AWS Elastic Beanstalk, which scales out instances using plain Docker. There is no Swarm or other cluster manager that I am aware of.
  2. as per 1.
  3. That was the .NET Core preview3 from MyGet. (I've patched it with the code above, from what I think was the SHA used to create those packages.)
  4. Unfortunately I cannot share our setup. I can share some non-specific details though:
    • Using AWS DynamoDB as the custom cluster membership table provider (wrong terminology, I know)
    • The client uses the table to identify possible silos
    • Underpinned by Elastic Beanstalk autoscaling (multi-container Docker)
    • The client is in a completely different Beanstalk.
    • We are not guaranteed the IP addresses of the instances, so the client cannot rely on them. The instances do have a load-balanced endpoint, but that cannot be used for silo-to-silo communication

If you were to replicate this locally with two machines independently running Docker with identical networking setups, both containers would get the same internal Docker IP address.
My silo config XML is as follows:

<?xml version="1.0" encoding="utf-8"?>
<OrleansConfiguration xmlns="urn:orleans" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="OrleansConfiguration.xsd">
  <Globals>
    <SystemStore SystemStoreType="Custom"
                 DeploymentId="0.1"
                 DataConnectionString="Service=eu-west-1;"
                 MembershipTableAssembly="OrleansAWSUtils"
                 ReminderTableAssembly="OrleansAWSUtils" />
    <ReminderService ReminderServiceType="None" />
  </Globals>
  <Defaults>
    <Networking Address="" Port="10000" />
    <ProxyingGateway Address="" Port="30000" />
    <Tracing DefaultTraceLevel="Info" TraceToConsole="true" />
  </Defaults>
</OrleansConfiguration>

The actual Networking and ProxyingGateway addresses are filled in at runtime with the host's IP address.
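
For illustration, that runtime fill-in could be done by querying the EC2 instance metadata endpoint for the host's private IPv4 and patching the empty `Address` attributes before the silo starts. This is only a sketch: `ConfigPatcher` and its methods are hypothetical helpers, not Orleans APIs, and the metadata endpoint is reachable only from inside an AWS instance.

```csharp
using System.Net.Http;
using System.Xml.Linq;

static class ConfigPatcher
{
    // Hypothetical helper: fetch the host's private IPv4 from the EC2
    // instance metadata endpoint (only reachable from inside an AWS instance).
    public static string GetHostPrivateIp()
    {
        using var http = new HttpClient();
        return http.GetStringAsync(
            "http://169.254.169.254/latest/meta-data/local-ipv4").Result.Trim();
    }

    // Fill the empty Address attributes of the Networking and
    // ProxyingGateway elements with the supplied host IP.
    public static string FillAddresses(string configXml, string hostIp)
    {
        XNamespace ns = "urn:orleans"; // default namespace of the config file
        var doc = XDocument.Parse(configXml);
        foreach (var name in new[] { "Networking", "ProxyingGateway" })
        {
            var element = doc.Root?.Element(ns + "Defaults")?.Element(ns + name);
            element?.SetAttributeValue("Address", hostIp);
        }
        return doc.ToString();
    }
}
```

The patched string would then be written back out (or fed to the silo host) before start-up.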

The client config XML:

<?xml version="1.0" encoding="utf-8"?>
<ClientConfiguration xmlns="urn:orleans" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="OrleansConfiguration.xsd">
  <SystemStore SystemStoreType="Custom"
             DeploymentId="0.1"
             DataConnectionString="Service=eu-west-1;"
             MembershipTableAssembly="OrleansAWSUtils"
             GatewayProvider="Custom"
             CustomGatewayProviderAssemblyName="OrleansAWSUtils"
             ReminderTableAssembly="OrleansAWSUtils" />
</ClientConfiguration>

Cheers, Paul

galvesribeiro commented 7 years ago

Ok, I need to investigate more, but my first guess is that Beanstalk does not build a network for its containers, which would explain why the silos can't talk to each other.

I'll read more about it, but if that is the case, your proposed fix won't help.

Orleans requires network connectivity between the silos (LAN, site-to-site VPN, etc.). If your containers can't talk to each other, there is no way to form a cluster, which would explain the messages you mentioned.
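
A quick way to check raw silo-to-silo reachability, independent of Orleans, is a plain TCP probe from one container to the other silo's advertised endpoint. A minimal sketch (the host and port in the usage note are placeholders):

```csharp
using System;
using System.Net.Sockets;

static class Reachability
{
    // Returns true if a TCP connection to host:port succeeds within the timeout.
    public static bool CanConnect(string host, int port, TimeSpan timeout)
    {
        try
        {
            using var client = new TcpClient();
            var connect = client.ConnectAsync(host, port);
            // Wait throws if the connect faults (refused/unreachable);
            // the catch below turns that into false.
            return connect.Wait(timeout) && client.Connected;
        }
        catch
        {
            return false;
        }
    }
}
```

For example, `Reachability.CanConnect("10.11.3.211", 10000, TimeSpan.FromSeconds(2))` run from inside one container tells you whether the other silo's port is reachable at all.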

pms1969 commented 7 years ago

Hi @galvesribeiro, why won't it work? I have a silo with this code deployed, and I've scaled it to two instances and it works just fine.

The containers can't talk directly to each other, but by publishing the network address of their host, they can. It also means that the clients don't have to override the gateway after discovery.
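
The bind-vs-advertise split being discussed can be sketched outside Orleans: listen on `0.0.0.0` so the published Docker port reaches the socket, while recording a specific, routable host address to hand out to peers via the membership table. The class and property names below are illustrative, not Orleans code.

```csharp
using System.Net;
using System.Net.Sockets;

// Sketch: accept connections on all interfaces (so the container's
// published port reaches the socket), but advertise a routable host
// address to the rest of the cluster.
class AdvertisedListener
{
    public IPEndPoint AdvertisedEndpoint { get; } // what peers are told
    public Socket Socket { get; }
    public int BoundPort { get; }                 // what we actually bound

    public AdvertisedListener(IPAddress advertisedIp, int port)
    {
        AdvertisedEndpoint = new IPEndPoint(advertisedIp, port);
        Socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream,
                            ProtocolType.Tcp);
        Socket.Bind(new IPEndPoint(IPAddress.Any, port)); // 0.0.0.0:port
        Socket.Listen(16);
        BoundPort = ((IPEndPoint)Socket.LocalEndPoint!).Port;
    }
}
```

The patch above does essentially this: the socket binds to the wildcard address as a fallback, while the SiloAddress registered in membership carries the configured (advertised) endpoint.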

galvesribeiro commented 7 years ago

@pms1969 Beanstalk has no internal network. You need to publish the public IP address of each node and make the silos communicate over the public IP/port, which is not a good idea.

Orleans was designed to work in a trusted network, since it has (as yet) no means of securing the connections between silos and clients.

I would recommend moving away from Beanstalk if you plan to use Orleans. A good alternative is the regular AWS EC2 Container Service, which gives you a container cluster where you can create networks between containers, so Orleans would work fine with the documented setup.

Just to give you an idea, Beanstalk works very similarly to Azure App Service. It was made to pack simple (web) applications that are mostly stateless and only process incoming (HTTP) requests. We can't run Orleans on Azure App Service for the same reason we can't on Beanstalk: there is no inter-node/private network. What people usually do is host the public/frontend Orleans client on Azure App Service and connect it to another Azure service hosting the Orleans cluster (Cloud Service with Worker Roles, Container Service, VMs, etc.) by creating a VPN between the two vNets (the App Service one and the one attached to the service running the Orleans cluster).

I hope it helps.

beckjin commented 7 years ago

Hi @galvesribeiro, can you give me an example of using Docker Swarm?