akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

Unreachable nodes stay in the Up state, and nodes become unreachable when a node in the same role is unreachable #3348

Open Blind-Striker opened 6 years ago

Blind-Striker commented 6 years ago

Hi,

A few weeks ago we moved our Akka.Cluster application, which is responsible for sending notifications, to AWS along with all of our infrastructure.

We have an Akka Cluster application for SMS, email, and push notifications. Three different kinds of nodes exist in the system: client, sender, and lighthouse. The client role is used by the Web application and the API application (both hosted in IIS). The lighthouse and sender roles are hosted as Windows services.

The number of nodes in the client role increased when we moved our infrastructure to AWS. We now have 5 client nodes in total: 3 for the Web app and 2 for the API. All of these nodes run on separate EC2 instances.

We have a cluster like this;

```
----------- Web App --------------
client 1     : 172.31.17.81:3563
client 2     : 172.31.17.193:3563
client 3     : 172.31.18.2:3563
----------- Api ------------------
client 4     : 172.31.18.147:3564
client 5     : 172.31.16.106:3564

Lighthouse 1 : 172.31.27.81:17527
sender       : 172.31.27.81:56591

Lighthouse 2 : 172.31.31.130:17528
```

Stability problems began shortly after. Nodes started flipping between reachable and unreachable far too frequently. Our investigation suggested that this kind of issue can occur on AWS instances with "low to moderate" network performance, so we decided to change the failure-detector threshold settings:

akka {
    cluster {
            seed-nodes = ["akka.tcp://notificationSystem@127.0.0.1:5054"]
            roles = [sender]

            failure-detector {
                threshold = 12
                acceptable-heartbeat-pause = 30s
                heartbeat-interval = 5s
                expected-response-after = 10 s
            }

            use-dispatcher = cluster-dispatcher                     

            run-coordinated-shutdown-when-down = on
    }
}

cluster-dispatcher {
    type = "Dispatcher"
    executor = "fork-join-executor"
    fork-join-executor {
        parallelism-min = 2
        parallelism-max = 4
    }
}
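For context on how these numbers interact: the cluster failure detector is a phi accrual detector, so a member is only suspected once its phi value exceeds `threshold` (12 here), and phi grows as heartbeats arrive later than the learned mean interval. Below is a minimal Python sketch of the phi formula (the logistic approximation of the normal CDF that, to our understanding, the Akka implementations use; the timing values are illustrative, not measured):

```python
import math

def phi(time_since_last_ms: float, mean_ms: float, std_dev_ms: float) -> float:
    """Suspicion level of a phi accrual failure detector.

    phi = -log10(probability that the next heartbeat is merely late),
    assuming normally distributed heartbeat inter-arrival times.
    """
    y = (time_since_last_ms - mean_ms) / std_dev_ms
    # Logistic approximation of the normal CDF tail
    e = math.exp(-y * (1.5976 + 0.070566 * y * y))
    if time_since_last_ms > mean_ms:
        return -math.log10(e / (1.0 + e))
    return -math.log10(1.0 - 1.0 / (1.0 + e))

# With a 5s heartbeat interval (mean 5000 ms, std dev 500 ms), phi is low at
# the expected arrival time and shoots past threshold = 12 once a heartbeat
# is several standard deviations late:
print(round(phi(5000, 5000, 500), 3))  # ~0.301
print(phi(8500, 5000, 500) > 12)       # True
```

Raising `threshold` and `acceptable-heartbeat-pause` therefore trades slower failure detection for more tolerance of network jitter, which is why these settings calmed the reachable/unreachable flapping.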

After applying these settings, the cluster became more stable and ran smoothly for a long time. But then we started to observe that every 3-4 days the nodes in the client role became quarantined.

The interesting thing is that their status remains Up despite them being unreachable.

(screenshot: cluster state showing the unreachable members still listed as Up)

When we examined the logs, we found something even more interesting.

Role Node IP RenderedMessage
lighthouse lighthouse 1 172.31.27.81:17527 { Source: "akka", LogSource: "[akka://notificationSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FnotificationSystem%40172.31.17.81%3A3563-17140/endpointWriter#865218644]", ThreadName: "akka.remote.default-remote-dispatcher_1", Message: "AssociationError [akka.tcp://notificationSystem@172.31.27.81:17527] -> akka.tcp://notificationSystem@172.31.17.81:3563: Error [Association failed with akka.tcp://notificationSystem@172.31.17.81:3563] []", Timestamp: 03/05/2018 09:35:21 }
lighthouse lighthouse 1 172.31.27.81:17527 { Source: "akka", LogSource: "[akka://notificationSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FnotificationSystem%40172.31.17.193%3A3563-17141/endpointWriter#1265060635]", ThreadName: "akka.remote.default-remote-dispatcher_3", Message: "AssociationError [akka.tcp://notificationSystem@172.31.27.81:17527] -> akka.tcp://notificationSystem@172.31.17.193:3563: Error [Association failed with akka.tcp://notificationSystem@172.31.17.193:3563] []", Timestamp: 03/05/2018 09:35:21 }
lighthouse lighthouse 1 172.31.27.81:17527 { Source: "akka", LogSource: "[akka://notificationSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FnotificationSystem%40172.31.17.193%3A3563-17142/endpointWriter#1389557585]", ThreadName: "akka.remote.default-remote-dispatcher_2", Message: "AssociationError [akka.tcp://notificationSystem@172.31.27.81:17527] -> akka.tcp://notificationSystem@172.31.17.193:3563: Error [Association failed with akka.tcp://notificationSystem@172.31.17.193:3563] []", Timestamp: 03/05/2018 09:35:22 }
client api 1 172.31.18.147:3564 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#1833147155]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.17.81:3563, ", Timestamp: 03/05/2018 09:35:24 }
client api 1 172.31.18.147:3564 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#1833147155]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.17.193:3563, ", Timestamp: 03/05/2018 09:35:24 }
client api 1 172.31.18.147:3564 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#1833147155]", ThreadName: null, Message: "Unreachable Member; Role: lighthouse, Status: Up, Address: akka.tcp://notificationSystem@172.31.27.81:17527, ", Timestamp: 03/05/2018 09:35:24 }
lighthouse lighthouse 2 172.31.31.130:17528 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#295365818]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.17.193:3563, ", Timestamp: 03/05/2018 09:35:25 }
lighthouse lighthouse 2 172.31.31.130:17528 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#295365818]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.17.81:3563, ", Timestamp: 03/05/2018 09:35:25 }
lighthouse lighthouse 2 172.31.31.130:17528 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#295365818]", ThreadName: null, Message: "Unreachable Member; Role: lighthouse, Status: Up, Address: akka.tcp://notificationSystem@172.31.27.81:17527, ", Timestamp: 03/05/2018 09:35:25 }
lighthouse lighthouse 2 172.31.31.130:17528 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#295365818]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.18.147:3564, ", Timestamp: 03/05/2018 09:35:25 }
client api 2 172.31.16.106:3564 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#1527169364]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.18.147:3564, ", Timestamp: 03/05/2018 09:35:25 }
client api 2 172.31.16.106:3564 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#1527169364]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.17.81:3563, ", Timestamp: 03/05/2018 09:35:25 }
client api 2 172.31.16.106:3564 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#1527169364]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.17.193:3563, ", Timestamp: 03/05/2018 09:35:25 }
client api 2 172.31.16.106:3564 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#1527169364]", ThreadName: null, Message: "Unreachable Member; Role: lighthouse, Status: Up, Address: akka.tcp://notificationSystem@172.31.27.81:17527, ", Timestamp: 03/05/2018 09:35:25 }
lighthouse lighthouse 1 172.31.27.81:17527 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#1974900305]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.17.81:3563, ", Timestamp: 03/05/2018 09:35:26 }
lighthouse lighthouse 1 172.31.27.81:17527 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#1974900305]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.17.193:3563, ", Timestamp: 03/05/2018 09:35:26 }
lighthouse lighthouse 1 172.31.27.81:17527 { Source: "akka", LogSource: "[akka://notificationSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FnotificationSystem%40172.31.17.81%3A3563-17143/endpointWriter#318817580]", ThreadName: "akka.remote.default-remote-dispatcher_4", Message: "AssociationError [akka.tcp://notificationSystem@172.31.27.81:17527] -> akka.tcp://notificationSystem@172.31.17.81:3563: Error [Association failed with akka.tcp://notificationSystem@172.31.17.81:3563] []", Timestamp: 03/05/2018 09:35:26 }
lighthouse lighthouse 1 172.31.27.81:17527 { Source: "akka", LogSource: "[akka://notificationSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FnotificationSystem%40172.31.17.193%3A3563-17144/endpointWriter#397417706]", ThreadName: "akka.remote.default-remote-dispatcher_2", Message: "AssociationError [akka.tcp://notificationSystem@172.31.27.81:17527] -> akka.tcp://notificationSystem@172.31.17.193:3563: Error [Association failed with akka.tcp://notificationSystem@172.31.17.193:3563] []", Timestamp: 03/05/2018 09:35:26 }
client api 1 172.31.18.147:3564 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#1833147155]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.17.81:3563, ", Timestamp: 03/05/2018 09:35:29 }
client api 1 172.31.18.147:3564 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#1833147155]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.17.193:3563, ", Timestamp: 03/05/2018 09:35:29 }
client api 1 172.31.18.147:3564 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#1833147155]", ThreadName: null, Message: "Unreachable Member; Role: lighthouse, Status: Up, Address: akka.tcp://notificationSystem@172.31.27.81:17527, ", Timestamp: 03/05/2018 09:35:29 }
lighthouse lighthouse 2 172.31.31.130:17528 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#295365818]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.18.147:3564, ", Timestamp: 03/05/2018 09:35:30 }
client api 2 172.31.16.106:3564 { Source: "akka", LogSource: "[akka://notificationSystem/user/clusterstatus#1527169364]", ThreadName: null, Message: "Unreachable Member; Role: client, Status: Up, Address: akka.tcp://notificationSystem@172.31.18.147:3564, ", Timestamp: 03/05/2018 09:35:30 }
lighthouse lighthouse 1 172.31.27.81:17527 { Source: "akka", LogSource: "[akka://notificationSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FnotificationSystem%40172.31.17.193%3A3563-17149/endpointWriter#476972598]", ThreadName: "akka.remote.default-remote-dispatcher_1", Message: "AssociationError [akka.tcp://notificationSystem@172.31.27.81:17527] -> akka.tcp://notificationSystem@172.31.17.193:3563: Error [Association failed with akka.tcp://notificationSystem@172.31.17.193:3563] []", Timestamp: 03/05/2018 09:35:36 }

Every node in the cluster agrees that web 1 (172.31.17.81:3563) and web 2 (172.31.17.193:3563) are unreachable, yet their status is still Up. In addition:

- Api 1 (172.31.18.147:3564) thinks lighthouse 1 (172.31.27.81:17527) is unreachable along with web 1 and web 2.
- Api 2 (172.31.16.106:3564) thinks api 1 and lighthouse 1 are unreachable along with web 1 and web 2.
- Lighthouse 2 (172.31.31.130:17528) thinks api 1 and lighthouse 1 are unreachable along with web 1 and web 2.

And so on.

We saw that most of the AssociationErrors reported by the lighthouse nodes have this InnerException:

An entry with the same key already exists.

Stacktrace

 at System.ThrowHelper.ThrowArgumentException(ExceptionResource resource)
   at System.Collections.Generic.TreeSet`1.AddIfNotPresent(T item)
   at System.Collections.Generic.SortedDictionary`2.Add(TKey key, TValue value)
   at Akka.Routing.ConsistentHash.Create[T](IEnumerable`1 nodes, Int32 virtualNodesFactor)
   at Akka.Cluster.Tools.Client.ClusterReceptionist.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
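For what it's worth, that stack trace indicates `ConsistentHash.Create` tried to insert two ring entries with the same key into a `SortedDictionary`, which throws on duplicates. A hypothetical Python sketch of that failure mode (the hashing scheme and names are illustrative, not Akka's actual implementation):

```python
import hashlib

def ring_key(node: str, vnode: int) -> int:
    # Place virtual node "node:vnode" on the hash ring
    digest = hashlib.md5(f"{node}:{vnode}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def create_ring(nodes, virtual_nodes_factor):
    ring = {}
    for node in nodes:
        for v in range(virtual_nodes_factor):
            key = ring_key(node, v)
            if key in ring:
                # Mirrors SortedDictionary.Add throwing on a duplicate key
                raise ValueError("An entry with the same key already exists.")
            ring[key] = node
    return ring

# A node address that appears twice in the node list collides with itself
# immediately:
try:
    create_ring(["nodeA", "nodeA"], 10)
except ValueError as exc:
    print(exc)  # An entry with the same key already exists.
```

One way that could happen here is the receptionist's node list ending up with the same address twice after churn, which would fit the repeated re-associations in the logs above, though we cannot confirm that from our side.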

But if we manually down one of the unreachable members (for example web 1, 172.31.17.81:3563), the whole cluster stabilizes, every node becomes reachable again, and work resumes. After that we restart the node we manually downed, and it rejoins the cluster without any problem.

We experience this issue every 3 or 4 days.

When one of the nodes becomes unreachable, the other nodes in the same role become unreachable too, and if we down just one of them, all of them become reachable again.
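As an aside, the `auto-down-unreachable-after` setting (commented out in our lighthouse config below) would automate exactly this manual downing. A fragment for illustration only, since auto-downing is generally discouraged:

```
akka.cluster {
  # CAUTION: illustrative only. During a network partition each side may
  # auto-down the other and form its own cluster (split-brain).
  auto-down-unreachable-after = 30s
}
```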

Our HOCON configs:

Lighthouse:

lighthouse{
        actorsystem: "notificationSystem"
}

akka {
    loglevel = INFO

    actor { 
        provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"

        serializers {
            hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
        }

        serialization-bindings {
            "System.Object" = hyperion
        }
    }

    remote {
        log-remote-lifecycle-events = ERROR
        dot-netty.tcp {
            transport-class = "Akka.Remote.Transport.DotNetty.TcpTransport,Akka.Remote"
            applied-adapters = []
            transport-protocol = tcp
            #will be populated with a dynamic host-name at runtime if left uncommented
            public-hostname = "127.0.0.1"
            hostname = "127.0.0.1"
            port = 5054
        }
    }            

    loggers = ["Armut.Logging.Akka.ArmutAkkaLogger, Armut.Logging.Akka"]

    cluster {
        seed-nodes = ["akka.tcp://notificationSystem@127.0.0.1:5054"]
        #seed-node-timeout = 15s
        #auto-down-unreachable-after = 10s
        roles = [lighthouse]

        failure-detector {
            threshold = 12
            acceptable-heartbeat-pause = 30s
            heartbeat-interval = 5s
            expected-response-after = 10 s
        }

        use-dispatcher = cluster-dispatcher                         

        run-coordinated-shutdown-when-down = on
    }
}

cluster-dispatcher {
    type = "Dispatcher"
    executor = "fork-join-executor"
    fork-join-executor {
        parallelism-min = 2
        parallelism-max = 4
    }
}

Client:

akka{
    # stdout-loglevel = DEBUG
    loglevel = DEBUG
    # log-config-on-start = on

    #loggers = ["Armut.Logging.Akka.ArmutAkkaLogger, Armut.Logging.Akka"]

    actor{
        debug {  
            # receive = on 
            # autoreceive = on
            # lifecycle = on
            # event-stream = on
            # unhandled = on
        }         

        provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"           

        serializers {
            hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
            akka-pubsub = "Akka.Cluster.Tools.PublishSubscribe.Serialization.DistributedPubSubMessageSerializer, Akka.Cluster.Tools"
            akka-cluster-client = "Akka.Cluster.Tools.Client.Serialization.ClusterClientMessageSerializer, Akka.Cluster.Tools"
        }

        serialization-bindings {
            "System.Object" = hyperion 
            "Akka.Cluster.Tools.PublishSubscribe.IDistributedPubSubMessage, Akka.Cluster.Tools" = akka-pubsub
            "Akka.Cluster.Tools.PublishSubscribe.Internal.SendToOneSubscriber, Akka.Cluster.Tools" = akka-pubsub
            "Akka.Cluster.Tools.Client.IClusterClientMessage, Akka.Cluster.Tools" = akka-cluster-client
        }

        serialization-identifiers {
            "Akka.Cluster.Tools.PublishSubscribe.Serialization.DistributedPubSubMessageSerializer, Akka.Cluster.Tools" = 9
            "Akka.Cluster.Tools.Client.Serialization.ClusterClientMessageSerializer, Akka.Cluster.Tools" = 15
        }

        deployment{
            /NotificationCoordinator/AwsSnsActor{
                router = round-robin-pool
                resizer{
                    enabled = on
                    lower-bound = 3
                    upper-bound = 5
                }
            }                       

            /NotificationCoordinator/EmailActor{
                router = round-robin-pool
                resizer{
                    enabled = on
                    lower-bound = 3
                    upper-bound = 5
                }
            }

            /NotificationCoordinator/SmsActor{
                router = round-robin-pool
                resizer{
                    enabled = on
                    lower-bound = 3
                    upper-bound = 5
                }
            }

            /NotificationCoordinator/LoggingCoordinator/DatabaseActor{
                router = round-robin-pool
                resizer{
                    enabled = on
                    lower-bound = 3
                    upper-bound = 5
                }
            }                           

            /NotificationDeciding/NotificationDecidingWorkerActor{
                router = round-robin-pool
                resizer{
                    enabled = on
                    lower-bound = 3
                    upper-bound = 5
                }
            }

            /ScheduledNotificationCoordinator/SendToProMaster/JobToProWorker{
                router = round-robin-pool
                resizer{
                    enabled = on
                    lower-bound = 3
                    upper-bound = 5
                }
            }
        }
    }

    remote{                         
        log-remote-lifecycle-events = DEBUG
        log-received-messages = off

        dot-netty.tcp {
            transport-class = "Akka.Remote.Transport.DotNetty.TcpTransport,Akka.Remote"
            applied-adapters = []
            transport-protocol = tcp
            #will be populated with a dynamic host-name at runtime if left uncommented
            #public-hostname = "POPULATE STATIC IP HERE"
            hostname = "127.0.0.1"
            port = 0
        }
    }

    cluster {
        seed-nodes = ["akka.tcp://notificationSystem@127.0.0.1:5054"]
        roles = [sender]

        failure-detector {
            threshold = 12
            acceptable-heartbeat-pause = 30s
            heartbeat-interval = 5s
            expected-response-after = 10 s
        }

        use-dispatcher = cluster-dispatcher                     

        run-coordinated-shutdown-when-down = on
    }
}

cluster-dispatcher {
    type = "Dispatcher"
    executor = "fork-join-executor"
    fork-join-executor {
        parallelism-min = 2
        parallelism-max = 4
    }
}

Sender:

akka{
    loglevel = DEBUG

    actor{
        provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"

        deployment {
            /coordinatorRouter {
                router = round-robin-group
                routees.paths = ["/user/NotificationCoordinator"]
                cluster {
                        enabled = on
                        max-nr-of-instances-per-node = 1
                        allow-local-routees = off
                        use-role = sender
                }
            }

            /decidingRouter {
                router = round-robin-group
                routees.paths = ["/user/NotificationDeciding"]
                cluster {
                        enabled = on
                        max-nr-of-instances-per-node = 1
                        allow-local-routees = off
                        use-role = sender
                }
            }
        }

        serializers {
            hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
            akka-pubsub = "Akka.Cluster.Tools.PublishSubscribe.Serialization.DistributedPubSubMessageSerializer, Akka.Cluster.Tools"
            akka-cluster-client = "Akka.Cluster.Tools.Client.Serialization.ClusterClientMessageSerializer, Akka.Cluster.Tools"
        }

        serialization-bindings {
            "System.Object" = hyperion 
            "Akka.Cluster.Tools.PublishSubscribe.IDistributedPubSubMessage, Akka.Cluster.Tools" = akka-pubsub
            "Akka.Cluster.Tools.PublishSubscribe.Internal.SendToOneSubscriber, Akka.Cluster.Tools" = akka-pubsub
            "Akka.Cluster.Tools.Client.IClusterClientMessage, Akka.Cluster.Tools" = akka-cluster-client
        }

        serialization-identifiers {
            "Akka.Cluster.Tools.PublishSubscribe.Serialization.DistributedPubSubMessageSerializer, Akka.Cluster.Tools" = 9
            "Akka.Cluster.Tools.Client.Serialization.ClusterClientMessageSerializer, Akka.Cluster.Tools" = 15
        }

        debug{
            receive = on
            autoreceive = on
            lifecycle = on
            event-stream = on
            unhandled = on
        }
    }

    remote {
        dot-netty.tcp {
            transport-class = "Akka.Remote.Transport.DotNetty.TcpTransport,Akka.Remote"
            applied-adapters = []
            transport-protocol = tcp
            hostname = "127.0.0.1"
            port = 0
        }
    }

    cluster {
        seed-nodes = ["akka.tcp://notificationSystem@127.0.0.1:5054"]
        roles = [client]

        failure-detector {
            threshold = 15
            acceptable-heartbeat-pause = 30s
            heartbeat-interval = 5s
            expected-response-after = 10 s
        }

        use-dispatcher = cluster-dispatcher     

        run-coordinated-shutdown-when-down = on
    }
}

cluster-dispatcher {
    type = "Dispatcher"
    executor = "fork-join-executor"
    fork-join-executor {
        parallelism-min = 2
        parallelism-max = 4
    }
}
Aaronontheweb commented 6 years ago

Thanks for all of the detail @Blind-Striker - I'll look into this.

Blind-Striker commented 6 years ago

We found more detailed logs. The problem may be related to DotNetty; at least, we get the exceptions below before the things I mentioned above happen.

First, all the healthy nodes in the cluster log the following, indicating that there is an unreachable node in the cluster.

{ Source: "akka", LogSource: "[akka://notificationSystem/system/transports/akkaprotocolmanager.tcp.0/akkaProtocol-tcp%3A%2F%2FnotificationSystem%40172.31.27.81%3A56335-26#2004965724]", ThreadName: "akka.remote.default-remote-dispatcher_1", Message: "No response from remote. Handshake timed out or transport failure detector triggered.", Timestamp: 03/12/2018 08:02:23 }

I do not know what happens next, but the other nodes in the same role are also treated as unreachable, not just the node that is really unreachable.
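That "No response from remote. Handshake timed out or transport failure detector triggered." line comes from akka.remote's transport failure detector, which is separate from the cluster failure detector we tuned above. If the transport layer is also timing out on a slow network, its deadlines can be relaxed as well; a hedged example (values are illustrative, not recommendations):

```
akka.remote {
  # Deadline-based detector guarding each transport association
  transport-failure-detector {
    heartbeat-interval = 4s
    acceptable-heartbeat-pause = 20s
  }
}
```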

I am also adding a DotNetty log that I think might be relevant to these exceptions. The first unreachable node recorded the exception below.

Exception.Source : DotNetty.Transport
Exception.Stack :
   at DotNetty.Transport.Channels.Sockets.TcpSocketChannel.DoReadBytes(IByteBuffer byteBuf)
   at DotNetty.Transport.Channels.Sockets.AbstractSocketByteChannel.SocketByteChannelUnsafe.FinishRead(SocketChannelAsyncOperation operation)

Exception.Message : An established connection was aborted by the software in your host machine

{ Source: "akka", LogSource: "Akka.Remote.Transport.DotNetty.TcpServerHandler", ThreadName: null, Message: "Error caught channel [[::ffff:172.31.18.2]:3563->[::ffff:172.31.27.81]:54706](Id=37f94ea3)", Timestamp: 03/12/2018 08:02:58 }
Blind-Striker commented 6 years ago

@Aaronontheweb Apparently it was our fault. You can close this issue.

Danthar commented 6 years ago

After that epic opening post, please don't leave us in suspense :P What was the solution?

Blind-Striker commented 6 years ago

Well, actually I don't know. Our Web and API applications were joining the cluster with the client role. We moved the client role to a Nancy self-hosted application and dockerized it. We removed the Akka references from our Web and API applications; now they send messages via HTTP to the new self-hosted application. And everything is working fine. We haven't experienced a single problem for weeks, and our cluster now has more nodes than before.

I really don't know; theoretically, we didn't change anything. The architecture is the same, we just moved our client logic into the new self-hosted application and scaled it.

Maybe Akka does not work well with IIS and app pools. Even though we set our app pools to never recycle, by default IIS recycles app pools every 28 hours. And we sometimes deploy our applications more than five times a day. We experienced the problems I described above every day or two.
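For reference, scheduled recycling can be turned off per app pool; a fragment we believe corresponds to what "never recycle" means (the pool name here is hypothetical, and exact element paths depend on your IIS version):

```xml
<!-- applicationHost.config: disable periodic restart for one app pool -->
<applicationPools>
  <add name="NotificationClientPool">
    <recycling>
      <periodicRestart time="00:00:00" /> <!-- 00:00:00 = never -->
    </recycling>
  </add>
</applicationPools>
```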

We are still using Hyperion, even in our new self-hosted API. We have developed a Hyperion extension for Nancy:

https://github.com/armutcom/Nancy.Serialization.Hyperion

Sorry @Danthar, I don't know what the problem was, or whether it was our fault or not, but we couldn't wait for a solution when we couldn't even find the source of the problem.

My humble advice is to try other architectural alternatives before using Akka directly in IIS web applications. Right now everything is working perfectly; I wish I could help more :)

Aaronontheweb commented 6 years ago

@Blind-Striker thanks for the write-up - sorry to hear about this trouble with Akka.NET and IIS. Sounds like you went through a lot of trouble to try to get the two to play nicely. I'd like us to look into a way to make this experience better - hope you don't mind if we follow up with you with questions in the future!

Blind-Striker commented 6 years ago

@Aaronontheweb, of course, we are always happy to help.

Blind-Striker commented 6 years ago

@Aaronontheweb After we moved our client role to a self-hosted application, we adapted our notification client libraries to work over HTTP instead of actor Tell calls. Because the HTTP responses were not important to us, we designed this library in a fire-and-forget style, like this:

return Task.Run(() =>
{
    using (var stream = new MemoryStream())
    {
        // Serialize the message body with Hyperion
        var serializer = new Serializer(new SerializerOptions(
                        preserveObjectReferences: true,
                        versionTolerance: true,
                        ignoreISerializable: true));

        serializer.Serialize(body, stream);

        var content = stream.ToArray();
        var byteArrayContent = new ByteArrayContent(content);
        byteArrayContent.Headers.ContentType = new MediaTypeHeaderValue("application/x-hyperion");

        // Fire and forget: post from a background task, block until the
        // request completes, and ignore the response.
        var postAsync = _httpClient.PostAsync(path, byteArrayContent);

        postAsync.Wait();
    }
});

Then we removed the Akka references from our Web and API applications and replaced the old client library with the new one. We didn't experience any problems with the API, but our Web application started throwing strange async/await-related exceptions. After a little searching on Google, we learned that even if you update the .NET runtime framework to a newer version (in our case it was .NET 4.6.1, and we had updated it 3 years ago), you still need to update <httpRuntime targetFramework="" /> manually.
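The missing piece looks like this in web.config (the framework version here is ours; adjust it to your runtime):

```xml
<system.web>
  <compilation targetFramework="4.6.1" />
  <!-- Without this attribute ASP.NET keeps pre-4.5 "quirks" behavior,
       including the legacy synchronization context that breaks async/await -->
  <httpRuntime targetFramework="4.6.1" />
</system.web>
```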

Our Web app is a very old legacy application that we are about to get rid of. After a little research we learned that async/await only works reliably if the httpRuntime targetFramework is 4.5 or above. So we replaced the async calls with sync calls, and everything started to work stably.

We always knew the Web app was the problematic part of our Akka cluster, but we couldn't manage to pinpoint the exact problem. Anyway, it was too late to update the httpRuntime targetFramework and try Akka again, because we had already completed our transition to the new self-hosted approach, and within months we completely retired our legacy Web app. So we decided it wasn't worth the effort.

Perhaps the problem we had with Akka was the httpRuntime targetFramework issue, or at least one of the problems.

Maybe it can help you replicate our problem.

Anyway, we are always happy to help the Akka community. We already use Akka widely in our project, and apart from this problem our developer team is very happy with Akka :)