akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

Remoting issue in Azure Container Instances #3821

Open ddobric opened 5 years ago

ddobric commented 5 years ago

Hi all,

I have two systems, S1 and S2, implemented in .NET Core. S1 runs on Windows, and S2 runs in a Linux Docker container in ACI (Azure Container Instances). When everything runs locally (S1 on the Windows host, S2 in a local Linux Docker container), all works fine. When I move S2 to ACI, it starts repeating the following message:

[DEBUG][06/11/2019 13:30:55][Thread 0009][[akka://HtmCluster/system/transports/akkaprotocolmanager.tcp.0/akkaProtocol-tcp%3A%2F%2FHtmCluster%40%5B%3A%3Affff%3A10.240.255.56%5D%3A56701-2#443150340]] Association between local [tcp://HtmCluster@wk-caas-c297c6ed33e84959acbebf8240ec2752-c1e979fb02d0e830c9dd14:8081] and remote [tcp://HtmCluster@[::ffff:10.240.255.56]:56701] was disassociated because the ProtocolStateActor failed: Unknown

It seems that there is a communication issue between the Akka system and my actor.

Interestingly, I posted another issue, #3820, which uses the cluster provider instead of remoting. That one binds its listeners correctly in ACI. (It has other issues, but those are out of scope here.)

Here is the full log. Note that the last two lines keep repeating:

[DEBUG][06/11/2019 13:30:40][Thread 0001][EventStream(HtmCluster)] Logger log1-DefaultLogger [DefaultLogger] started
[DEBUG][06/11/2019 13:30:40][Thread 0001][EventStream(HtmCluster)] StandardOutLogger being removed
[DEBUG][06/11/2019 13:30:40][Thread 0001][EventStream(HtmCluster)] Default Loggers started
[INFO][06/11/2019 13:30:40][Thread 0001][remoting] Starting remoting
[DEBUG][06/11/2019 13:30:40][Thread 0009][remoting] Starting prune timer for endpoint manager...
[INFO][06/11/2019 13:30:41][Thread 0001][remoting] Remoting started; listening on addresses : [akka.tcp://HtmCluster@nodeX.westeurope.azurecontainer.io:8081]
[INFO][06/11/2019 13:30:41][Thread 0001][remoting] Remoting now listens on addresses: [akka.tcp://HtmCluster@nodeX.westeurope.azurecontainer.io:8081]
[INFO][06/11/2019 13:30:41][Thread 0001][ActorSystem(HtmCluster)]   akka : {
    debug : {
      receive : on
      autoreceive : on
      lifecycle : on
      event-stream : on
      unhandled : on
    }
    loglevel : DEBUG
    actor : {
      provider : "Akka.Remote.RemoteActorRefProvider, Akka.Remote"
    }
    log-config-on-start : on
    remote : {
      connection-timeout : "120 s"
      transport-failure-detector : {
        heartbeat-interval : 1000s
        acceptable-heartbeat-pause : 6000s
      }
      dot-netty : {
        tcp : {
          maximum-frame-size : 32000000b
          dns-use-ipv6 : false
          port : 8081
          public-hostname : nodeX.westeurope.azurecontainer.io
          hostname : wk-caas-c297c6ed33e84959acbebf8240ec2752-c1e979fb02d0e830c9dd14
        }
      }
    }
  }

Press any key to stop AKKA.NET Node
[DEBUG][06/11/2019 13:30:50][Thread 0008][[akka://HtmCluster/system/transports/akkaprotocolmanager.tcp.0/akkaProtocol-tcp%3A%2F%2FHtmCluster%40%5B%3A%3Affff%3A10.240.255.56%5D%3A56638-1#1261179201]] Association between local [tcp://HtmCluster@wk-caas-c297c6ed33e84959acbebf8240ec2752-c1e979fb02d0e830c9dd14:8081] and remote [tcp://HtmCluster@[::ffff:10.240.255.56]:56638] was disassociated because the ProtocolStateActor failed: Unknown
[DEBUG][06/11/2019 13:30:55][Thread 0009][[akka://HtmCluster/system/transports/akkaprotocolmanager.tcp.0/akkaProtocol-tcp%3A%2F%2FHtmCluster%40%5B%3A%3Affff%3A10.240.255.56%5D%3A56701-2#443150340]] Association between local [tcp://HtmCluster@wk-caas-c297c6ed33e84959acbebf8240ec2752-c1e979fb02d0e830c9dd14:8081] and remote [tcp://HtmCluster@[::ffff:10.240.255.56]:56701] was disassociated because the ProtocolStateActor failed: Unknown
...

Akka Remote version

"Akka.Remote" Version="1.3.13"

.NET Core host

netcoreapp2.2

@snekbaev

ddobric commented 5 years ago

Any hint?

Horusiath commented 5 years ago

@ddobric without code it's hard to tell, but in most cases disassociation is a result of:

  1. Message serialization failure - when a message sent from one node to another can't be deserialized, because the other node either doesn't have the necessary serializer loaded, or the message type is unknown on the recipient side (e.g. because the assembly it is defined in is not referenced there).
  2. Other assembly-related failure - like trying to remotely deploy an actor to another node that doesn't know that actor type (again, a missing assembly).
  3. The node on the other side dies.
  4. The node is periodically unavailable due to a temporary network disconnection - this usually heals itself. This can also occur as a result of debugging, which stops thread execution and prevents the node from sending heartbeats in a timely manner.
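
To help tell causes 1 and 2 apart from the network-related causes 3 and 4, one option is to turn on Akka.Remote's message logging on both nodes. The fragment below is only a minimal sketch: it assumes the `log-sent-messages` and `log-received-messages` switches from the Akka.Remote reference configuration are available in the version in use, and it would be merged into the existing `akka` section rather than replace it.

```hocon
akka {
  # Message logging is emitted at DEBUG level, so the log level must allow it.
  loglevel : DEBUG
  remote {
    # Assumption: both switches exist in the Akka.Remote version in use.
    # Log every outbound remote message on this node.
    log-sent-messages : on
    # Log every inbound remote message on this node.
    log-received-messages : on
  }
}
```

With both switches on, a message that is logged as sent on one node but never logged as received on the other would suggest looking at serialization or missing assemblies first; if nothing is logged on either side before the disassociation, a network or node-availability problem is the more likely candidate.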

ddobric commented 5 years ago

Thanks for your feedback. The issue is related to 3 and 4. Sometimes some remoting participants (S1 in this case) just go offline and come back; this is part of the scenario. The problem is that after S1 starts again, association sometimes works well and sometimes doesn't. To me this seems to be either a bug or some configuration issue.