Closed: IanPattersonMuo closed this issue 6 years ago
I have simplified this to calling a remote actor and the problem persists, which means that it is probably not an issue with either Clustering or Sharding, and is a more general problem of networking on Linux.
I've been able to reproduce this using the sample above, so the original sample fails on my machine too.
However, when I ported WebCrawler and ran it in a pure Linux environment this week, I had no issues at all. That makes me wonder whether the problem is actually related to the combination of the .NET Core version and Linux.
@IanPattersonMuo so if I look at the Dockerfiles I'm using for WebCrawler, I'm using an older version of the .NET Core 2.0 runtime: https://github.com/petabridge/akkadotnet-code-samples/blob/master/Cluster.WebCrawler/src/WebCrawler.Web/Dockerfile
2.0 vs 2.0.5.
But I don't think that's the issue. I'm wondering though if the issue might stem from the way the Docker images are built in your solution. Judging from your build.sh, it looks like the images are dotnet publish-ed locally and then copied into an image. Mine are built through the Docker build pipeline: my project is restored and compiled inside the same runtime that the Docker image itself will be executing at runtime. I wonder if there's a difference between your build-time environment and your run-time environment that could cause this, such as one of the .NET Standard libraries having a slightly different socket IO implementation depending on the host OS.
What do you think - is that worth trying to rule out as a possibility?
I have tested it natively on multiple Linux hosts, where it is built and run on each host. This failed as well, so I don't think it has to do with the way it is built in Docker. I'll try an earlier version of the runtime and see if that makes a difference.
@IanPattersonMuo I take that back - the versioning is a bit misleading around these images.
microsoft/aspnetcore:2.0 - this actually is an image that is updated for each minor revision. Meaning that the version itself is mutable. It's not 2.0.0, but rather 2.0.8 or whatever the latest of the 2.0.* branch is. So I'm actually using a newer version of the image.
I have checked it with the 2.1.0 preview images as well and got the same error. I created a branch in the sample code which just uses akka.remoting rather than clustering and sharding, and it fails with the same error. I also created a branch that uses Windows containers, and it works as expected.
@Aaronontheweb I checked the issue building the code within the container itself and it fails with the same error.
@IanPattersonMuo I'm going to bring this up with the DotNetty folks - sure looks like a message framing inconsistency between platforms. Weird part is that I can't recreate this error using WebCrawler on Akka.Remote / Akka.Cluster in .NET Core on Linux, but I can with your application...
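For readers unfamiliar with the term, "message framing" refers to how a receiver splits a continuous TCP byte stream back into discrete messages. The Python sketch below illustrates the general length-prefix technique only. It is not DotNetty's or Akka.NET's actual code, and the 4-byte big-endian header as well as the names `encode_frame` and `FrameDecoder` are assumptions made for illustration.

```python
import struct

# Assumption for illustration: each frame is a 4-byte big-endian length
# followed by that many payload bytes.
HEADER = struct.Struct(">I")


def encode_frame(payload: bytes) -> bytes:
    """Prefix the payload with its length so the receiver can find its end."""
    return HEADER.pack(len(payload)) + payload


class FrameDecoder:
    """Reassembles frames from arbitrarily sized chunks of a byte stream."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add a chunk and return every complete frame now available."""
        self._buffer.extend(chunk)
        frames = []
        while len(self._buffer) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buffer, 0)
            end = HEADER.size + length
            if len(self._buffer) < end:
                break  # payload not fully received yet; wait for more bytes
            frames.append(bytes(self._buffer[HEADER.size:end]))
            del self._buffer[:end]
        return frames
```

If the sender and receiver ever disagree about where a frame starts (for example, because bytes are dropped, duplicated, or the header is read at the wrong offset), the decoder interprets payload bytes as a length and every later message is corrupted. A decode error that only shows up under high message volume would be consistent with that kind of fault.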
Akka.Net 1.3.5
We get an "Error while decoding incoming Akka PDU" exception when sending a large number of messages from a Shard Entity to another actor in a different process in the same cluster. It manifests when deployed on multiple Linux machines or whilst running in Docker (Linux). The error is seen on the client side of the communication and forces the process to Disassociate from the cluster. The full stack trace from the client process is as follows:
The server also shows a disassociation error
I have created a sample application that replicates the issue.
https://github.com/muo-ltd/Akka-NetCore-DockerClusterWithShards
It creates a cluster with two processes: one acts as a client and the other as a server. The server has a Sharded Entity, which the client sends a message to requesting information. The Sharded Entity then streams a large number of messages back to the client to consume. The messages returned are simple: the payload is a string of 1000 characters, all 1's, and 10000 messages are generated and returned. Using a lower number of messages (e.g. 1000) does not show this error, so it does appear to relate to volume. I have tested it using both the default JSON serialiser and Hyperion. The only difference between the two is that the Hyperion error is slightly different. It contains the error
Full stack trace below
While running this locally on Mac, Windows or Linux it appears to work correctly. If it is deployed to Docker it will fail and if deployed across multiple Linux hosts it will also fail. Deploying to multiple Windows hosts appears to work correctly.
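To put the volume from the reproduction description in perspective, here is a rough Python sketch of the raw payload size involved. It counts only the string payloads and ignores serializer and transport overhead, which the report does not quantify. The names `make_payload` and `total_payload_bytes` are hypothetical helpers, not part of the sample application.

```python
PAYLOAD_CHARS = 1000       # from the report: 1000 characters, all 1's
FAILING_COUNT = 10_000     # message count that triggers the error
PASSING_COUNT = 1_000      # message count that did not show the error


def make_payload(chars: int = PAYLOAD_CHARS) -> str:
    """Build a payload like the one described: a string of '1' characters."""
    return "1" * chars


def total_payload_bytes(count: int, chars: int = PAYLOAD_CHARS) -> int:
    """Raw UTF-8 payload bytes for `count` messages, excluding all overhead."""
    return count * len(make_payload(chars).encode("utf-8"))


if __name__ == "__main__":
    print("failing run:", total_payload_bytes(FAILING_COUNT), "bytes")
    print("passing run:", total_payload_bytes(PASSING_COUNT), "bytes")
```

The failing case therefore pushes at least 10,000,000 bytes of payload through the connection in one burst, compared with 1,000,000 bytes in the case that succeeds.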