akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net
Other
4.7k stars 1.04k forks source link

Akka.Remote: implement "log frame size exceeding" #879

Closed pcwiek closed 9 years ago

pcwiek commented 9 years ago

Is there a way to log everything that's happening in an actor and, if possible, the actor system in general?

I'm using TopShelf for service hosting and Serilog (min level: debug) for logging.

Including:

akka {
    (...)
     actor {
          loglevel = DEBUG
          stdout-loglevel = DEBUG
          provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
          debug {
             receive = on
             autoreceive = on
             fsm = on
             lifecycle = on
             unhandled = on
             event-stream = on
             router-misconfiguration = on
          }
    loggers = ["Akka.Logger.Serilog.SerilogLogger,Akka.Logger.Serilog"]
    remote {
          log-remote-lifecycle-events = DEBUG
          (...)
     }
}

in the configuration doesn't change anything when you compare the log output without the debug section specified at all.

I'm asking because I'm trying to track a really bizarre behavior (one message, which has 4 collections in it, all of collection items have Date (DateTime) member, is going nowhere unless I set members' dates to default(DateTime) in at least one of the collections...?!) but it might be hard to reproduce. So I'd rather exhaust all my monitoring and debugging options first before spending half my day setting up a separate solution...

rogeralsing commented 9 years ago

That could be some odd json.net datetime issue you are seeing. That would explain why it works with default values as those are not emitted in json by default

Skickat från min iPhone

20 apr 2015 kl. 17:37 skrev Patryk Ćwiek notifications@github.com:

Is there a way to log everything that's happening in an actor and, if possible, the actor system in general?

I'm using TopShelf for service hosting and Serilog (min level: debug) for logging.

Including:

akka { (...) actor { loglevel = DEBUG stdout-loglevel = DEBUG provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster" debug { receive = on autoreceive = on fsm = on lifecycle = on unhandled = on event-stream = on router-misconfiguration = on } loggers = ["Akka.Logger.Serilog.SerilogLogger,Akka.Logger.Serilog"] remote { log-remote-lifecycle-events = DEBUG (...) } } in the configuration doesn't change anything when you compare the log output without the debug section specified at all.

I'm asking because I'm trying to track a really bizarre behavior (one message, which has 4 collections in it, all of collection items have Date (DateTime) member, is going nowhere unless I set members' dates to default(DateTime) in at least one of the collections...?!) but it might be hard to reproduce. So I'd rather exhaust all my monitoring and debugging options first before spending half my day setting up a separate solution...

— Reply to this email directly or view it on GitHub.

pcwiek commented 9 years ago

Could be, although if I 'manually' convert to JSON, everything seems to work fine.

Also, dates in each and every collection are converted just fine separately, but not when I'm trying to send all of them, hence my enormous WTF.

rogeralsing commented 9 years ago

Can you try this?:

            var serializer = actorSystem.Serialization.FindSerializerFor(myMessage);
            var binary = serializer.ToBinary(myMessage);
            var result= serializer.FromBinary(binary, typeof(MyMessageType));

And see what comes out of result if its a good copy of your message, we can rule out serialization.

If it's not a serializer issue, my best bet would be that the dates doesnt have anything to do with it except for message size and it could be a remoting issue with large messages (?)

But lets verify the serializer first

pcwiek commented 9 years ago

Sure, I'll try that tomorrow in the morning.

My first thought was some kind of message size limit too - especially considering that all parts of the message seem to work on their own, but when I put them together the message goes into void - but I don't know nearly enough about internals of Akka to tell whether my hunch was even close to being right.

pcwiek commented 9 years ago

@rogeralsing I have just tried it, serialization works fine - round trip serialization gives the exact copy.

[EDIT] Although it's a bit of XY problem, my original question still stands - how to enable full debugging and tracing in a system? :) It would make it that much easier to troubleshoot...

[EDIT2] OK, I found a way to log all messages, unfortunately the heartbeat and gossip messages clutter the log a little bit...

akka {
    loglevel = DEBUG
    remote {
         log-sent-messages = on
         log-received-messages = on
    }
}
pcwiek commented 9 years ago

@rogeralsing @Aaronontheweb I encountered the same thing in a different place in the application. Is there some kind of a size limit for remote messages? It seems that big messages get lost somewhere on their way... Either that or there's something weird going on with Ask.

For example, this time I have a details screen with entries in it. From the business process perspective, the number of entries can vary between ~500 and 20000. One entry is a string (~10 - 12 characters) and a boolean flag. The details screen has an extra description (string, up to 200 characters), one date (DateTimeOffset), user name (string, in practice up to ~20 characters), boolean flag, two ids (int).

When I take <= ~950 elements, I get the reply back to the Web application that Asks for the data. When I take >= ~1000 elements, originalSender.Tell in the remote actor sends the reply, but Ask in the web app is just hanging there.

I want to add that the first case when I had that happen (~4 - 5 collections, 240 elements each) is also when I Ask the local proxy actor from the Web app. It can be a coincidence, but I think it's worth mentioning. That 'proxy' actor forwards the message to all remote actors in the cluster with original sender which should get the reply, which is, for example, tcp.akka//my-system@127.0.0.1:2300/temp/f or similar.

Serialization seems to work just fine...

Is it possible that that Ask right at the very beginning is messing something up? Could that be somehow related to the other issue with Tell going missing?

pcwiek commented 9 years ago

@rogeralsing @Aaronontheweb Another update! Sorry guys for flooding you with all that...

Anyway, when the message is 'small' (as in, it passes through), the chatter between the actors looks like this:

'Small' message

  1. Web (Asking local proxy actor)

    2015-04-22 12:08:45.914 -04:00 [Debug] "RemoteMessage: ActorSelectionMessage - Message: Shared.General.DetailsRequest - WildCartFanOut: False - Elements: user/details to [[akka.tcp://my-system@127.0.0.1:21300/]]<+[akka.tcp://my-system@127.0.0.1:21300/] from [[akka://my-system/temp/g]]"
  2. Service (processing and Tell to sender, which would be temporary Ask actor)

    • Receive is acked

      2015-04-22 12:08:45.922 -04:00 [Debug] received local message ["RemoteMessage: ActorSelectionMessage - Message: Shared.General.DetailsRequest - WildCartFanOut: False - Elements: user/details to [akka://my-system/]<+akka://my-system/ from [akka.tcp://my-system@127.0.0.1:23000/temp/g]"]
    • Reply is sent

      2015-04-22 12:08:46.019 -04:00 [Debug] "RemoteMessage: Shared.General.Data.Details to [[akka.tcp://my-system@127.0.0.1:23000/temp/g]]<+[akka.tcp://my-system@127.0.0.1:23000/temp/g] from [[akka://my-system/user/details/get/$d]]"
  3. Web (ack when Ask returns)

    2015-04-22 12:08:46.071 -04:00 [Debug] received local message ["RemoteMessage: Shared.General.Data.Details to [akka://my-system/temp/g]<+akka://my-system/temp/g from [akka.tcp://my-system@127.0.0.1:21300/user/details/get/$d]"]

Everything just works.

On the other hand...

'Big' message

  1. Web (Asking local proxy actor) [Pretty much the same as above]

    2015-04-22 12:08:06.641 -04:00 [Debug] "RemoteMessage: ActorSelectionMessage - Message: Shared.General.DetailsRequest - WildCartFanOut: False - Elements: user/details to [[akka.tcp://my-system@127.0.0.1:21300/]]<+[akka.tcp://my-system@127.0.0.1:21300/] from [[akka://my-system/temp/e]]"
  2. Service (processing and Tell to sender) [Pretty much the same as above]

    • Receive is acked

      2015-04-22 12:08:06.867 -04:00 [Debug] received local message ["RemoteMessage: ActorSelectionMessage - Message: Shared.General.DetailsRequest - WildCartFanOut: False - Elements: user/details to [akka://my-system/]<+akka://my-system/ from [akka.tcp://my-system@127.0.0.1:23000/temp/e]"]
    • Reply is sent

      2015-04-22 12:08:08.790 -04:00 [Debug] "RemoteMessage: Shared.General.Data.Details to [[akka.tcp://my-system@127.0.0.1:23000/temp/e]]<+[akka.tcp://my-system@127.0.0.1:23000/temp/e] from [[akka://my-system/user/details/get/$b]]"

    So, far so good, right?

  3. Web (ack when Ask returns)

    Missing completely...

No idea what happens here. All preceding steps and operations are exactly the same.

Aaronontheweb commented 9 years ago

@pcwiek so the big message makes it over the wire, but the Ack never returns?

pcwiek commented 9 years ago

@Aaronontheweb Hard to say if it actually makes it over the wire, because there's no way of telling what happens in the transport layer. What I can say is that the processing actor actually sends the message, but it's never received on the Ask side and Ask is just left there hanging...

Aaronontheweb commented 9 years ago

@pcwiek so one thing worth bearing in mind is that the underlying Helios transport does have a maximum size threshold for messages it can send:

https://github.com/akkadotnet/akka.net/blob/dev/src/core/Akka.Remote/Transport/Helios/HeliosTransport.cs#L76

MaxFrameSize must be at least 32k. By default it's 128k https://github.com/akkadotnet/akka.net/blob/dev/src/core/Akka.Remote/Configuration/Remote.conf#L359

You can increase the size of this value to support whatever you need, but be aware that some network stacks (UDP) may not support arbitrarily large messages. We have some tests inside Helios that ramp up to 4mb and pass, so you have a lot of room to work with when using TCP: https://github.com/helios-io/helios/blob/dev/tests/Helios.MultiNodeTests/LargeMessageTests.cs#L105

pcwiek commented 9 years ago

@Aaronontheweb Thank you!

I upped the maximum-frame-size to 32 MB and now even the large analysis results are making it back. Perfect!

I just wish there was some kind of an error when that kind of situation happens, so that it could be logged - something simple like 'Remote message exceeds maximum frame size' or something :) Right now it's a completely silent failure.

Aaronontheweb commented 9 years ago

@pcwiek so there is supposed to be a setting for this: https://github.com/akkadotnet/akka.net/blob/dev/src/core/Akka.Remote/Configuration/Remote.conf#L142

But we don't respect or use it inside Akka.Remote right now - so I'm going to rename this issue and leave it open as a to-do.

Aaronontheweb commented 9 years ago

This might need to wait until Helios 2.0 ships, where we'll have some better error capture. Alternatively we can do this inside the HeliosTcpAssociationHandle.

pcwiek commented 9 years ago

@Aaronontheweb For now I would just add a paragraph about message sizes and potential pitfalls to 'How transparency is broken' section in 'Remoting' docs. I can actually do that and send a pull request if you'd like.

Aaronontheweb commented 9 years ago

@pcwiek that would be excellent - please send along a PR

Aaronontheweb commented 9 years ago

Note - this needs to be implemented here: https://github.com/akkadotnet/akka.net/blob/dev/src/core/Akka.Remote/Endpoint.cs#L1237

Aaronontheweb commented 9 years ago

And here's part of the Scala implementation for this:

https://github.com/akka/akka/blob/master/akka-remote/src/main/scala/akka/remote/Endpoint.scala#L745

https://github.com/akka/akka/blob/master/akka-remote/src/main/scala/akka/remote/Endpoint.scala#L670

cpx86 commented 9 years ago

This looks fun :) I'll have a look at it.