dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans

Console.WriteLine in GrainClient spams console output #1226

Closed: PaulNorth closed this issue 7 years ago

PaulNorth commented 8 years ago

Please consider changing the uses of Console.WriteLine in GrainClient to logger output, specifically (in our case) Console.WriteLine("Initialization failed. {0}", exc); at line 260 in https://github.com/dotnet/orleans/blob/master/src/Orleans/Core/GrainClient.cs.

We are using a Polly Policy to retry the client connection to an Orleans silo that may or may not have started by the time the client starts up. Retrying the connection every second means the console is spammed with this output; using a logger would give us better control over it.
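
For context, our startup code is roughly this shape (the retry count and delay below are illustrative, and we handle the broad Exception type because the failure surfaces as a general initialization exception at this level):

```csharp
using System;
using Orleans;
using Polly;

// Illustrative only: retry count and delay are placeholders.
var connectPolicy = Policy
    .Handle<Exception>()
    .WaitAndRetry(30, attempt => TimeSpan.FromSeconds(1));

// Each failed attempt currently writes the full exception to the console.
connectPolicy.Execute(() => GrainClient.Initialize());
```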

jthelin commented 8 years ago

One of the hard choices in GrainClient is deciding when things are initialized far enough to allow use of the logging framework for progress and error reporting, and when to fall back to Console.WriteLine so that some indication of initialization errors still gets surfaced for diagnosis.

The logger config lives in the client config file / ClientConfiguration object model, so the early part of the Initialize flow defensively uses Console to report problems. In fact, it looks as though the GrainClient class always makes that assumption and always uses Console, even though the Logger may already have been initialized by the underlying layers by the time we get to line 260: https://github.com/dotnet/orleans/blob/master/src/Orleans/Core/GrainClient.cs#L260

@PaulNorth - I guess you are using this via the Azure client, which provides a retry wrapper around this GrainClient.Initialize method? It would be nice to see the full stack trace of the exception reported on Console, to understand the interdependencies between GrainClient and the outer-layer wrappers.

I think it should be possible to check whether Logger.IsInitialised is already set and use the logger instead of Console in several places in the GrainClient class [at the cost of some minor code duplication, but probably worth it for this scenario].
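
i.e. something like this, duplicated at each call site (sketch only; it assumes a logger instance and an IsInitialised-style flag are reachable at that point, and uses 0 as a placeholder error code):

```csharp
// Sketch only: prefer the logger once it is up, fall back to Console
// otherwise. "logger" and Logger.IsInitialised are assumed names.
private static void ReportInitError(string message, Exception exc)
{
    if (logger != null && Logger.IsInitialised)
        logger.Error(0, message, exc);   // 0 = placeholder error code
    else
        Console.WriteLine("{0} {1}", message, exc);
}
```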

PRs are always welcome :)

PaulNorth commented 8 years ago

Hi @jthelin, thanks for your response, which makes perfect sense. Regarding my initial query, sorry for the lack of detail; I probably cut a long story too short!

But as you asked... We are writing a microservice solution, to be installed on a Windows Server platform, which uses Consul (https://consul.io/) for service discovery and health checking. It occurred to us that Consul could also be used to hold Orleans Silo membership information and to that end we have written our own Orleans Consul Membership Provider based on the existing ZooKeeper implementation. (If there is interest we can create a PR for this once we have proven it works)
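
In skeleton form the provider looks something like the following; the member list mirrors the IMembershipTable interface as implemented by the ZooKeeper provider (treat the exact signatures as approximate), with bodies stubbed here. In our real code each method maps onto Consul's HTTP KV API.

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Runtime;
using Orleans.Runtime.Configuration;

// Skeleton only; see the ZooKeeper provider for the exact interface shapes.
public class ConsulBasedMembershipTable : IMembershipTable
{
    public Task InitializeMembershipTable(GlobalConfiguration config,
        bool tryInitTableVersion, TraceLogger logger)
    { /* create the Consul client and the KV prefix for this deployment */ return TaskDone.Done; }

    public Task<MembershipTableData> ReadRow(SiloAddress key)
    { /* GET the KV entry for one silo */ throw new System.NotImplementedException(); }

    public Task<MembershipTableData> ReadAll()
    { /* GET all KV entries under the deployment prefix */ throw new System.NotImplementedException(); }

    public Task<bool> InsertRow(MembershipEntry entry, TableVersion tableVersion)
    { /* check-and-set PUT; return false on a version conflict */ throw new System.NotImplementedException(); }

    public Task<bool> UpdateRow(MembershipEntry entry, string etag, TableVersion tableVersion)
    { /* check-and-set PUT against the etag */ throw new System.NotImplementedException(); }

    public Task UpdateIAmAlive(MembershipEntry entry)
    { /* refresh the IAmAliveTime field on the silo's entry */ throw new System.NotImplementedException(); }

    public Task DeleteMembershipTableEntries(string deploymentId)
    { /* delete the whole KV prefix for the deployment */ throw new System.NotImplementedException(); }
}
```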

As we are hosting Consul and the Orleans silo and client(s) each in a different microservice, it is expected that the client will sometimes start after Consul is running but before the silo has started, and we don't want that to be a critical failure scenario. So we are employing Polly (https://github.com/App-vNext/Polly) to retry GrainClient.Initialize when an attempt to contact the Consul membership provider fails. This works well, but every time the client tries to connect and fails we see the following output from Orleans... (Stack Trace: SiloNotAvailableStackTrace.txt)

I will try to implement your proposed solution and see if it helps reduce the noise; let me know if there is any interest in creating a PR for Consul support.

Paul

sergeybykov commented 8 years ago

@PaulNorth

Consul could also be used to hold Orleans Silo membership information and to that end we have written our own Orleans Consul Membership Provider based on the existing ZooKeeper implementation. (If there is interest we can create a PR for this once we have proven it works)

Great! Is this code on GitHub?

Re: your original question. I'm thinking maybe it's worth adding an explicit flag in client config or somewhere to suppress output to Console, so that by default the behavior will stay as is, but there'll be a way to override it in cases like yours.
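
Something along these lines, i.e. a guard around the existing call (the flag is hypothetical; name off the top of my head):

```csharp
// Hypothetical flag (doesn't exist today), defaulting to false so the
// current Console fallback is unchanged unless explicitly overridden.
if (!config.SuppressConsoleOutput)
{
    Console.WriteLine("Initialization failed. {0}", exc);
}
```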

gabikliot commented 8 years ago

let me know if there is any interest in creating a PR for Consul support.

Absolutely!

EDIT: I would also like to understand how this compares/relates to Consul's built-in membership protocol, which is based on Serf (https://www.serfdom.io/; see https://consul.io/docs/internals/gossip.html).

PaulNorth commented 8 years ago

@sergeybykov I've not pushed it to my fork yet, but I can do that tomorrow (UK time) and update you then. I can also try to find time to look into a SuppressConsoleOutput flag if you think it would add value.

@gabikliot We also considered that Serf could be employed as a more robust alternative to the in-memory GrainMembershipProvider, but as we already had a need for Consul we haven't pursued it. We were also waiting to see where the experimental geolocation project was going, as it seems to share some features with Consul.

jthelin commented 8 years ago

maybe it's worth adding an explicit flag in client config or somewhere to suppress output to Console

I really don't see the point in adding this complexity. During startup/initialization, once the Logger subsystem is initialized, it should be used. But before that point [which of course should "never happen", but in the real world occasionally will], providing some diagnostics is always better than silence.

Having to redeploy a service to get startup-failure diagnostics rather defeats the point, because redeploying automatically destroys the evidence from the original startup attempt.

The problem @PaulNorth is describing here is an "expected" transient race condition that should not immediately trigger Operations alerts, because it is most likely to just "eventually" clean itself up.

In the general case, watchdog systems and retry libraries like Polly need to assume the worst case, especially if Orleans does not provide enough specific information to let them distinguish between "silo not yet started" and "silo dead".

Even just having a clearly identifiable error code number [via Logger] to write an alert-filter predicate against would be an improvement over the current situation.

The client logger subsystem should be initialized by the time the client makes its first silo-connect attempt, so the "error" could be marked with a specific error code to identify it, but currently GrainClient doesn't make that distinction.
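
For example (the code number below is made up; what matters is that it is stable and documented):

```csharp
// Made-up error code: the value just needs to be stable and documented so
// an alert filter can match it and treat the condition as transient.
const int ErrorCode_SiloNotYetAvailable = 100999;
logger.Warn(ErrorCode_SiloNotYetAvailable,
    "Could not connect to a silo; it may not have started yet: {0}", exc);
```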

gabikliot commented 8 years ago

@PaulNorth, I would love to see your integration code with Consul, and also ideally have a document/page explaining conceptually how you use it with Orleans, what the options were (Serf vs. Orleans membership), and whether you have any thoughts on geo-distribution with Consul.

Re logging: we can do it either @jthelin's or @sergeybykov's way; I don't have a strong opinion either way. One way or the other (cleaner or hackier), we should be able to solve that small nuance for you.

PaulNorth commented 8 years ago

@gabikliot, I have created a PR (https://github.com/dotnet/orleans/pull/1267) for the Consul Membership Provider code, and the CLA is with my employer for approval. I'm not sure how to contribute to the Orleans documentation, so I will add usage details and try to answer your questions in the PR conversation, as that avoids taking this issue further off-topic.

I have also investigated the logging vs. Console.WriteLine issue and, as suggested, it is non-trivial...

As you will know, the client connection error occurs in the ctor of OutsideRuntimeClient when it attempts to open the transport connection. This error is already logged in detail to the application TraceLogger just before the OutsideRuntimeClient.ConstructorReset() method is called, which disposes that TraceLogger. The try..catch in GrainClient.Initialize() then writes the exception to the Console again, unnecessarily(?), because the TraceLogger is now gone. So the issue is that the lifetimes of the OutsideRuntimeClient and the TraceLogger are tied together.

There are many ways I can think of to resolve this, including application-specific exceptions or separating the construction and initialization routines of OutsideRuntimeClient; however, I feel the path of least disruption would be to move just the logging initialization out of the OutsideRuntimeClient ctor and into GrainClient.Initialize(). The configured logger could then be passed into OutsideRuntimeClient and used in GrainClient even if a network issue causes the OutsideRuntimeClient ctor to roll back. The only case that would then result in Console logging would be an invalid logging configuration.
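
In sketch form, the shape I have in mind (names approximate, not the actual code):

```csharp
// Sketch of the proposal: the logger is created before, and independently
// of, OutsideRuntimeClient, so it survives a rolled-back constructor.
public static void Initialize(ClientConfiguration config)
{
    TraceLogger.Initialize(config);                      // logging setup moved out of the ctor
    var logger = TraceLogger.GetLogger("GrainClient");
    try
    {
        outsideRuntimeClient = new OutsideRuntimeClient(config, logger);
    }
    catch (Exception exc)
    {
        logger.Error(0, "Initialization failed.", exc);  // survives ConstructorReset()
        throw;
    }
}
```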

sergeybykov commented 8 years ago

@PaulNorth Thank you for the PR! We'll review it.

Re: how to contribute to the docs. The docs live in the gh-pages branch (https://github.com/dotnet/orleans/tree/gh-pages); simply submit a PR with doc changes against that branch.