Azure / diagnostics-eventflow

Microsoft Diagnostics EventFlow
MIT License

EventFlow vs. Semantic Logging #24

Closed cwe1ss closed 7 years ago

cwe1ss commented 7 years ago

Hi, since Semantic Logging is also from Microsoft, I'm wondering what your guidance is on when we should use which.

It seems like the main difference is that Semantic Logging can be used out-of-process; however, there hasn't been much activity lately, and the new .NET 4.6 features (rich payload, ...) are not yet supported.
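
For context, the "rich payload" feature referred to here is the self-describing EventSource format introduced with .NET 4.6, which lets an event carry structured data through EventSource.Write. A minimal sketch, assuming only the standard System.Diagnostics.Tracing API; the source name and event shape are illustrative:

```csharp
// Illustrative .NET 4.6 "rich payload" EventSource: self-describing events
// can carry structured data without a static ETW manifest.
using System.Diagnostics.Tracing;

[EventSource(Name = "MyCompany-RichPayloadDemo")] // illustrative name
sealed class DemoEventSource : EventSource
{
    public static readonly DemoEventSource Log =
        new DemoEventSource(EventSourceSettings.EtwSelfDescribingEventFormat);

    private DemoEventSource(EventSourceSettings settings) : base(settings) { }

    // Write() emits a self-describing event whose payload is the anonymous
    // object's properties -- the "rich payload" Semantic Logging did not consume.
    public void RequestCompleted(string url, int statusCode, double durationMs) =>
        Write("RequestCompleted", new { url, statusCode, durationMs });
}
```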

karolz-ms commented 7 years ago

Christian, as of today there are no plans to do any major development in the Semantic Logging codebase. On the other hand, EventFlow is under active development. The guidance on which one to choose can be summarized as follows:

  1. If you are happy with what Semantic Logging offers today, there is no need to abandon it. We are committed to fixing any code defects that are reported.
  2. For new applications we recommend taking a look at what EventFlow offers and combining it with your logging API of choice. EventFlow offers a lot of flexibility in terms of where the data comes from, how it is processed, and where it can be sent (see the sketch after this list).
  3. In-process application trace gathering may be sufficient depending on your needs, but for best coverage we recommend combining in-process trace gathering with out-of-process trace gathering. Our current out-of-process mechanism of choice for Azure applications, including Fabric applications, is Azure Diagnostics (https://docs.microsoft.com/en-us/azure/azure-diagnostics). On premises, if you are an OMS customer, you can use the OMS Agent (https://docs.microsoft.com/en-us/azure/log-analytics/log-analytics-windows-agents). We are working on improving and/or combining the two, but it is too early to comment on specifics. As for EventFlow, we currently have no plans to add an out-of-process facility to it. You can, of course, also use a non-Microsoft agent such as FluentD.
  4. Also please keep in mind that on Azure we are continuously improving the experience of working with diagnostics data from Azure resource providers (VMs, VM scale sets, Azure networking etc.) (we use the term "Azure Monitoring" to describe this experience). Our long-term design goal is to combine and leverage both application-specific data and RP-specific data to give you an easy to use and powerful experience for monitoring and diagnosing applications running on Azure.
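
To ground point 2: a minimal EventFlow host, following the pattern in this repo's README, looks roughly like the sketch below. The eventFlowConfig.json file name is the documented convention, and its "inputs"/"outputs" sections determine where data comes from and where it goes; the Trace call assumes a "Trace" input is declared in that file.

```csharp
// Minimal EventFlow host, per the pattern in this repo's README.
// Assumes the Microsoft.Diagnostics.EventFlow NuGet packages and an
// eventFlowConfig.json declaring inputs (e.g. "Trace", "EventSource")
// and outputs (e.g. "StdOutput", "ApplicationInsights").
using System;
using Microsoft.Diagnostics.EventFlow;

class Program
{
    static void Main()
    {
        // CreatePipeline wires up the inputs, filters, and outputs
        // described in the JSON configuration file.
        using (var pipeline = DiagnosticPipelineFactory.CreatePipeline("eventFlowConfig.json"))
        {
            System.Diagnostics.Trace.TraceInformation("Hello from EventFlow");
            Console.ReadKey(intercept: true); // keep the process alive so the event is shipped
        }
    }
}
```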

Hope this helps!

qubitron commented 7 years ago

@cwe1ss thanks for your question -- I was wondering: is out-of-process collection an important feature to you, and if so, why?

cwe1ss commented 7 years ago

Thank you very much for the great and quick response!!

@karolz-ms : The OMS agent doesn't support ETW events, does it?!

@qubitron:

We are pretty much all-in on Azure with our upcoming project and we use Service Fabric, OMS, Azure Diagnostics, etc. I'm also experimenting with Application Insights. However, I'm not really happy with what we have so far. Actually, I'm on a pretty long journey to find a good logging solution for our system.

My perfect scenario looks like this:

In-process logging has advantages as well, but it also brings the burden of configuration, resilience, and buffering into each application. I don't want to redeploy/restart 50 applications when my Application Insights ID, etc., changes.
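
One partial mitigation, sketched here purely as an illustration: the classic Application Insights SDK exposes TelemetryConfiguration.Active, whose instrumentation key can be changed at runtime, so a service could poll a central configuration source instead of baking the key into each deployment. The polling endpoint below is hypothetical.

```csharp
// Hedged sketch: refresh the Application Insights instrumentation key at
// runtime so a key change does not force a redeploy/restart.
// TelemetryConfiguration is the real (classic) AI SDK type; the central
// config endpoint is hypothetical.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.ApplicationInsights.Extensibility;

static class InstrumentationKeyRefresher
{
    static readonly HttpClient http = new HttpClient();

    public static async Task RunAsync()
    {
        while (true)
        {
            // Hypothetical endpoint returning the current iKey for this service.
            string key = await http.GetStringAsync("https://config.example.com/my-service/ikey");
            TelemetryConfiguration.Active.InstrumentationKey = key.Trim();
            await Task.Delay(TimeSpan.FromMinutes(5)); // re-check periodically
        }
    }
}
```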

I'm happy to discuss this further should you be interested.

PS: I'm also working on the .NET implementation for opentracing.io - it's somewhat related to this topic as well!

karolz-ms commented 7 years ago

@cwe1ss, indeed, looks like OMS agent does not support ETW events :-(

Having said that, thank you very much for the comprehensive reply and for sharing your logging journey with us. It is very helpful. We have been on our version of the journey here for a long time too. I am glad to say that a lot of what you have discovered matches very well with our thinking, specifically:

  1. Desire for the application code to just "log the information" and not be involved in any kind of filtering/classification/sending the data out of nodes.
  2. Desire to avoid cluster-wide deployments, especially through ARM, for diagnostic configuration.
  3. Ability to reconfigure diagnostics in production without stopping the service.

But I noticed one key difference between what you said and what we were targeting with EventFlow and other work here at Microsoft. We assume that applications running on a Fabric cluster are largely independently configured, and that includes diagnostics too. In other words, we are aiming for a microservice-style approach.

For example, if you have 50 applications, we think you will also have close to 50 distinct ApplicationInsights instrumentation keys.

This makes a cluster-wide configuration for diagnostics very unappealing--it couples the services in such a way that when you want to change something about diagnostics for a particular service, you are forced to change a cluster-wide configuration piece. How it is done does not matter that much--even if the process is convenient and fast, it still breaks the isolation of services. In fact, this necessity for cluster-wide configuration changes with the WAD agent was the #1 adoption blocker for Fabric for some teams here at Microsoft, and that (and a few other things) prompted the development of EventFlow in the first place.

So, bottom line, we thought that having diagnostic configuration be part of service configuration is actually a desirable thing.
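
In EventFlow terms, that principle means each service ships its own eventFlowConfig.json. A representative fragment, using the configuration format documented in this repo's README, with a placeholder provider name and instrumentation key:

```json
{
  "inputs": [
    { "type": "EventSource", "sources": [ { "providerName": "MyCompany-MyService" } ] }
  ],
  "outputs": [
    {
      "type": "ApplicationInsights",
      "instrumentationKey": "00000000-0000-0000-0000-000000000000"
    }
  ],
  "schemaVersion": "2016-08-11"
}
```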

I do agree with the fact that in-process logging brings the problems of resilience, buffering, having multiple outbound connections from all cluster nodes, etc. EventFlow is designed specifically to mitigate some of these problems. We are also discussing the possibility of taking the best of both in-process and out-of-process approaches, so that the process of sending data from the node is factored out into a separate "log forwarder" service. That would reduce the number of outbound connections and the buffering/retry/network handling logic. We do not have anything specific to share yet. This approach is a performance/resilience optimization, though; it would not fundamentally change the principle of "diagnostic configuration is service configuration".

Please do share more thoughts, and thanks again. Karol

cwe1ss commented 7 years ago

I'm happy to hear that we are on the same journey! :)

I see the following issues with what you've described regarding separate configurations:

The current Application Insights pricing just doesn't allow you to have separate instances for each service in a microservices architecture. The GA pricing is per instance AND per node. With 50 services on 10 servers, you would have to pay for 500 instances: currently $15 * 500 = $7,500 per month. And this doesn't yet include data volume. This is too expensive - at least for us.

Application Insights accounts are too isolated. You need the "big picture" and reports that cover all accounts. Azure Application Insights Analytics would be nice for this, but it currently doesn't work across accounts. This means you must use OMS as well, together with the "Connector for OMS Log Analytics". This also means you have to pay twice for your data volume.

Even if a system consists of many small parts, logging/tracing should usually go into one central system. I don't think it's common that companies will log half of their services in Application Insights and the other half in e.g. New Relic. Requiring different instrumentation keys is just a necessary implementation detail of Application Insights.

Having to create a separate Application Insights account for each service is too much ceremony. In a microservices world, services come and go. Ideally, the logging system (e.g. Application Insights, Log Analytics) would just see a new process name and automatically detect this as a new service.

There are also different environments, so you would quickly end up with hundreds of Application Insights accounts.

Most services share common reporting requirements. Environments typically consist of web-based frontend services, request-based backend services (web, queueing, ...), and background workers. I don't want to configure every Application Insights account with my required set of customizations.

So, after having told my "perfect solution for log creation", here's my "perfect solution for log processing/reporting" on Azure:

Log Analytics has come a long way lately, especially with the new custom dashboards feature. Making this the one true logging solution that combines infrastructure and application data would be awesome.

Obviously, this is more of a dream than a request... but one may dream! :smile:

karolz-ms commented 7 years ago

Christian, thanks for a comprehensive reply and many good thoughts.

I agree that the cost of AI (or, in general, the diagnostic backend you use) is a big concern. As you pointed out, for AI specifically, having a separate instrumentation key per microservice is not cost effective, nor desirable from the perspective of correlating data across multiple iKeys (which is hard). This is not what I meant to suggest though :-) -- I had an instrumentation key per application in mind, i.e. a set of related services. Sorry if that did not come across clearly. I also agree that even a per-application iKey might be too granular.

That said, there is also a performance aspect that points in the opposite direction, towards partitioning your diagnostic data. Even with the AI Enterprise plan, the limits on how much data you can send (see the docs) are not very high. They may be lifted in the future, but whatever they are, at some scale point you might hit the limit and be forced to partition. This thinking applies to whatever backend you use. Depending on the size of your VM scale set or Service Fabric cluster, the mix of applications running on it, and the capabilities/cost of your diagnostic backend(s), there is an optimal degree of reuse. It may typically be high, but I don't think we can assume it is a single backend per cluster.

We also need to think about the full set of scenarios that we need to cover for microservices on Azure. In the case of Service Fabric in particular, multiple instances of the same application, and multiple versions of the same application, may run side-by-side on the same cluster. In fact, some of our Fabric users employ exactly the technique of instantiating the same application multiple times to ensure a high degree of separation between application instances tied to their large-scale customers. A common requirement in this case is to send a portion of application logs to whatever the customer's diagnostic store might be. Obviously you do not want to mix the logs for these different customers, who might be competitors! So every instance has its own backend (plus the backend that is used by the application provider for their own purposes). Note that this also means the process name is not sufficient to distinguish between service processes running on behalf of different customers.
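
As a hedged sketch of what such a per-instance setup could look like in EventFlow's documented configuration format: two outputs, one for the application provider and one for that instance's customer, with an output-specific filter trimming what the customer store receives. The key, URI, and filter expression are placeholders.

```json
{
  "inputs": [
    { "type": "EventSource", "sources": [ { "providerName": "MyCompany-MyApp" } ] }
  ],
  "outputs": [
    {
      "type": "ApplicationInsights",
      "instrumentationKey": "provider-ikey-for-this-instance"
    },
    {
      "type": "ElasticSearch",
      "serviceUri": "https://logs.customer-a.example.com:9200",
      "filters": [
        { "type": "drop", "include": "Level == Verbose" }
      ]
    }
  ],
  "schemaVersion": "2016-08-11"
}
```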

The way I read your "perfect solution for log processing/reporting" is that at the core you want low-ceremony, remote, easy-to-customize, and quickly applied configuration of diagnostics for services. And you want some predefined templates to get you started quickly. If that is the goal, it makes perfect sense. I believe this can be achieved in various ways. It does not matter that much whether it is the service process or a dedicated daemon that sends the data off the nodes. For the reasons listed above I still believe that, architecturally, it makes more sense if the diagnostic configuration is associated with individual services (deployed service instances). The challenge is then to have an admin experience inside OMS/AI that meets your goals. I think we can meet this challenge (well, gradually :-) ).

Oh, and we are working on merging OMS Log Analytics and the "Analytics" part of Application Insights. No details to share at this point, but this is a definitive, upper-management-demanded long-term goal.

cwe1ss commented 7 years ago

For people who are reading this: I just found the following on the Application Insights pricing page:

What if I use the same node for multiple applications that I’m monitoring?

No problem. We only count the unique nodes sending telemetry data within your Azure subscription (billing account). For instance, if you have 5 separate websites running on the same physical server and each website is configured with Application Insights Enterprise (charged per node) then collectively these will count as one node.

You can also have applications using Application Insights Basic (charged per GB) in the same Azure subscription and this will not affect the node counting for the applications that use Application Insights Enterprise.

Not sure when this was introduced but this is a very interesting and welcome change!