Agent initialisation error due to Lazy logic in Full Framework Elastic APM module

elastic / apm-agent-dotnet

https://www.elastic.co/guide/en/apm/agent/dotnet/current/index.html

Apache License 2.0

584 stars 209 forks source link

Agent initialisation error due to Lazy logic in Full Framework Elastic APM module #596

Closed NeoXtreem closed 4 years ago

NeoXtreem commented 4 years ago

Using the current version (1.1.2) of Elastic.Apm.AspNetFullFramework and following the configuration instructions here, I get following exception on startup:

The singleton APM agent has already been instantiated and can no longer be configured

I've stepped into the Elastic APM code, and it appears that ElasticApmModule.Init calls Agent.Instance which has the effect of causing Lazy.IsValueCreated to always return true thereafter. ElasticApmModule.Init then continues on the next line to indirectly call Agent.Setup which throws an exception if Lazy.IsValueCreated returns true (which it does because of the previous line). Therefore, this is a showstopping bug as the Full Framework Elastic APM module simply cannot be used for ASP.NET with this initialisation logic.

It appears @chrisvanderpennen experienced a similar problem on #244, and also @OlegUfaev here on #441, but that relates to ASP.NET Core and can easily be circumvented with a try-catch as @pburrows suggested here. But in my case, I'm using ASP.NET with .NET Framework, so I don't have that option as config is done through Web.config.

Please can this be fixed, and a workaround provided in the meantime?

Stack trace:

[Exception: The singleton APM agent has already been instantiated and can no longer be configured] Elastic.Apm.Agent.Setup(AgentComponents agentComponents) +139 Elastic.Apm.AspNetFullFramework.<>cDisplayClass24_0.b0() +88 Elastic.Apm.Helpers.<>cDisplayClass6_0.b0() +78 System.Threading.LazyInitializer.EnsureInitializedCore(T& target, Boolean& initialized, Object& syncLock, Func`1 valueFactory) +139 Elastic.Apm.Helpers.LazyContextualInit.Init(Action initAction) +300 Elastic.Apm.Helpers.IfNotInitedHelper.Init(Action initAction) +56 Elastic.Apm.AspNetFullFramework.ElasticApmModule.InitOnceForAllInstancesUnderLock(String dbgInstanceName) +277 Elastic.Apm.AspNetFullFramework.ElasticApmModule.Init(HttpApplication context) +180 System.Web.HttpApplication.RegisterEventSubscriptionsWithIIS(IntPtr appContext, HttpContext context, MethodInfo[] handlers) +587 System.Web.HttpApplication.InitSpecial(HttpApplicationState state, MethodInfo[] handlers, IntPtr appContext, HttpContext context) +173 System.Web.HttpApplicationFactory.GetSpecialApplicationInstance(IntPtr appContext, HttpContext context) +255 System.Web.Hosting.PipelineRuntime.InitializeApplication(IntPtr appContext) +347

[HttpException (0x80004005): The singleton APM agent has already been instantiated and can no longer be configured] System.Web.HttpRuntime.FirstRequestInit(HttpContext context) +552 System.Web.HttpRuntime.EnsureFirstRequestInit(HttpContext context) +122 System.Web.HttpRuntime.ProcessRequestNotificationPrivate(IIS7WorkerRequest wr, HttpContext context) +737

SergeyKleyman commented 4 years ago

Hi @NeoXtreem

Thank you very much for thorough analysis and we definitely would like to fix/provide a workaround for this or any other issue as soon as possible.

Regarding your analysis - I'm not sure I fully follow the part:

..., and it appears that ElasticApmModule.Init calls Agent.Instance

I see that the first line in ElasticApmModule.Init is call to InitOnceForAllInstancesUnderLock and InitOnceForAllInstancesUnderLock in turn calls Agent.Setup Could you please point where ElasticApmModule.Init calls Agent.Instance before indirect call to Agent.Setup?

Unfortunately I don't see any workaround that won't require code changes but as a quick fix that we can try to release ASAP (i.e., well before the next planned release) is to make ElasticApmModule use already existing Agent.Instance. It might cause the Agent to use wrong configuration (for example env vars instead of Web.config) but at least the Agent won't crash the app.

gregkalapos commented 4 years ago

@NeoXtreem thanks for reporting! I'd also have some quick questions.

Does this always happen in your app? Already on the 1. HTTP call into the application?

Do you maybe have a small reproducer?

SergeyKleyman commented 4 years ago

@NeoXtreem Is it possible that you call Agent's API (using Agent.Tracer) before ElasticApmModule.Init is called, for example from Application_Start?

NeoXtreem commented 4 years ago

@gregkalapos @SergeyKleyman Thank you for your replies, and I'm glad to see that this issue resulted in an improvement, although I have since found that the root cause of the particular issue I was experiencing was on a different branch (of my own fork) where I'd mistakenly introduced a bug by making a seemingly-innocuous change. I reverted it a couple of days ago as you can see here. As you'll see from that diff, I'd put the call to InitOnceForAllInstancesUnderLock in line to where the variable isInitedByThisCall was being used. Of course, the clue is in the variable name!

However, this leads me to a few things...

1. Unit testing

Perhaps there should be a unit test to protect against this?

2. Design

And the design seems a bit iffy to me. Basically, a call to the property Agent.Instance is having a side effect that affects behaviour depending on whether a method (namely Agent.Setup) is called before or after reading that property. This is generally bad practice as stated by Microsoft here:

It is a bad programming style to change the state of the object by using the get accessor. For example, the following accessor produces the side effect of changing the state of the object every time that the _number field is accessed.
private int _number;
public int Number => _number++;   // Don't do this

Of course, I recognise that in this case, the side effect is down to the use of Lazy and its property IsValueCreated which is exposed via Agent.IsInstanceCreated. Although this seems inconsequential on the face of it, this logic effectively causes a side effect on the behaviour of Agent properties and methods simply by calling other Agent properties get accessors.

If you really want to cause this side effect, then this should not surprise the programmer, and changing Instance to a method such as GetInstance, and Tracer to a method called GetTracer might be better here. At least then, someone changing Elastic APM source code can expect that GetInstance might cause a side effect. And a client application calling GetTracer can also expect the same. However, even then, I would be cautious about the naming as Get on a method also implies it has no side effects. Perhaps it would be better to have an additional method called TryGetTracer or even better GetTracerIfConfigured that returns null if it hasn't yet been configured. There's a good use case for this which I've put in my reply to @SergeyKleyman's comment below.

3. Client use case

@NeoXtreem Is it possible that you call Agent's API (using Agent.Tracer) before ElasticApmModule.Init is called, for example from Application_Start?

Notwithstanding all the above, it is in fact the case that I am calling Agent.Tracer in my client application because it uses a logger (Serilog) which initialises at startup and can start logging exceptions before Elastic APM Agent has fully initialised. It logs exceptions using Agent.Tracer.CurrentTransaction.CaptureException as advised here. However, this means that ApmAgent is initialised from Agent with _components set to null. This results in AgentComponents setting its ConfigurationReader property to an instance of EnvironmentConfigurationReader. As a result, the Elastic APM config in my Web.config is not read because this requires FullFrameworkConfigReader to be the ConfigurationReader.

Therefore, the additional method of GetTracerIfConfigured as suggested above would be highly useful in cases such as this, unless you know of an alternative method or workaround that should be used in such cases?

gregkalapos commented 4 years ago

Thanks for letting us know @NeoXtreem.

Quick info on how/what I'd release:

We wanted to quickly push a patch to address this with the try-catch added in #597 (PR already open), but given that this was specific to your fork and probably won't happen with the released package, we will skip that patch and just release #597 with the next minor release (which will be very soon).

I leave the issue open to discuss the rest.

NeoXtreem commented 4 years ago

I'm in the process of creating a PR for this, but here is some more info about the antipattern that is a stateful singleton that is being used in this case. As described at the aforementioned link:

A singleton is a convenient way for accessing the service from anywhere in the application code.

The model quickly falls apart when the service not only provides access to operations but also encapsulates state, which affects how other code behaves. Application configuration is a good example of this. In the best case, the configuration is read once at the application start and does not change for the entire lifetime of the application.

However, different configuration can cause a method to return different results although no visible dependencies have changed, i.e. the constructor and the method have been called with the same parameters. This can become an even bigger problem if the singleton state can change at runtime, either by rereading the configuration file or by programmatic manipulation. Such code can quickly become very difficult to reason with:
var before = new MyClass().CalculateResult(3, 2);// depends on Configuration.Instance
RefreshConfiguration(); // modifies values in Configuration.Instance
var after = new MyClass().CalculateResult(3, 2); // depends on Configuration.Instance
Without comments, an uninformed reader of the code above could not expect the values of before and after to be different, and could only explain it after looking into the implementation of the individual methods, which read and modify global state hidden in Configuration singleton.

I've asked a question on SO to get ideas on how to resolve this without DI, and one answer suggests preventing access (by way of throwing an exception) unless Agent.Setup has been called. This makes sense as it means the Agent requires initialisation before being used (standard logic I'd say), but this may be a breaking change for some users.

So, a short-term workaround may still be to add a mechanism to only get the instance if it has (or hasn't) been configured.

gregkalapos commented 4 years ago

@NeoXtreem here are some thoughts on this from me.

1. Unit testing

Yeah, sure, we can add tests, in fact I just went through your PR and that was one thing I mentioned. We can of course also have 1 test combined with the IIS module.

2. Design

So, the thing here is the following: since the agent is configurable, and it also encapsulates the tracer which knows about things like active transaction and spans across child tasks, there is a huge advantage that the Agent is a singleton.

For example you can write Agent.Tracer.CurrentTransaction in an ASP.NET Controller and it’ll give you the current transaction - already filled with all the request details, plus it also knows about all the configs (what log level to use, on which URL the APM Server listens, etc.)

So, that is the reason for the singleton instance.

However, we want the agent to be usable in any environment. In ASP.NET Core and in ASP.NET we initialize the agent - meaning we call Agent.Setup(config);

There are other cases, where only the Public Agent API is used from the agent - for example a console application - or in fact we heard it’s also used on Xbox (yeah, sounds very cool!). In this scenario users start transactions and spans manually.

There was a discussion back when we started, that in case you just use the Public API in an environment where the agent is not initialized by us (e.g. console application) - should be some configuration step needed? We usually default to zero configuration, which in this case mans you don’t need to call anything, the agent will automatically read config from environment variables and you can just start using Agent.Tracer.StartTransaction(). So if you don’t call Agent.Setup() we use the defaults, and this makes the life of users using the agent in those environments easier.

Now, in terms of whether we use an anti pattern or not - I’m totally open for a better approach to cover all this, but to be honest I’d rather have something that works fairly acceptable in any environment without major drawbacks.

If you really want to cause this side effect, then this should not surprise the programmer, and changing Instance to a method such as GetInstance, and Tracer to a method called GetTracer might be better here.

This is not a bad idea, my only issue with this is that for the “average case” this makes things complicated. Now you can say Agent.Tracer, but then you’ll do something like var tracer = Agent. GetTracerIfConfigured(); if(tracer != null)… Of course, this can be simplified, but what I try to say is that a plain property is the simplest, and that is what you need in most of the cases, except when you do something before the agent is initialized. An average case to me is something like: you are in your controller method and you want to start a span in it, or you simply use the public Agent API in an environment where we don’t have auto instrumentation - things like this, these are the normal use cases of the API.

So, overall, yes, the method has 1 side effect, which is that if the agent is not initialized, it’ll initialize it with the defaults, that is true for the 1. call, for all the other calls this does not matter, furthermore this does not matter when you use the API in an environment where we configured the agent for you.

The questions is, does it worth adding the complications for every user in every use case?

3. Client use case

Just by reading that part of your comment, I think the problem here is that the initialization logic does not allow you to replace the logger, which is definitely a problem, but I think this is more specific to the IIS module. Probably none of the suggestions above would really help with this, right? Because that way you would configure the agent and our configuration step would already fail - or be skipped.

To the suggestions one StackOverflow

The suggestion given in this answe was considered. As mentioned above, currently you can include the agent in any application and immediately start tracing things and the agent will by default read configs from environment variables. By throwing an exception we’d loose this.

Overall, to me the comment below your question on StackOverflow is the one that seems the most reasonable: if we offer a property to query if the agent is initialized you can use that in scenarios where you are potentially in a stage where the agent is not initialized (I’d say these are a small percentage of calls to the Agent) - and then use the agent as it is in all the other cases - and this is what your PR already does. We can also boost the experience here with some documentation (both in the agent docs and directly in the .cs files in the xmldocs.

I hope this helped with some additional details.

SergeyKleyman commented 4 years ago

@NeoXtreem

1. Unit testing

Perhaps there should be a unit test to protect against this?

Introducing the change that was reverted by https://github.com/NeoXtreem/apm-agent-dotnet/commit/3835aa4a442b9ee007e2004080eb1f0e4c22948c#diff-5fd6f14d78670b54d69ae733569c0c11 on top of 1.1.2 will break almost all the tests in Elastic.Apm.AspNetFullFramework.Tests since the sample application will return 500 on every request. I tested it in https://github.com/elastic/apm-agent-dotnet/pull/614 - and here are the results.

gregkalapos commented 4 years ago

@NeoXtreem

1. Unit testing Perhaps there should be a unit test to protect against this?

Introducing the change that was reverted by NeoXtreem@3835aa4#diff-5fd6f14d78670b54d69ae733569c0c11 on top of 1.1.2 will break almost all the tests in Elastic.Apm.AspNetFullFramework.Tests since the sample application will return 500 on every request. I tested it in #614 - and here are the results.

That's good! So we can say this is covered :) Maybe not really "unit tests" but I think that is fine.

Maybe worth pointing out for @NeoXtreem that these tests won't run by default when you run all the unit tests from Visual Studio or Rider, the readme file here has all the details on how to turn those on.

NeoXtreem commented 4 years ago

@gregkalapos Thanks for your detailed response. Personally, I'm an advocate of conforming to well-established standards (by "well-established", virtually anything in Microsoft .NET documentation qualifies) such as the one I mentioned concerning get accessors and side effects. The "average" case is subjective and it is difficult to foresee how the API may be used in the wild if there is more than one way to use it. For example, our use case (with the logger initialising and calling Agent.Tracer before Elastic APM Agent calls Agent.Setup) caused us to come a cropper and cost us over a day of dev time to pinpoint. Who's to say many others won't be using Elastic APM Agent this way?

To add some context, the setup of the logger is during a method called by Umbraco (the CMS we use) during Global.asax that is called early, and trace logs happen frequently thereafter. Therefore, a solution such as GetTracerIfConfigured would indeed work because it would be fine to not do anything if it returns null. But crucially, GetTracerIfConfigured would not cause Lazy to be evaluated. The idea would be to check if a private bool has been set (it would be set during Setup), and return null if it hasn't.

All that being said, with respect to the "average" case, my PR #609 will at least provide a means to get around the default configuration for "non-average" cases, and your documentation should hopefully cover this.

But I would still propose to change the properties to methods to adhere to established standards and meet the expectations of all programmers (or at least not surprise them). I mean, how hard is it to call GetTracer() instead of Tracer? The body doesn't even need to change (it can remain Instance.Tracer), but at least it will adhere to Microsoft guidelines (and programmers' expectations in terms of it not being a property that has side effects). And even if you don't agree to this, then at least Instance could be changed to GetInstance() (keep body the same) as this is not part of the public API, and it would at least make the internal code adhere to the expectations of contributors to Elastic APM Agent itself.

If you do agree to having GetTracer() instead of Tracer, but with the same body, then this is not a behavioural change; so, IsConfigured (as per my PR) would still be required. Alternatively, you could have GetTracer() instead of Tracer (with the same body) and also GetTracerIfConfigured which will return null if a private bool is false (it would be set to true during Setup).

Let me know your thoughts.

PS: Glad the Elastic.Apm.AspNetFullFramework.Tests test the change I reverted as @SergeyKleyman mentioned. I wasn't aware there were tests that are not run by default.

gregkalapos commented 4 years ago

Thanks for all the feedback here.

The initialization problem is solved here - changes on the API are not planned at this point, so we'd prefer to keep those.