Closed molsza closed 2 years ago
For investigative purposes, I've assumed this is a concurrency bug.
I've found where this could originate: the DslJsonSerializer.serializeMessageHeaders() method.
I have found at least one way it can happen. If a message is reused and is reset while the iterator is running, then at the moment an item is written there is still another item to come (so it is not the last item, and a comma is added), but by the next iteration a reset() has been called, there are no items left to write, and the loop terminates.
The examples are consistent with this, as the second document is clearly terminated earlier than the first with fields that it should have been writing now missing.
I haven't yet found why we would be doing that concurrent reset while iterating, but for me this is enough to promote this to bug status
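To make the suspected mechanism concrete, here is a minimal single-threaded sketch. It is not the agent's DslJsonSerializer code: the class, the method names, and the header values are made up for illustration, and the concurrent reset() is simulated by clearing the list at a chosen point in the loop.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a header serializer (hypothetical, not the agent's code). */
final class TrailingCommaSketch {

    /**
     * Writes the headers as a JSON object. The comma decision is made right after
     * writing entry i, based on the size seen at that moment. resetAfterIndex
     * simulates another thread clearing the headers just after that entry was
     * written (pass -1 for no reset).
     */
    static String serialize(List<String[]> headers, int resetAfterIndex) {
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < headers.size(); i++) {
            String[] header = headers.get(i);
            json.append('"').append(header[0]).append("\":\"").append(header[1]).append('"');
            if (i < headers.size() - 1) {
                json.append(','); // another entry is expected, so a separator is written
            }
            if (i == resetAfterIndex) {
                headers.clear(); // simulated concurrent reset(): the next loop check sees size 0
            }
        }
        return json.append('}').toString();
    }

    /** Two made-up headers; the values are placeholders. */
    static List<String[]> sampleHeaders() {
        List<String[]> headers = new ArrayList<>();
        headers.add(new String[] {"traceparent", "00-abc"});
        headers.add(new String[] {"JMSMessageID", "ID:1"});
        return headers;
    }

    public static void main(String[] args) {
        // prints {"traceparent":"00-abc","JMSMessageID":"ID:1"}
        System.out.println(serialize(sampleHeaders(), -1));
        // prints {"traceparent":"00-abc",}  (dangling comma, invalid JSON)
        System.out.println(serialize(sampleHeaders(), 0));
    }
}
```

The second call shows the same shape as the reported documents: a comma followed directly by the closing brace, with the remaining fields missing.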
@molsza do the agent logs contain any non-zero reference count entries?
I agree this looks like a concurrency issue. My first hunch was that this is a visibility issue with how we use co.elastic.apm.agent.impl.context.Headers (more precisely, the underlying map), but I cannot reproduce the issue under this assumption.
@molsza adding to @jackshirazi's question: do you see next() called on a depleted iterator or any other error in the agent logs?
Hi @molsza we can't yet track down the underlying cause of the concurrency issue. Can you fill out some detail on your system? What libs are you using, which JMS client, what broker, if at all possible code samples of how you are obtaining the messages - the receive loop and callbacks? Thanks
@jackshirazi I will try to provide more information next week.
@molsza Should we keep this open? If you can provide the information requested by @jackshirazi, then maybe we can investigate further. If you can share a simple app that reproduces this issue, that would be very useful. Lastly, did you add any manual instrumentation/tracing, or do you rely solely on the agent auto-instrumentation?
I am doing a lot of manual instrumentation. Now I think that it may be my fault: it looks like I am updating the transaction labels after the transaction is closed.
Thanks for updating! If you think you figured it out, please close the issue.
And if you don't figure this out, please provide the following info, so we can narrow it down:
TL;DR: everything is fine with the Elastic APM agent. The transaction object was used in multiple threads and modified after it was closed. Feel free to close the issue.
The problem was with the JMS instrumentation. The application receives a JMS message in one thread and then sends it to an executor for processing. Right after the message is sent to the executor, the consumer method ends and the agent closes the transaction. However, this transaction was still being used in the executor (new spans, new labels). Further tests showed that the error can happen while serializing the transaction, and that a concurrent modification exception can also occur while adding labels.
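The hand-off described above can be sketched with plain JDK classes. This is only a model under stated assumptions: FakeTransaction is a hypothetical stand-in, not the agent's Transaction API, and a latch forces the ordering (consumer ends first, executor writes afterwards) that in the real application only happens some of the time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class UseAfterEndSketch {

    /** Hypothetical stand-in for a transaction; it only counts writes made after end(). */
    static final class FakeTransaction {
        private final AtomicBoolean ended = new AtomicBoolean();
        private final AtomicInteger lateWrites = new AtomicInteger();
        private final Map<String, String> labels = new ConcurrentHashMap<>();

        void setLabel(String key, String value) {
            if (ended.get()) {
                lateWrites.incrementAndGet(); // the object may already be serialized or recycled
            }
            labels.put(key, value);
        }

        void end() {
            ended.set(true);
        }

        int lateWrites() {
            return lateWrites.get();
        }
    }

    /** Returns how many label writes happened after the transaction was ended. */
    static int runOnce() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        FakeTransaction tx = new FakeTransaction();
        CountDownLatch consumerReturned = new CountDownLatch(1);
        try {
            // The consumer hands the message (and, implicitly, the transaction) to the executor.
            Future<?> task = executor.submit(() -> {
                consumerReturned.await(); // force the problematic ordering for the demo
                tx.setLabel("handler", "orderService"); // made-up label
                return null;
            });
            tx.end(); // the consumer method returns, so the transaction is closed
            consumerReturned.countDown();
            task.get();
            return tx.lateWrites();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("writes after end: " + runOnce()); // prints "writes after end: 1"
    }
}
```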
Solving the problem was not as straightforward as one might think. First I tried to create a child transaction from the existing one, pass it to the executor, and activate it there, but even though the agent logged that the transaction was created/activated/deactivated/ended, it never showed up in APM. The next attempt, this time successful, was to cancel the original JMS transaction (ignore it) and create a new one from scratch in the executor thread.
I think the easiest and cleanest solution for you is to find any method during whose execution the message-processing task is submitted to the executor, and trace it as a child span of the JMS transaction. It doesn't have to be the actual method that submits the message-processing task, but any method that starts before the task is submitted and ends after. Once you have identified such a method, you should:
- start and activate a child span from the JMS transaction, which is the active transaction on the JMS consumer thread at the start of the method described above
- end the span after message processing has finished on the executor thread

If you do that, you don't need to do anything special to propagate context between threads or to activate/deactivate on the executor thread; our concurrency plugin does that automatically. In addition, you won't lose your JMS consumer transaction info. The JMS receive transaction should be terminated and sent properly to the APM server, and the async task will be traced as a span with its own lifecycle, to which you can add labels and custom data as much as you like before you decide to end it.
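The lifecycle suggested here can be sketched as follows. This is a plain-JDK model, not the agent's API: SketchSpan and its methods are hypothetical stand-ins, and the span is passed to the task explicitly because the automatic context propagation done by the concurrency plugin cannot be reproduced without the agent.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ChildSpanHandOffSketch {

    /** Hypothetical stand-in for a transaction or span; it only records lifecycle events. */
    static final class SketchSpan {
        private final String name;
        private final List<String> events;
        private boolean ended;

        SketchSpan(String name, List<String> events) {
            this.name = name;
            this.events = events;
            events.add("start " + name);
        }

        SketchSpan startChild(String childName) {
            return new SketchSpan(childName, events);
        }

        synchronized void setLabel(String key, String value) {
            events.add((ended ? "LATE label on " : "label on ") + name);
        }

        synchronized void end() {
            ended = true;
            events.add("end " + name);
        }
    }

    static List<String> run() {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            SketchSpan jmsTransaction = new SketchSpan("jms-receive", events);
            // Consumer thread: start the child span before the task is submitted.
            SketchSpan processing = jmsTransaction.startChild("process-message");
            Future<?> task = executor.submit(() -> {
                try {
                    processing.setLabel("handler", "orderService"); // made-up label
                } finally {
                    processing.end(); // ended on the executor thread, after the work is done
                }
            });
            jmsTransaction.end(); // the consumer method returns; only the transaction ends here
            task.get();
            return events;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The point of the design is that labels and the end call go to the child span, whose lifetime covers the asynchronous work, so nothing touches the JMS transaction after the consumer method has returned. The relative order of the transaction's end and the span's events can vary between runs.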
I hope this helps. Closing the issue anyway. Thanks for the input!
Creating a new span would work; even the labels that caused this issue would be fine to have on a span instead of the transaction. But the main point is that I want to change the name of the transaction to the actual method/service that processes the message, and I don't know that until the message is handed to the executor. Also, I would not be able to set the outcome on the transaction.
BTW, even though you can set the outcome on spans, it is not shown in the APM UI in any way. It would be nice if there were some icon, or if the span name were printed in red, when its outcome is set to failure.
We kindly ask to post all questions and issues on the Discuss forum first. In addition to awesome, knowledgeable community contributors, core APM developers are on the forums every single day to help you out as well, so your questions will reach a wider audience there. If we confirm that there is a bug, then you can open an issue in this repo.
Describe the bug
Agent version 1.26
I've noticed that when the service is under load, the agent starts sending invalid JSON messages to the APM server. It happens at around 500 events/s.
Example:
In the first message there is an additional ',' after traceparent, in the second after JMSMessageID.
Another example, when the load was 2-4 times bigger. In this case, however, timeouts from the APM server were also observed, so maybe the cut in the message is somehow connected.
Steps to reproduce
Expected behavior
Debug logs