Agent creates invalid JSONs when under load

molsza commented 3 years ago

We kindly ask to post all questions and issues on the Discuss forum first. In addition to awesome, knowledgeable community contributors, core APM developers are on the forums every single day to help you out as well, so your questions will reach a wider audience there. If we confirm that there is a bug, then you can open an issue in this repo.

Describe the bug

Agent version 1.26

I've noticed that when service is under a load agent is starting to send invalid JSON messages to APM server. It is happening around 500 event/s.

Example:

2021-10-28 10:22:48,864 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error sending data to APM server: Server returned HTTP response code: 400 for URL: http://130.100.140.203:8200/intake/v2/events, response code is 400
2021-10-28 10:22:48,864 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Sending payload to APM server failed
java.io.IOException: Server returned HTTP response code: 400 for URL: http://130.100.140.203:8200/intake/v2/events
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1900) ~[?:1.8.0_222]
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) ~[?:1.8.0_222]
        at co.elastic.apm.agent.report.AbstractIntakeApiHandler.endRequest(AbstractIntakeApiHandler.java:148) [?:?]
        at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.endRequest(IntakeV2ReportingEventHandler.java:161) [?:?]
        at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.handleEvent(IntakeV2ReportingEventHandler.java:77) [?:?]
        at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.onEvent(IntakeV2ReportingEventHandler.java:65) [?:?]
        at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.onEvent(IntakeV2ReportingEventHandler.java:36) [?:?]
        at co.elastic.apm.agent.shaded.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168) [?:?]
        at co.elastic.apm.agent.shaded.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125) [?:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
2021-10-28 10:22:48,865 [elastic-apm-server-reporter] WARN  co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - {
  "accepted": 7035,
  "errors": [
    {
      "message": "decode error: data read error: v2.transactionRoot.Transaction: v2.transaction.Context: v2.context.Message: v2.contextMessage.Headers: ReadString: expects \" or n,",
      "document": "{\"transaction\":{\"timestamp\":1635409330926001,\"name\":\"JMS RECEIVE from queue Queue/TestQueue\",\"id\":\"66ceac8bfb66f904\",\"trace_id\":\"4bb69a9161a92271b4b58a28706f3d91\",\"parent_id\":\"03489a3271df33c4\",\"type\":\"messaging\",\"duration\":0.189,\"outcome\":\"success\",\"context\":{\"service\":{\"framework\":{\"name\":\"JMS\"},\"name\":null},\"message\":{\"headers\":{\"JMSMessageID\":\"ID:25d389d0-37c8-11ec-9d57-0000000011e4\",\"JMSExpiration\":\"0\",\"JMSTimestamp\":\"1635409330923\",\"traceparent\":\"00-4bb69a9161a92271b4b58a28706f3d91-03489a3271df33c4-01\",},\"age\":{\"ms\":3},\"queue\":{\"name\":\"Queue/TestQueue\"}},\"tags\":{}},\"span_count\":{\"dropped\":0,\"started\":0},\"sample_rate\":1.0,\"sampled\":true}}"
    },
    {
      "message": "decode error: data read error: v2.transactionRoot.Transaction: v2.transaction.Context: v2.context.Message: v2.contextMessage.Headers: ReadString: expects \" or n,",
      "document": "{\"transaction\":{\"timestamp\":1635409331060002,\"name\":\"JMS RECEIVE from queue Queue/TestQueue\",\"id\":\"bf2ab4df01ff57e4\",\"trace_id\":\"69a46689ecbd73ecf02ed47e3a5d19e5\",\"parent_id\":\"95749c6d40e08d8f\",\"type\":\"messaging\",\"duration\":0.209,\"outcome\":\"success\",\"context\":{\"service\":{\"framework\":{\"name\":\"JMS\"},\"name\":null},\"message\":{\"headers\":{\"JMSMessageID\":\"ID:25e7fc30-37c8-11ec-b0b6-0000000011f4\",},\"age\":{\"ms\":17},\"queue\":{\"name\":\"Queue/TestQueue\"}},\"tags\":{}},\"span_count\":{\"dropped\":0,\"started\":0},\"sample_rate\":1.0,\"sampled\":true}}"
    }
  ]
}

In the first message there is aditional ',' after traceparent in second affter JMSMessageID.

Another example when load was 2-4 times bigger. In this case however timeouts from APM server has been also observed, so maybe the cut in the message is somehow connected.

2021-10-27 11:11:56,569 [elastic-apm-server-reporter] WARN  co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - {
  "accepted": 565,
  "errors": [
    {
      "message": "decode error: data read error: v2.transactionRoot.Transaction: v2.transaction.Name: ReadString: expects \" or n,",
      "document": "{\"transaction\":{\"timestamp\":1635325901078001,\"name\":"
    }
  ]
}

Steps to reproduce

Expected behavior

Debug logs

Click to expand

``` replace this line with your debug logs ```

jackshirazi commented 3 years ago

For investigative purposes, I've assumed this is a concurrency bug.

I've found where it's possible for this to occur from, in the DslJsonSerializer.serializeMessageHeaders() method.

I have found at least one way how it's possible, if a message is reused and is reset while the iterator is running through - you can have the situation where there is another item to write (so it's not writing the last item at the moment, so will add a comma) when it writes the item, but then when it comes to the next iteration, a reset() has been called, and there are no items left to write, so the loop terminates.

The examples are consistent with this, as the second document is clearly terminated earlier than the first with fields that it should have been writing now missing.

I haven't yet found why we would be doing that concurrent reset while iterating, but for me this is enough to promote this to bug status

jackshirazi commented 3 years ago

@molsza do the agent logs contain any non-zero reference count entries?

eyalkoren commented 3 years ago

I agree this looks like a concurrency issue. My first hunch was that this is a visibility issue with how we use co.elastic.apm.agent.impl.context.Headers (more precisely - the underlying map), but I cannot reproduce the issue under this assumption. @molsza adding to @jackshirazi question: do you see next() called on a depleted iterator or any other error in agent logs?

jackshirazi commented 3 years ago

Hi @molsza we can't yet track down the underlying cause of the concurrency issue. Can you fill out some detail on your system? What libs are you using, which JMS client, what broker, if at all possible code samples of how you are obtaining the messages - the receive loop and callbacks? Thanks

molsza commented 3 years ago

@jackshirazi I will try to provide more information next week.

eyalkoren commented 3 years ago

@molsza Should we keep this open? If you can provide the information requested by @jackshirazi ,then maybe we can investigate further. If you can share a simple app that reproduces this issue, that can be very useful. Lastly, did you add any manual instrumentation/tracing, or do you rely solely on the agent auto-instrumentation?

molsza commented 2 years ago

I am doing a lot of manual instrumentations. Now I think that it may my fault - looks like I am updating the transaction labels after the transaction is closed.

eyalkoren commented 2 years ago

Thanks for updating! If you think you figured it out, please close the issue.

eyalkoren commented 2 years ago

And if you don't figure this out, please provide the following info, so we can narrow it down:

How do you apply manual instrumentation? Do you use the public API? If so, what version of the public API jar are you using?
What exactly do you do? Can you share your manual instrumentation code?

molsza commented 2 years ago

TL;DR: everything is fine with the elastic APM agent. Transaction object has been used in multiple threads and modified after it was closed. Feel free to close the issue.

Problem was that with JMS instrumentation. Application is receiving JMS in one thread then it send to executor for processing. Just after message is send to executor, consumer method ends and APM is closing the transaction. However this transaction has been still used in executor (new spans, new labels). Further tests showed that error can happen in both serializing the transaction and we can have concurrent modification exception during adding the labels.

Solving the problem was not so straightforward as one could think. First I have tried to create a child transaction from existing one and passed this child transaction to executor and activate it there, but even though agent logged that transaction is create/activated/deactivated/ended it never showed in APM. Next attempt, this time successful, was to cancel the original JMS transaction (ignore it) and create a new one from scratch in executor thread.

eyalkoren commented 2 years ago

Solving the problem was not so straightforward as one could think. First I have tried to create a child transaction from existing one and passed this child transaction to executor and activate it there, but even though agent logged that transaction is create/activated/deactivated/ended it never showed in APM. Next attempt, this time successful, was to cancel the original JMS transaction (ignore it) and create a new one from scratch in executor thread.

I think the easiest and cleanest solution for you is to find any method during which execution the message-processing task is submitted to the executor and trace it as a child span of the JMS transaction. It doesn't have to be the actual method that submits the message-processing task, but any method that starts before the task is submitted and ended after. Once you identified such method, you should:

start and activate a chid span from the JMS transaction, which is the active transaction on the JMS consumer thread at the start of the method described above
close the activation scope after the message-handling task have been submitted to the executor
end the span after message processing has finished on the executor thread

If you do that, you don't need to do anything special in order to propagate context between thread and to activate/deactivate on the executor thread - our concurrency plugin does that automatically. In addition, you won't lose your JMS consumer transaction info. The JMS receive transaction should be terminated and sent properly to the APM server and the async task will be traced as a span with its own lifecycle, to which you can add labels and custom data as much as you like before you decide to end it.

I hope this helps. Closing the issue anyway. Thanks for the input!

molsza commented 2 years ago

Creating a new span would work, even the labels which has caused this issue would be fine to have it in a spans instead of transaction. But the main point was that I want to change the name of the transaction to the actual method/service which process the message, and that I don't know until the message is handled to executor. Also I would not be able to set the outcome to transaction.

BTW, even though you can set the outcome to spans, it is not shown in APM UI in any way. It would be nice if there was some icon or spans name printed in red if it outcomes is set to failure.

elastic / apm-agent-java