DataDog / dd-trace-java

Datadog APM client for Java
https://docs.datadoghq.com/tracing/languages/java
Apache License 2.0
571 stars 283 forks source link

JMS Event processing duration gets enormously large after thrown unchecked exception #3076

Closed mladendinev closed 2 years ago

mladendinev commented 3 years ago

library version: com.datadoghq:dd-trace-api:0.82.0 framework: spring-boot-starter:1.3.7

Hi,

I've been trying the past few days to understand why Datadog is reporting such huge duration for a failed event although it should have been way shorter if we add up the duration of all the sub-items. I've noticed that such time spans are only seen for failed events, the normal execution takes max 10-20 seconds. I'm not sure if the flow matters but this what it does:

Screenshot 2021-09-17 at 18 28 23

mcculls commented 3 years ago

Hi @mladendinev are you able to try the 0.87.0 release? it contains a number of JMS fixes since 0.82.0

mladendinev commented 3 years ago

Hi, @mcculls. I've updated today the library to version 0.87.0 but the issue remained.

mcculls commented 3 years ago

@mladendinev ok, we'll need to take a closer look at the trace data - note that the duration of a consume span can depend on how you're receiving messages (ie. by iterating or with a listener) and the acknowledgement mode. For example if you're iterating over messages using auto-acknowledgement then the consume span duration will be from the time the message was received off the queue until the time the next message was received (or the consumer/session closed.)

Could you explain more about how you're receiving messages from the queue and what happens when there's a failed event (ie. if you're iterating over messages is the current iteration abandoned and what happens to the consumer and its session.)

You might want to open a support ticket to share that information along with a link to a recent trace. Also if several nodes participate in JMS processing could you make sure they're all using 0.87.0 - thanks

mladendinev commented 3 years ago

I have the gut feeling that the session acknowledgement mode is the main culprit based on your comments above. If we have a successful consumption of an event, it's discarded from the queue, the session is closed and the lifespan of the operation in Datadog is normal, however if the event fails, it's moved to a dead letter queue and replayed a few more times in regular intervals.

We don't close/alter the session(s) in the code, it should be automatically maintained by the framework (SpringBoot) itself, at least that's what the documentation says.

Here's the code - some details were omitted on purpose.

Configuration

  @Bean
    public DefaultJmsListenerContainerFactory jmsListenerContainerFactoryIncomingEvent(
            ConnectionFactory connectionFactory,
            MessageConverter messageConverter) {

        DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
        factory.setConnectionFactory(connectionFactory);
        factory.setDestinationResolver(new DynamicDestinationResolver());
        factory.setConcurrency("5-10");
        factory.setSessionAcknowledgeMode(Session.CLIENT_ACKNOWLEDGE);
        factory.setMessageConverter(messageConverter);

        return factory;
    }

    @Bean
    public AmazonSQS amazonSqsClient() {

        AWSCredentialsProvider awsCredentials = new DefaultAWSCredentialsProviderChain();

        return AmazonSQSClientBuilder.standard()
                .withClientSideMonitoringConfigurationProvider()
                .withCredentials(....)
                .withRegion(...)
                .build();
    }

Listener


    @JmsListener(destination = "...")
    public void onReceive(Message<EventWrapper> receivedMsg) {

        EventWrapper eventWrapper = receivedMsg.getPayload();
        String eventType = eventWrapper.getEventType();

        EventProcessor eventProcessor = eventProcessorRegistry.getEventProcessor(eventType);

        try {
            boolean isSuccessful = eventProcessor.processEvent(receivedMsg);
            if (!isSuccessful) {
                throw new Exception()
            }
        } catch (Exception e) {
            // throw exception to avoid acknowledging the message
             throw new Exception()
        } 
    }

P.S: I made a small experiment earlier and I fired two false events one after the other which were supposed to fail (reaching the catch block with the comment) and none of them appeared in our dashboard for quite some time. I checked later and I saw the two events with a duration 1h 23 min and 1h 24 min respectively.

mcculls commented 3 years ago

thanks @mladendinev that helped identify the issue, working on a fix

github-actions[bot] commented 2 years ago

:robot: This issue has been addressed in the latest release. See full details in the Release Notes.