Not sure if we need a new issue, but I think that, in general, the async call and exception handling are not correct.
In KafkaSink:
@Override
public void process(StreamingInput<Tuple> stream, Tuple tuple)
        throws FileNotFoundException, IOException, UnsupportedStreamsKafkaConfigurationException {
    try {
        if (trace.isLoggable(TraceLevel.DEBUG))
            trace.log(TraceLevel.DEBUG, "Sending message: " + tuple);
        if (!topics.isEmpty())
            producerClient.send(tuple, topics);
        else
            producerClient.send(tuple);
    } catch (Exception e) {
        trace.log(TraceLevel.ERROR, "Could not send message: " + tuple, e);
        e.printStackTrace();
        resetProducerIfPropertiesHaveChanged();
    }
    if (producerClient.hasMessageException() == true) {
        trace.log(TraceLevel.WARN, "Found message exception");
        resetProducerIfPropertiesHaveChanged();
    }
}
The problem is that producerClient.send(...) is actually an asynchronous method: the exception may be set later, when the callback fires. This code, however, assumes send(...) is synchronous and checks for a message exception immediately after the call. And when we do get an exception, we reset the producer properties, which I believe means re-reading the properties from the app config and reconnecting.
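To illustrate the race, here is a minimal sketch against the standard Kafka producer API; the messageException field is a hypothetical stand-in for whatever backs hasMessageException():

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

class AsyncSendSketch {
    private final KafkaProducer<String, String> producer;
    private volatile Exception messageException; // hypothetical stand-in for what hasMessageException() reads

    AsyncSendSketch(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    void send(String topic, String key, String message) {
        // send() returns immediately; any failure is reported later on the
        // producer's I/O thread, typically after the caller has moved on.
        producer.send(new ProducerRecord<>(topic, key, message), new Callback() {
            @Override
            public void onCompletion(RecordMetadata metadata, Exception e) {
                if (e != null)
                    messageException = e;
            }
        });
        // Checking messageException (or hasMessageException()) right here would
        // usually observe the outcome of an earlier send, or none at all,
        // not the send just issued.
    }
}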
In here:
private void resetProducerIfPropertiesHaveChanged()
        throws FileNotFoundException, IOException, UnsupportedStreamsKafkaConfigurationException {
    OperatorContext context = this.getOperatorContext();
    if (newPropertiesExist(context)) {
        trace.log(TraceLevel.INFO,
                "Properties have changed. Initializing producer with new properties.");
        resetProducerClient(context);
    } else {
        trace.log(TraceLevel.INFO,
                "Properties have not changed, so we are keeping the same producer client and resetting the message exception.");
        producerClient.resetMessageException();
    }
}
Why do we want to reset when we have a message exception but the app config properties have not changed? Is this the right logic, that app config properties should determine how we respond to the producerClient having received an exception?
See PR #287
@chanskw Yeah, I noticed that, and I think that send cannot throw an exception, so we shouldn't be catching anything there.
Agreed, the resetMessageException() should not happen if an exception can be thrown from send (that specific method goes away with #287, but the same issue is there). Good catch!
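For illustration, if send() really cannot throw, process() reduces to something like this sketch (tuple contents deliberately left out of the trace; see the note on sensitive data below):

@Override
public void process(StreamingInput<Tuple> stream, Tuple tuple) throws Exception {
    if (trace.isLoggable(TraceLevel.DEBUG))
        trace.log(TraceLevel.DEBUG, "Sending message");
    if (!topics.isEmpty())
        producerClient.send(tuple, topics);
    else
        producerClient.send(tuple);
    // Any failure now surfaces only through the producer's async callback;
    // whether the operator should then fail, reset the region, or submit an
    // error tuple is exactly what this issue is trying to decide.
}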
Another related issue is what should happen if there's an exception sending a message in an autonomous region. Most operators will fail unless they have an explicit parameter indicating not to (or exceptions are being caught from the SPL app).
Changing the operator to fail would be a change in behaviour, but I think it's the correct thing to do; currently an application can look healthy while it's actually not writing any messages.
@ddebrunner do you have any advice on the best practices to handle exceptions in operators? Before CR was introduced, we generally tried to catch exceptions, log the error, and then submit tuples to the error port. The reason is that we wanted to be more resilient and not cause the application to crash. The downside is that the application may seem healthy when it really is not.
If the operator is in a CR, then upon an exception I believe the operator should now cause a reset to happen. This can be done by rethrowing the exception. However, what happens when the user has added the @exception annotation to the code? In that case, I believe the exception is caught by the runtime and the PE will not crash, so a reset will not happen. I cannot remember exactly, but is there an API for the operator to trigger a reset? Is this the right approach upon an exception?
In the case when the operator is not in a CR, should operators just crash? Should we be resilient and submit error tuples? What is the right behavior here?
Short answer: it depends. :-)
If the application has explicitly handled errors in some way, e.g. an optional error port, a @catch annotation, a parameter indicating to ignore errors, etc., then I don't believe the operator should fail or reset the region.
Otherwise I believe the operator should fail, which will cause the region to reset, or just a restart in an autonomous region.
Tracing the tuple to the log or trace is not recommended, as it could contain sensitive data.
@chanskw While I did create PR #287, after more thought I think we need to define what behaviour we want on error for the operator in a consistent region and in an autonomous region. I can write up the consistent region one and maybe have a stab at the autonomous region one.
One thing I was unclear about was the mechanism that checks for something having changed in the connection configuration. Would this only apply if the configuration came from an application configuration (4.2 only), or is there another mechanism that can update it?
I started a writeup here: https://github.com/IBMStreams/streamsx.messaging/wiki/KafkaProducerErrorHandling
@ddebrunner I took a quick stab at answering one of the questions in your proposal. I need to think about the rest a bit more.
re: One thing I was unclear about was the mechanism that checks for something having changed in the connection configuration. Would this only apply if the configuration came from an application configuration (4.2 only), or is there another mechanism that can update it?
The mechanism for checking for changes in the connection configuration was solely for application configuration.
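For illustration, a sketch of what that check might look like, assuming the producer properties come from a 4.2 application configuration; the configuration name "kafkaProperties" and the cached field are made up for the example:

import java.util.Map;
import com.ibm.streams.operator.OperatorContext;

class ConfigCheckSketch {
    // The properties the producer client was last built with.
    private Map<String, String> currentProperties;

    boolean newPropertiesExist(OperatorContext context) {
        // getApplicationConfiguration(...) exists as of Streams 4.2, which is
        // why this mechanism is 4.2-only.
        Map<String, String> latest =
                context.getPE().getApplicationConfiguration("kafkaProperties");
        return !latest.equals(currentProperties);
    }
}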
@chanskw
Why do we want to reset when we have a message exception but the app config properties have not changed? Is this the right logic, that app config properties should determine how we respond to the producerClient having received an exception?
There is no reset if the appConfig properties haven't changed. The exception was only being used for the purpose of deciding whether or not to reset, so I wanted to clear it in the case where we weren't going to "act on it" or had already acted on it (the exception also gets cleared if the operator is reset).
Thanks @ddebrunner ... I will try to look at this next week.
@chanskw Just to note that there is a reset() method on ConsistentRegionContext.
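A rough sketch of how an operator might use it on an async send failure; the method name is made up, and getOptionalContext() returns null when the operator is not in a consistent region:

import java.io.IOException;
import com.ibm.streams.operator.OperatorContext;
import com.ibm.streams.operator.state.ConsistentRegionContext;

class ResetSketch {
    void onSendFailure(OperatorContext context, Exception failure) throws IOException {
        ConsistentRegionContext crContext =
                context.getOptionalContext(ConsistentRegionContext.class);
        if (crContext != null) {
            // In a consistent region, ask the runtime to reset the region
            // rather than silently swallowing the failure.
            crContext.reset();
        } else {
            // Autonomous region: rethrow so the operator fails and restarts,
            // per the behaviour discussed above.
            throw new RuntimeException(failure);
        }
    }
}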
Customers are using the new toolkit https://github.com/IBMStreams/streamsx.kafka, which was written from scratch. If you still encounter the issue there, please open an issue in that repository.