GoogleCloudPlatform / spring-cloud-gcp

New home for Spring Cloud GCP development starting with version 2.0.

Give users of PubSubTemplate control over conversion exception handling #887

Open SebastianSchmidtInovex opened 2 years ago

SebastianSchmidtInovex commented 2 years ago

We are currently using spring-cloud-gcp-pubsub 2.0.2, but as far as we can see, this functionality has not changed in 3.0.0. In our setup, we use the PubSubTemplate with a JacksonPubSubMessageConverter to let our Consumers work directly with deserialized POJOs, which is working great.

But we found that the handling of erroneous messages is quite problematic. By default, every message that could not be processed successfully (or not fast enough) is retried automatically by the subscription. For us, most cases of unprocessable messages are caused by simply invalid messages that will always be unprocessable, no matter how often they are retried. For messages that are successfully converted into Java objects, we can handle this by logging the error (to trigger an investigation and resolution of the root cause) and returning the ACK signal in our Consumer. But when the conversion of the message fails, the framework automatically sends NACK and thus triggers the retry mechanism.
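
For illustration, our subscriber registration and error handling look roughly like this (a simplified sketch: MyPayload, process() and LOGGER stand in for our actual payload type, business logic and logger):

pubSubTemplate.subscribeAndConvert(
    "my-subscription",
    (ConvertedBasicAcknowledgeablePubsubMessage<MyPayload> message) -> {
      try {
        // work directly with the deserialized POJO
        process(message.getPayload());
        message.ack();
      } catch (Exception e) {
        // processing failed on an already converted message:
        // log to trigger an investigation, then ACK to avoid pointless retries
        LOGGER.error("Processing failed", e);
        message.ack();
      }
    },
    MyPayload.class);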

We thought about using a dead-letter policy, but we figured that it should not be necessary to have messages tossed around between multiple services when we already know that they cannot be processed at all.

We identified the cause in PubSubSubscriberTemplate#subscribeAndConvert, where a lambda is registered as the MessageReceiver and calls the Consumer with the converted payload. This MessageReceiver is invoked by the MessageDispatcher, which will NACK the message on any failure. We found a "solution" by customizing the SubscriberFactory and having it wrap the registered MessageReceiver to catch the PubSubMessageConversionException and mark the message as ACKed. But the structure of the respective code in the library classes seems to suggest that such tweaks are not intended at that point and are certainly not recommended.
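
Stripped down, the wrapping we apply in our customized SubscriberFactory looks roughly like this (only the receiver decoration is shown, the factory plumbing around it is omitted; the class name is just a placeholder):

import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.spring.pubsub.support.converter.PubSubMessageConversionException;
import com.google.pubsub.v1.PubsubMessage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Wraps the MessageReceiver registered by subscribeAndConvert so that a conversion
// failure is logged and ACKed instead of being NACKed by the MessageDispatcher.
class AckOnConversionFailureReceiver implements MessageReceiver {

  private static final Logger LOGGER =
      LoggerFactory.getLogger(AckOnConversionFailureReceiver.class);

  private final MessageReceiver delegate;

  AckOnConversionFailureReceiver(MessageReceiver delegate) {
    this.delegate = delegate;
  }

  @Override
  public void receiveMessage(PubsubMessage message, AckReplyConsumer consumer) {
    try {
      delegate.receiveMessage(message, consumer);
    } catch (PubSubMessageConversionException e) {
      // the payload will never become convertible, so retrying is pointless
      LOGGER.error("Dropping unconvertible message " + message.getMessageId(), e);
      consumer.ack();
    }
  }
}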

Most importantly, this works for subscribeAndConvert but not for pullAndConvert, since there the caller receives the PubSubMessageConversionException and has no chance to ACK the message anyway.

So we are worried that our approach might not be a good idea and that we have perhaps overlooked something important here. We would have thought that our use case is quite common, but everything we find (or do not find) seems to contradict that.

Describe the solution you'd like
If you agree with us that there is no gain in retrying non-convertible messages, then there should be an easier way to differentiate between the handling of conversion errors and processing errors. In particular, there should be a means to have conversion errors logged (or passed to a customizable handler) and the message ACKed, applicable both to subscribeAndConvert and pullAndConvert.

Describe alternatives you've considered
If there is another recommended way to handle this use case, we would really appreciate some advice.

elefeint commented 2 years ago

@SebastianSchmidtInovex Thank you for the detailed report.

There may be an elegant bean-overriding solution to the problem. Try extending JacksonPubSubMessageConverter (or, since it's a very small class, defining your own implementation of PubSubMessageConverter) so that it logs the error and returns. Let's say you name it ErrorLoggingJacksonPubSubMessageConverter.
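
Roughly something like this (just a sketch, not tested; it logs the conversion error and returns null, so your Consumer would receive a null payload instead of a NACK and retry):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.cloud.spring.pubsub.support.converter.JacksonPubSubMessageConverter;
import com.google.cloud.spring.pubsub.support.converter.PubSubMessageConversionException;
import com.google.pubsub.v1.PubsubMessage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ErrorLoggingJacksonPubSubMessageConverter extends JacksonPubSubMessageConverter {

  private static final Logger LOGGER =
      LoggerFactory.getLogger(ErrorLoggingJacksonPubSubMessageConverter.class);

  public ErrorLoggingJacksonPubSubMessageConverter(ObjectMapper objectMapper) {
    super(objectMapper);
  }

  @Override
  public <T> T fromPubSubMessage(PubsubMessage message, Class<T> payloadType) {
    try {
      return super.fromPubSubMessage(message, payloadType);
    } catch (PubSubMessageConversionException e) {
      // log and swallow instead of letting the exception propagate and cause a NACK
      LOGGER.error("Could not convert message " + message.getMessageId(), e);
      return null;
    }
  }
}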

Then, instead of defining an instance of JacksonPubSubMessageConverter as per the documentation, define a bean of your new type:

@Bean
public PubSubMessageConverter pubSubMessageConverter() {
  return new ErrorLoggingJacksonPubSubMessageConverter(new ObjectMapper());
}

Spring Cloud GCP's autoconfiguration will then pick up ErrorLoggingJacksonPubSubMessageConverter and use it everywhere message conversion is required.

Let us know if this works for you.

SebastianSchmidtInovex commented 2 years ago

Hello @elefeint,

many thanks for the fast response. This approach would indeed be less intrusive with regard to the existing framework classes. There is one point in your suggestion that I am concerned about.

There may be an elegant bean overriding solution to the problem. Try extending JacksonPubSubMessageConverter (or, since it's a very small class, define your own implementation of PubSubMessageConverter) that logs error and returns.

I wonder what this custom converter should return after logging the error. It is expected to return the Java object it just deserialized, but in that situation it cannot do so. The MessageReceiver code will pass whatever the converter returns as the payload to the registered Consumer. Would I have to ensure in every Consumer that the payload is not null? That is certainly possible, but it has the drawback of affecting many more classes and would have to be considered for every new consumer.
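
For illustration, every Consumer would then need a guard along these lines (same placeholder names as in my sketch above):

(ConvertedBasicAcknowledgeablePubsubMessage<MyPayload> message) -> {
  if (message.getPayload() == null) {
    // conversion already failed and was logged by the converter; nothing to process
    message.ack();
    return;
  }
  process(message.getPayload());
  message.ack();
}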

In my opinion, that call to the consumer should not be performed at all, and for that the exception serves its purpose very well. Only the handling should differ between "conversion failed and processing didn't happen" and "conversion succeeded but processing failed" (and even in the latter case, I would assume that the share of recoverable errors where a retry makes sense is very small).

In fact, that is the core of my case here: Should NACK (and retry) really be the default behaviour for all errors when the predominant share of them cannot be fixed by automatic retries? Or is the assumption about these shares just wrong?

elefeint commented 2 years ago

That's fair. The reason I am reluctant to jump to adding overloaded methods accepting both a message handler callback and an error handler callback is that the PubSubSubscriberTemplate already has too many methods, and we've had feedback that they are already confusing to use.

There is one more mitigation you can try for your use case: implement ConvertedBasicAcknowledgeablePubsubMessage, which is a public interface, in a custom class such as FailedPubSubMessage, and then return an instance of it from the custom converter. This would prevent null-handling issues. But yes, you are correct that in a perfect API, the consumer of a successful message would not be called at all on failure.

meltsufin commented 2 years ago

I see three main options here, all of which have already been mentioned:

  1. Use the dead-letter queue for invalid messages.
  2. Provide a custom converter that doesn't throw an exception.
  3. Provide a custom message receiver that catches the exception. This option can be implemented either by extending the template or by the user.

Frankly, I would take a closer look at why you're getting so many invalid messages and maybe fix the problem upstream. Aside from that, a dead-letter queue is probably the canonical solution. I'm not entirely opposed to adding an overloaded method to the template either; I would just like to see more demand for it than from a single customer.

meltsufin commented 2 years ago

Also, to help with ensuring that the messages are always in the right format, you might want to look into the Pub/Sub schemas support.

https://cloud.google.com/pubsub/docs/schemas

SebastianSchmidtInovex commented 2 years ago

I understand your position and agree. Perhaps the current behaviour really does fit most users of the library, and since we didn't find any other discussions of this issue, that seems to be the case. Maybe if there is more demand, others will comment here as well. It's up to you to decide whether that justifies changes to your code base.

Regarding your response @meltsufin, our problem is not really that we are seeing many invalid messages. But looking at our system, which currently consumes 9 subscriptions fed by external systems (most of them built with technology where JSON is rather unknown amongst the developers), we have exactly one use case among these where retries might help. So we figured that the probability of invalid payloads is much higher and wanted the default behaviour to avoid retries.

Well, at this point we are at least confident that our use case is not completely inappropriate and that our current solution is viable. I will discuss with the team whether one of the other options mentioned here might be preferable.

Thank you for the insights and the hint on the schemas. We will have a look at that more closely.

elefeint commented 2 years ago

We'll keep the issue open to gauge demand, although I'll change the title ("acking on failure" sounds a bit scary).