apache / openwhisk-package-kafka

Apache OpenWhisk package for communicating with Kafka or Message Hub
https://openwhisk.apache.org/
Apache License 2.0

Handle Action Error #271

Open pneuschwander opened 6 years ago

pneuschwander commented 6 years ago

Hello guys, how can errors be handled when using messageHubFeed as a trigger for an OpenWhisk action?

Let's take the following example scenario: TopicA contains the messages: M1, M2, M3, M4, M5

The OpenWhisk action Action1 is bound to a trigger for TopicA.

Action1 persists messages in Cloudant.

The trigger is successfully fired with {"messages": [M1, M2, M3]}. Now assume that Cloudant is unavailable or the action crashes/fails.

As far as I know, the offset has already been committed, so these messages will never be redelivered or retried. The following trigger/action invocations may end the same way (for example, if Cloudant is down for 5 minutes).

So to sum up: if the action fails, messageHubFeed ignores that and fires the trigger for the next messages, whether they can be processed or not. In the worst case, all messages get delivered but none are ever successfully processed. In such a case it would be nice to pause delivery until the action can process messages again.

"Messages can't currently be processed, it is not good to deliver more of them, let's queue them up (kafka can do this) and try to continue delivery in 5 Minutes".

I understand that a poisoned message should not halt processing and may be skipped. But what can we do in such a "database is down" scenario?

Can/Should the processing be paused?

Do we need to monitor the activation records and manually resolve all the failed ones?

Should all affected messages be sent to a Dead-Letter-Queue/Topic? And what if that fails, too (timeout, network partitioning, ...)?

Does anyone have ideas or experience on how to deal with these kinds of scenarios?

jberstler commented 6 years ago

@regmebaby Right now, whoever fires the trigger gets no feedback at all about whether any connected actions even run, let alone whether those actions succeed. This is by design, to keep trigger firing as lightweight and quick as possible. As such, the Kafka/Message Hub event provider cannot know that your actions failed, so it cannot skip backwards and re-fire triggers for those messages.

On top of that, as far as I know, it is currently not possible to pause an event provider to stop it from firing triggers. But even if that were possible, there is also no way to tell the Kafka/Message Hub trigger to rewind and re-fire for a specific message or offset.

So... what to do in this situation? If you need to guarantee that every message is processed, I think you will need to handle it in your action.

One way to handle this situation is to persist somewhere (Cloudant? Redis?) information about the messages that failed processing. You could persist either the entire message contents or perhaps just the topic and offset of each message, since these are contained in the trigger payload sent to the action handling the messages. In either case, you could then have a periodic trigger that fires every so often to retry the messages that need it. This trigger would fire an action that:

  1. Examines your persisted store of messages that failed processing
  2. Attempts to process them by invoking the right action(s)
  3. If successful, removes the message from the store (or marks it as being successfully processed)
  4. If processing fails, it leaves those messages around for another retry, the next time the periodic trigger fires.
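As a rough illustration of the steps above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `FailedMessageStore` is an in-memory stand-in for whatever durable store you pick, `process` is a placeholder for your own processing logic, and the message fields `topic`, `partition`, `offset`, and `value` are assumed to be present in each entry of the trigger payload's `messages` array.

```python
from typing import Callable, Dict, List, Tuple

# (topic, partition, offset) identifies a message uniquely.
FailedKey = Tuple[str, int, int]


class FailedMessageStore:
    """Hypothetical in-memory stand-in for a durable store (Cloudant, Redis, ...)."""

    def __init__(self) -> None:
        self._pending: Dict[FailedKey, dict] = {}

    @staticmethod
    def _key(message: dict) -> FailedKey:
        return (message["topic"], message["partition"], message["offset"])

    def record(self, message: dict) -> None:
        self._pending[self._key(message)] = message

    def pending(self) -> List[dict]:
        # Return a copy so callers can remove entries while iterating.
        return list(self._pending.values())

    def remove(self, message: dict) -> None:
        self._pending.pop(self._key(message), None)


def handle_trigger(params: dict, process: Callable[[dict], None],
                   store: FailedMessageStore) -> dict:
    """Main action: process each message and record the ones that fail."""
    failed = 0
    for message in params.get("messages", []):
        try:
            process(message)
        except Exception:
            store.record(message)
            failed += 1
    return {"failed": failed}


def retry_failed(process: Callable[[dict], None],
                 store: FailedMessageStore) -> dict:
    """Periodic action: retry everything recorded as failed (steps 1 to 4)."""
    retried = succeeded = 0
    for message in store.pending():      # step 1: examine the store
        retried += 1
        try:
            process(message)             # step 2: attempt processing again
        except Exception:
            continue                     # step 4: leave it for the next run
        store.remove(message)            # step 3: success, drop the record
        succeeded += 1
    return {"retried": retried, "succeeded": succeeded,
            "remaining": len(store.pending())}
```

Note that the store for failed messages should not depend on the same service whose outage caused the failure, otherwise recording the failure fails too. Retried messages may also be processed out of their original order.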

> Should all affected messages be sent to a Dead-Letter-Queue/Topic?

I believe this has been discussed at some point, but only for scenarios where the trigger fails to fire. There is no way to make the event provider do this for you when the trigger fires successfully but the triggered actions fail.

I hope this helps.

HunderlineK commented 5 years ago

@jberstler There are many scenarios in which the actions cannot log the failed messages: for example, an out-of-memory exception will terminate the action process before it even starts, a network connection issue will prevent the action from initiating, and so on.

Without support for at-least-once delivery, this package cannot be used for any use case where data integrity is critical.

Even a mediator which receives all the messages from the event and then compares them to the messages successfully processed by the actions is not completely reliable, as the messages might fail to reach the comparison queue of the mediator for the same reasons that they will fail to reach the logging of the actions.
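To make the mediator idea concrete: the comparison itself is just a set difference, as in the hypothetical sketch below (the function name and the `(topic, partition, offset)` identifiers are assumptions for illustration). It also shows the weakness: a message that never reaches the `delivered` list cannot show up as missing.

```python
from typing import Iterable, List, Tuple

# (topic, partition, offset) identifies a message uniquely.
MessageId = Tuple[str, int, int]


def find_unprocessed(delivered: Iterable[MessageId],
                     processed: Iterable[MessageId]) -> List[MessageId]:
    """Return the messages the mediator saw but no action confirmed, in sorted order."""
    return sorted(set(delivered) - set(processed))
```

For example, with offsets 1 to 3 delivered on TopicA and only offset 2 confirmed, the result contains offsets 1 and 3.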

Basically, an at-least-once delivery mechanism is necessary to use the package for any use case that requires data integrity.
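For clarity, this is roughly what I mean by at-least-once delivery. The sketch below is a plain Python simulation, not how this package or any Kafka client is implemented: `log` stands in for a topic partition, `committed` for the committed offset, and `handler` for the action. The only point is that the offset advances after the handler succeeds, never before.

```python
from typing import Callable, List


def consume_at_least_once(log: List[str], committed: int,
                          handler: Callable[[List[str]], None],
                          batch_size: int = 3) -> int:
    """Deliver batches starting at `committed`; return the new committed offset.

    The offset only moves forward after `handler` returns without raising,
    so a failed batch is delivered again on the next call.
    """
    while committed < len(log):
        batch = log[committed:committed + batch_size]
        try:
            handler(batch)
        except Exception:
            # Do not commit: the same batch will be redelivered later.
            return committed
        committed += len(batch)
    return committed
```

With the topic from the original example (M1 to M5), a handler that raises while the database is down leaves the offset at 0, and a later call delivers M1 to M3 again followed by M4 and M5. Because a batch can be delivered more than once, the handler has to be idempotent.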