dashbitco / broadway_cloud_pub_sub

A Broadway producer for Google Cloud Pub/Sub
Apache License 2.0

One event never acked when using handle_batch #64

Closed · michaelst closed this issue 3 years ago

michaelst commented 3 years ago

When using handle_batch to process messages, it appears that there is always one item that won't get processed even though the pipeline succeeds. I am having this issue across a couple of different consumers, and it only happens when using handle_batch. We have the batch size set to 1000. Let me know if there is any additional info I can provide to help debug this.

[screenshot]
michaelst commented 3 years ago

The 1 -> 0 count yesterday was me purging the subscription so it would stop delivering the message

josevalim commented 3 years ago

Hi @michaelst! Can you please let us know your versions of Broadway and this lib?

To rule out external intervention, can you try running a simple pipeline that does nothing in handle_message or handle_batch (i.e. it should just ack the messages) and let us know if the issue still exists? Meanwhile I will do some code exploration to find possible root causes.
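
For reference, a do-nothing pipeline could look roughly like this (a sketch only; the module name, subscription, and batcher settings are placeholders, not the reporter's actual configuration):

```elixir
defmodule MyApp.NoopPipeline do
  use Broadway

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        module:
          {BroadwayCloudPubSub.Producer,
           subscription: "projects/my-project/subscriptions/my-subscription"}
      ],
      processors: [default: []],
      batchers: [default: [batch_size: 1000, batch_timeout: 10_000]]
    )
  end

  # Pass every message through untouched so it is simply acked at the end.
  @impl true
  def handle_message(_processor, message, _context), do: message

  @impl true
  def handle_batch(_batcher, messages, _batch_info, _context), do: messages
end
```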

josevalim commented 3 years ago

Also please check if you have any log messages in the terminal. If you accidentally drop messages in your handle_batch, Broadway should let you know about it.

michaelst commented 3 years ago

I checked the logs and there is nothing

  "broadway": {:hex, :broadway, "1.0.0", "da99ca10aa221a9616ccff8cb8124510b7e063112d4593c3bae50448b37bbc90", [:mix], [{:gen_stage, "~> 1.0", [hex: :gen_stage, repo: "hexpm", optional: false]}, {:nimble_options, "~> 0.3.0", [hex: :nimble_options, repo: "hexpm", optional: false]}, {:telemetry, "~> 0.4.3 or ~> 1.0", [hex: :telemetry, repo: "hexpm", optional: false]}], "hexpm", "b86ebd492f687edc9ad44d0f9e359da70f305b6d090e92a06551cef71ec41324"},
  "broadway_cloud_pub_sub": {:hex, :broadway_cloud_pub_sub, "0.7.0", "a6ebc5ca9f020024edc3fd9ae745c47cbf754b2d1247946d1d622ab26074cafd", [:mix], [{:broadway, "~> 1.0", [hex: :broadway, repo: "hexpm", optional: false]}, {:google_api_pub_sub, "~> 0.11", [hex: :google_api_pub_sub, repo: "hexpm", optional: false]}, {:goth, "~> 1.0", [hex: :goth, repo: "hexpm", optional: true]}, {:hackney, "~> 1.6", [hex: :hackney, repo: "hexpm", optional: false]}], "hexpm", "25268afe5b81b3829883c0cf448cbdf1db88e7e3edba979ceca3936d018a23ec"},

Also, the consumer isn't doing anything with the messages; here is one of them as an example. The trace in Datadog shows the pipeline runs successfully with no errors. I also ran it through iex directly with no resulting errors.


```elixir
def handle_batch(_batch_name, messages, _batch_info, _context) do
  {:ok, _trace} = Tracer.start_or_continue_trace("reconcile-internally-initiated-ach-transactions")

  {:ok, _result} = match_bank_transfer_items_to_tranasctions()

  {:ok, _result} =
    Multi.new()
    |> match_bank_transfer_items_to_account_activity("INPROGRESS")
    |> match_bank_transfer_items_to_account_activity("FAILED")
    |> ReconciliationStorage.transaction()

  {:ok, _trace} = Tracer.finish_trace()
  messages
end
```

josevalim commented 3 years ago

@michaelst I wonder if the issue is in the batch code itself then. Please try the following:

batch_info should tell you information such as the size plus the trigger. If batch_info.trigger == :size, then batch_info.size == 1000 and length(messages) == 1000. Can you please send this metadata to your Datadog, or log it, if any of those constraints fail?
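
Something along these lines would capture it (a sketch only; the 1000 reflects the batch_size mentioned above, and the Logger call is illustrative, not part of the reporter's pipeline):

```elixir
@impl true
def handle_batch(_batcher, messages, batch_info, _context) do
  # Log whenever the size-triggered invariants above do not hold
  # (assumes `require Logger` in the module; 1000 is the configured batch_size).
  if batch_info.trigger == :size and
       (batch_info.size != 1000 or length(messages) != 1000) do
    Logger.error(
      "unexpected batch: trigger=#{inspect(batch_info.trigger)} " <>
        "size=#{batch_info.size} messages=#{length(messages)}"
    )
  end

  # ...existing processing...
  messages
end
```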

From the graphs, it seems it does not happen every time, only from time to time?

michaelst commented 3 years ago

Yep I can start logging that and will report back once I get some data.

This is happening every time I am getting a batch of messages.

josevalim commented 3 years ago

Oh, I see, it just does not necessarily happen frequently!

michaelst commented 3 years ago

Yes we get one batch per day, usually in the morning

michaelst commented 3 years ago

I added this log

```elixir
Logger.info("handle_batch on #{length(messages)} message(s): #{inspect(batch_info)}")
```

This is what I eventually ended up getting; I published 11 messages. For more context, we have 3 k8s pods running, all connected to the same subscription, in case that could be part of the problem.

|Date                    |Host         |Message                                                                                                                              |
|------------------------|-------------|-------------------------------------------------------------------------------------------------------------------------------------|
|2021-10-01T20:37:29.138Z|0579708a-ski7|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:34:56.614Z|0579708a-ski7|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:33:02.254Z|0579708a-ski7|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:33:00.256Z|dff54640-m38o|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:31:16.785Z|0579708a-ski7|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:29:28.518Z|faf1b956-nceh|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:29:26.511Z|0579708a-ski7|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:27:49.840Z|faf1b956-nceh|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:27:47.837Z|0579708a-ski7|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:27:46.836Z|dff54640-m38o|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:26:25.593Z|faf1b956-nceh|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:26:24.591Z|0579708a-ski7|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:26:22.586Z|dff54640-m38o|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:25:00.934Z|dff54640-m38o|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:24:59.932Z|faf1b956-nceh|handle_batch on 3 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 3, trigger: :timeout}|
|2021-10-01T20:24:58.930Z|0579708a-ski7|handle_batch on 4 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 4, trigger: :timeout}|
|2021-10-01T20:23:43.404Z|0579708a-ski7|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:23:42.398Z|faf1b956-nceh|handle_batch on 7 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 7, trigger: :timeout}|
|2021-10-01T20:23:42.397Z|dff54640-m38o|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:22:28.567Z|dff54640-m38o|handle_batch on 3 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 3, trigger: :timeout}|
|2021-10-01T20:22:27.666Z|0579708a-ski7|handle_batch on 8 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 8, trigger: :timeout}|
|2021-10-01T20:21:14.747Z|dff54640-m38o|handle_batch on 5 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 5, trigger: :timeout}|
|2021-10-01T20:21:14.539Z|0579708a-ski7|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:21:14.349Z|faf1b956-nceh|handle_batch on 4 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 4, trigger: :timeout}|
[screenshot]
michaelst commented 3 years ago

It also seems strange that it took so many attempts to ack everything down to 1.

josevalim commented 3 years ago

Hrm... if this was a bug in Broadway (say, dropped messages), then I would expect some variation in the number of unacked messages. The fact that it is always 1, even with three machines, starts to make me think the bug is elsewhere.

Are there any messages being marked as failed?

You also said you published 11 messages but the batchers received much more than 33 messages altogether! What is your producer configuration? Are you using a stock producer? Can you try removing any producer configuration or hooks or custom clients and see if the issue exists?

michaelst commented 3 years ago

Oh, I think I know what happened: I increased the ack timeout and it then acked the message.

We had a 60-second timeout, but the pipeline was reporting that it finished in 20ms. So messages were getting redelivered after 60s, which is why you can see the log message groupings spaced roughly 60s apart. Maybe the ack timeout needs to be higher than the batch timeout?
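
To illustrate the relationship (placeholder values; the ack deadline is set on the subscription in Google Cloud, while batch_timeout lives in the Broadway batcher options, in milliseconds):

```elixir
# Subscription side (Google Cloud): ack deadline of 60 seconds.
#
# Broadway side: make sure the batcher flushes well before that deadline,
# otherwise messages waiting for the batch to fill get redelivered.
batchers: [
  default: [
    batch_size: 1000,
    # milliseconds; keep this comfortably below the 60s ack deadline
    batch_timeout: 30_000
  ]
]
```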

josevalim commented 3 years ago

Maybe the ack timeout needs to be higher than the batch timeout?

Yes, otherwise they will certainly be redelivered while we wait for the batch to form. Depending on the difference, we may even run into the situation where the same message is in the same batch twice? 🤔

In any case, if you want to submit a PR that adds this validation, it would most likely live here: https://github.com/dashbitco/broadway_cloud_pub_sub/blob/master/lib/broadway_cloud_pub_sub/producer.ex#L169

Or maybe improvements to the docs. :)

michaelst commented 3 years ago

Would we know from the connection what the ack timeout on the Pub/Sub subscription is? It might be good to log an error if they are the same. Or, even better, do we get an error back when trying to ack a message that can't be acked due to the timeout, and is that error maybe just being swallowed?

josevalim commented 3 years ago

Oh, this is a server parameter? So unfortunately I don’t think there is any validation we can do besides adding some notes to the docs. :(

michaelst commented 3 years ago

Yeah, that is a parameter you set up on the Google side.

However, do you know if we get an error back when trying to ack a message after the timeout?

josevalim commented 3 years ago

I don’t think so. If it failed, you would see something logged, and they also don’t include this information in the successful response: https://cloud.google.com/pubsub/docs/reference/rest/v1/projects.subscriptions/acknowledge

The link above and others all say the response is empty.

josevalim commented 3 years ago

Hi @michaelst, do you want to send a pull request to the batch_timeout docs in Broadway to remind folks to make sure their server configuration has enough message visibility/timeout? Or should I go ahead and do it?

michaelst commented 3 years ago

I can; I'll do that now.

michaelst commented 3 years ago

https://github.com/dashbitco/broadway_cloud_pub_sub/pull/65