dashbitco / broadway_cloud_pub_sub

A Broadway producer for Google Cloud Pub/Sub
Apache License 2.0

Getting checkout_timeout errors, hackney holding open connections indefinitely? #63

Closed. sorliem closed this issue 2 years ago.

sorliem commented 3 years ago

Hello,

I am experiencing an issue where the Unable to acknowledge messages with Cloud Pub/Sub, reason: :checkout_timeout (line) error is logged around the same time I see Pub/Sub expired deadline metrics recorded in GCP.

My subscription does not get a lot of traffic: there is usually a message every 5-10 minutes for most of the day, and maybe 1-2 per minute at peak times. My pool options are the defaults (pool size = 2).

I am curious why recv_timeout (line) was configured to be :infinity. Wouldn't that cause the request to block for 5 minutes in my case?

My theory is that because I have so few messages being processed, the moment a message is received from pubsub, the 2 connections in the pool are checked out trying to receive more messages, and then when that one message needs to be ack'd, there is no connection available to make an acknowledge request.

The one idea I had to fix this was to pass the :middleware option to work around the recv_timeout, e.g. middleware: [{Tesla.Middleware.Timeout, timeout: 4_000}] (rough sketch below). Is there any advice for configuring broadway_cloud_pub_sub for subscriptions that have a very low message publish rate?
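
For illustration, here is roughly what I mean. Whether BroadwayCloudPubSub.Producer actually forwards a :middleware option to the underlying Tesla client is an assumption on my part, and the subscription name is a placeholder:

# Sketch only: the :middleware option being forwarded to Tesla is an assumption,
# and Tesla.Middleware.Timeout only caps the overall request time; it does not
# change hackney's recv_timeout itself.
Broadway.start_link(MyBroadway,
  name: MyBroadway,
  producer: [
    module:
      {BroadwayCloudPubSub.Producer,
       subscription: "projects/my-project/subscriptions/my-subscription",
       middleware: [{Tesla.Middleware.Timeout, timeout: 4_000}]},
    concurrency: 1
  ],
  processors: [default: []]
)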

Thank you!

Env: Elixir 1.10.4, Erlang/OTP 22, broadway 0.6.2, broadway_cloud_pub_sub 0.6.0

josevalim commented 3 years ago

A PR that makes the value configurable is welcome, although we should probably also reduce the default to something that won't trigger timeouts on Google's API. Do you know if 5 minutes is the maximum on Google's side? Or is it even less than that?

mcrumm commented 3 years ago

@josevalim @sorliem They don't publish an exact value, but anecdotally, without specifying returnImmediately, Google will close the connection after ~60s.

sorliem commented 3 years ago

I might try and put up a PR to make it configurable :+1:

It seems like I'm experiencing the same thing as a previous issue, #18. Since I don't set returnImmediately: true (which is now deprecated according to Google), the message gets "stuck": both connections are checked out trying to pull new messages and probably have the connection closed by Google at ~60s, by which time the ack deadline has already expired. Does that seem accurate @mcrumm?

mcrumm commented 3 years ago

@sorliem Yes I am almost positive that is the exact problem we faced as well, which led to the current hackney adapter implementation in this package. Ultimately you need a connection pool size large enough to handle message pulls + acknowledgements.

Right now the default pool size is 2 * producer concurrency. Generally speaking I would think that's enough to cover the acknowledgements, but maybe we should include the batcher concurrency in there as well? It's a bit difficult to pick a specific number of connections because messages can be acknowledged anywhere in the pipeline.
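
To make the sizing concrete, here is a rough sketch. The :pool_size option name is my assumption here, so check the docs for the version you are running; the point is just that the pool should cover pulls (producers) plus acknowledgements (processors/batchers):

producer: [
  module:
    {BroadwayCloudPubSub.Producer,
     subscription: "projects/my-project/subscriptions/my-subscription",
     # 2 * producer concurrency (the current default), plus 1 for the batcher
     pool_size: 2 * 2 + 1},
  concurrency: 2
]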

sorliem commented 3 years ago

I'm not convinced that upping the pool size would solve this, though. If I upped it to 20, then I'd still have 20 connections checked out for ~60s at a time, which only slightly increases the chance that the acknowledgement request waiting in the hackney queue would secure a connection... but it's still iffy.

What I am hoping to achieve by making recv_timeout configurable is that I can set it to 5s, which forces all connections in the pool to be checked in and out more frequently, so the acknowledgement request has a much higher chance of securing a connection before the deadline expires.
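
To make the intent concrete, here is the purely hypothetical shape I have in mind for the option; the name and where it gets threaded through would be settled in the PR:

producer: [
  module:
    {BroadwayCloudPubSub.Producer,
     subscription: "projects/my-project/subscriptions/my-subscription",
     # Hypothetical option, not an existing API: instead of hackney's
     # recv_timeout: :infinity, give up after 5s so the connection is
     # returned to the pool for acknowledgements.
     receive_timeout: 5_000},
  concurrency: 1
]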

mcrumm commented 3 years ago

Only the producer processes pull messages. Any of the other pipeline processes may acknowledge messages, depending on how your topology is configured and/or whether you use Message.ack_immediately/1 in your pipeline.

So increasing the pool size does not mean that all connections will be blocked by (or even involved in) receiving messages.

mcrumm commented 3 years ago

To be clear, I am saying that pool size is separate from producer concurrency.

So if you have a pipeline with 2 producer processes:

Broadway.start_link(__MODULE__,
  name: __MODULE__,
  producer: [
    # subscription is required; placeholder value shown here
    module: {BroadwayCloudPubSub.Producer, subscription: "projects/my-project/subscriptions/my-subscription"},
    concurrency: 2
  ],
  # Broadway also requires processors; a minimal default
  processors: [default: []]
)

then by default BroadwayCloudPubSub will start a hackney pool with four (2 * 2) connection processes. Only the two producers will ever make requests to pull messages, so only two connections will be in use. There should be two connections still available for acknowledgement.

dbhobbs commented 3 years ago

If those 2 producer requests don't return, then wouldn't it follow that a 3rd connection would eventually be checked out of the pool to pull messages? I'm not sure how often (by default) Broadway requests demand from the producers, but we're definitely seeing issues with both fetching and acknowledging messages, and I have a hard time stomaching the "just-increase-the-pool-size" solution. If we have producer concurrency set to 1, then shouldn't there only ever be one open HTTP connection pulling messages per node?

Does defaulting the hackney recv_timeout to the receive_interval make sense?

mcrumm commented 3 years ago

I'm not sure how often (by default) Broadway is requesting demand from the producers.

The producer always enqueues demand from its consumers. Basically that's just to say that the act of consumers requesting work is separated from the act of the producer requesting messages from Cloud Pub/Sub.

issues with both fetching and acknowledging messages

The primary issue is likely still with hackney. There is an ongoing effort to see the default HTTP client updated (#51, #62), at which point I would expect this to be resolved without any configuration change on your end.

Does defaulting the hackney recv_timeout to the receive_interval make sense?

I don't think so. Let me attempt to explain why:

Although we should probably reduce the default to something that won’t invoke timeouts on Google’s API

The difficulty is that Cloud Pub/Sub expects us to hold the connection open so that they can close it when they are done looking for messages, and they aren't specific about their timeout. The most information provided is in the description of the returnImmediately option for a synchronous Pull request.

The GoogleApiClient implementation here allows specifying return_immediately: true. As @sorliem pointed out, this option is listed as deprecated by Google, but it allows them to retain control of closing the connection.
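
For completeness, using that escape hatch would look roughly like this (double-check the option name against the GoogleApiClient docs for the version you are on):

producer: [
  module:
    {BroadwayCloudPubSub.Producer,
     subscription: "projects/my-project/subscriptions/my-subscription",
     # Deprecated by Google: the synchronous pull returns immediately instead
     # of holding the connection open, at the cost of more empty responses.
     return_immediately: true},
  concurrency: 1
]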

So my hesitation around the config option is whether we want to support it beyond our use of hackney. @wojtekmach and @josevalim, thoughts?