Open balegas opened 1 month ago
@icehaunter can you leave here the same picture as above with the new default pool size configuration? Let's measure the improvement to decide on further action
This is (incomplete) run after #1503
It still falls over relatively soon, is it now falling over because of the snapshot timeout?
No, actually, it seems like it's another pool-related config option. The actual error is
Postgrex.Protocol (#PID<0.2568.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.2664.0> timed out because it queued and checked out the connection for longer than 15000ms
I'm looking into how to tweak this
@icehaunter That's not pool-related, it's a query running for longer than the default query timeout.
@icehaunter I encountered something similar. I think the following code fixed it:
def repeatable_read_transaction(fun) do
transaction(
fn ->
query!("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ READ ONLY", [])
query!("SET LOCAL statement_timeout = 0")
fun.()
end,
timeout: :infinity,
ownership_timeout: :infinity
)
end
so adding ownership_timeout: :infinity
to the transaction opts and then, just in case, SET LOCAL statement_timeout = 0
within the transaction.
TBH I found the ref to ownership_timeout
in some random support thread but failed to trace its use anywhere in the ecto, postgrex or db_connection sources, so it may be that I hit a heisenbug and then attributed the solution to the above change. Worth a try though. The 15_000 ms timeout is pool related I think, as pg defaults to an infinite statement timeout A value of zero (the default) disables the timeout.
@magnetised
ownership_timeout
is specific to the DBConnection.Ownership
pool which one can use in place of the default DBConnection.ConnectionPool
.
timeout
is an option you pass to Postgrex.query()
. It is set to 15000
by default. The reason the error message in the log originates in DBConnection
is because Postgrex
forwards the timeout
option to DBConnection
which then starts a timer that determines how long a given query can be holding a connection for.
This picture shows the max number of concurrent initial syncs per shape size. The takeaway is that when shape is large enough, we can't handle the next request because there are no more connections available in the Pg connection pool, because they are all busy.
We're going to tune the initial pool size to be better fit with the number of requests the server can handle.