Closed gaynetdinov closed 6 years ago
Two thoughts that come to my mind: make sure you are on an up-to-date OTP patch release, and check the timeout you pass to `Postgrex.transaction`. It may actually be a timeout error hiding as something else. :)

Ok, thanks for the tips. I have OTP 20.2 currently and I didn't know that even 20.3 is constantly getting fixes and backports. However, right now I had to roll back to 1.4.5; once I update to 1.5.3 with a fresh OTP, I'll update the issue.
Thanks @gaynetdinov! I will go ahead and close this issue for now, as it seems to be really related to the one I linked earlier. Feel free to comment here and open up a new issue if you have more information.
So I think you were right about a hidden timeout somewhere. It turns out the workers which store data have 20 seconds to store; after 20 seconds these workers kill themselves with `{:stop, :normal, state}`. But after migrating to Elixir 1.5.3 and Erlang 20.3.8.9 the store time got ~10x slower for some reason: with Elixir 1.4.5 and Erlang 20.2 I had a 2-3 second store time (sometimes 4s, sometimes 500ms), but with Elixir 1.5.3 and Erlang 20.3.8.9 the store time is now 35 seconds.
Do you have any idea why this could happen?
Can you provide more info? What is being stored? Where? Did you only change the elixir/erlang versions or deps too?
Ok, I'll try to cover as much as I can.
This issue happened when I upgraded to Elixir 1.5.3. Back then I used to do simple inserts, like I mentioned in the original message:

```elixir
{:ok, _} = Postgrex.transaction(pid, fn conn ->
  for {record, type} <- buffer do
    Postgrex.query!(
      conn,
      "INSERT INTO my_table(column_1, column_2, column_3) VALUES($1, $2, $3)",
      [record.val_1, record.val_2, type]
    )
  end
end)
```

where `buffer` is a list of items to store (1k items).
Since then I have slightly modified this code to speed up the storing process; currently I do multi-row inserts. I build one big SQL statement like `INSERT INTO my_table(column_1, column_2) VALUES ($1, $2), ($3, $4) ... ($199, $200);` and therefore insert in chunks of 200 items. But the issue is still there for both approaches of inserting the data.
I used to have only one worker to insert data, but recently I changed that, so now I have 10 workers. They are registered in `Registry`, and before storing I do `Registry.lookup` to get the PIDs, then `Enum.random/1` to choose a worker PID. But still, the issue was there with one worker and it's there with 10 workers.
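That worker-selection step can be sketched like this (a minimal, self-contained sketch; the registry name and key match the snippet further down, everything else is illustrative):

```elixir
# A :duplicate registry lets many workers register under the same key;
# Registry.lookup/2 then returns all of them as {pid, value} tuples.
{:ok, _} = Registry.start_link(keys: :duplicate, name: :my_store_workers)

# Each worker would call this from its own process (e.g. in init/1).
# Here we register the current process just to make the sketch runnable.
{:ok, _} = Registry.register(:my_store_workers, "foo", nil)

# Pick a random worker PID before storing.
{pid, _value} =
  Registry.lookup(:my_store_workers, "foo")
  |> Enum.random()
```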
So in other words the issue was there when I had one worker which did 1000 simple inserts in one transaction. The issue is still the same when I have 10 of such workers which do multi-row inserts in one transaction.
I only upgraded elixir from 1.4.5 to 1.5.3 and erlang from 20.2 to 20.3.8.9. No dependencies changed. The database is postgres 10.3. The OS is 4.14.7-gentoo. The issue is reproducible from two different machines.
Also it's worth noting that the first attempt at upgrading Elixir was without an Erlang version change: I upgraded only Elixir from 1.4.5 to 1.5.3 and had this issue. Then, as you recommended, I upgraded both Elixir to 1.5.3 and Erlang to 20.3.8.9 and had this issue again. Today I upgraded only Erlang to 20.3.8.9 without touching Elixir and I don't have the issue anymore, so I believe I've narrowed it down to the Elixir version alone.
As for the architecture, there is a GenStage consumer which saves data to the DB via a bunch of GenServer workers. It looks roughly like this:
```elixir
defmodule MyDataConsumer do
  use GenStage

  def handle_events(events, _from, state) do
    Enum.each(events, fn event ->
      store_event(event)
    end)

    {:noreply, [], state}
  end

  def handle_info(:last_event, state) do
    Process.send_after(self(), :die, :timer.seconds(20))
    {:noreply, [], state}
  end

  def handle_info(:die, state) do
    {:stop, :normal, state}
  end

  defp store_event(event) do
    {pid, _id} = Registry.lookup(:my_store_workers, "foo") |> Enum.random()
    :stored = StoreWorker.store(pid, event)
  end
end
```
```elixir
defmodule StoreWorker do
  use GenServer

  def start_link() do
    {:ok, pid} = Postgrex.start_link(<<some credentials>>)
    GenServer.start_link(__MODULE__, {pid, []}, some_opts)
  end

  def store(pid, event) do
    :stored = GenServer.call(pid, {:store_data, event}, some_big_timeout)
  end

  def handle_call({:store_data, event}, _from, {pid, buffer}) do
    buffer = [event | buffer]

    if length(buffer) == 1000 do
      insert_data(buffer, pid)
      {:reply, :stored, {pid, []}}
    else
      {:reply, :stored, {pid, buffer}}
    end
  end

  defp insert_data(buffer, pid) do
    insert_stmt = "INSERT INTO my_table (c1, c2) VALUES "

    {:ok, _} = Postgrex.transaction(pid, fn conn ->
      buffer
      |> Enum.chunk(200, 200, [])
      |> Enum.each(fn chunk ->
        # building the big SQL with placeholders:
        # ["($1, $2)", "($3, $4)", ..., "($199, $200)"]
        values_placeholders =
          Enum.map(0..(length(chunk) - 1), fn i ->
            "($#{i * 2 + 1}, $#{i * 2 + 2})"
          end)

        values =
          Enum.flat_map(chunk, fn event ->
            [event.val_1, event.val_2]
          end)

        sql = insert_stmt <> Enum.join(values_placeholders, ", ")
        Postgrex.query!(conn, sql, values)
      end)
    end)
  end
end
```
So the GenStage is consuming events and storing them. After a while, the producer sends a `last_event` message to this consumer, indicating that it has just sent the last chunk of events. After that, the GenStage has 20 seconds to store these last events. My understanding is that after upgrading to Elixir 1.5.3 the DB inserts got much, much slower for some reason, so the GenStage terminates itself in the middle of storing data. Although I don't really know if that is possible: can the GenStage be in the middle of `:stored = StoreWorker.store(pid, event)` (which is effectively a `GenServer.call`), and during that time have 'time' to process another message from its mailbox, run the `handle_info(:die, state)` callback, and terminate itself in the middle of storing data?
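Regarding that last question, OTP semantics give a definite answer: a process executing a callback does not pick up other mailbox messages until the callback returns, so an in-flight `GenServer.call` is always served before a queued `:die`. A small self-contained sketch (module and message names are made up for illustration):

```elixir
defmodule Blocker do
  use GenServer

  def init(_), do: {:ok, nil}

  def handle_call(:slow, _from, state) do
    # Simulate a slow DB insert inside the callback.
    Process.sleep(200)
    {:reply, :done, state}
  end

  # :die is only handled after the current callback returns.
  def handle_info(:die, state), do: {:stop, :normal, state}
end

{:ok, pid} = GenServer.start_link(Blocker, nil)

# Start a call from another process, then send :die mid-callback.
task = Task.async(fn -> GenServer.call(pid, :slow, 5_000) end)
Process.sleep(50)
send(pid, :die)

# The in-flight call still completes; the server stops afterwards.
result = Task.await(task)
Process.sleep(50)
```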
So the suspicion is that the Elixir version upgrade is the one causing the slowdown? Can you isolate this further? I would love to take a look at it.
Yeah, I'll try to come up with something, as I understand that my previous comment is full of various details but is still vague and not really helpful.
Oh, the comment is definitely helpful. We are ready for the next step, which is to try to make this reproducible. :)
Glad to inform you that this mystery is resolved. Upgrading distillery from 1.2.2 to 1.5 (which has Elixir 1.5 support, according to the changelog) fixed the issue. So distillery without Elixir 1.5 support apparently generated something subtly wrong in the release, which caused this misbehaviour.
Thanks for your time and support!
@gaynetdinov oh, I even know what it is. Distillery was hardcoding the path for protocol consolidation and Elixir v1.5 changed it. On recent Distillery versions it uses the proper API to retrieve the consolidated paths, so the issue disappears. :+1:
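One generic way to check whether a running build suffers from a consolidation problem like this (not specific to Distillery) is to ask the runtime whether its protocols are consolidated, e.g. from a `remote_console`:

```elixir
# When consolidation is wired up correctly, the stock protocols report
# as consolidated; in a broken release this can come back false.
enumerable = Protocol.consolidated?(Enumerable)
string_chars = Protocol.consolidated?(String.Chars)
IO.inspect({enumerable, string_chars})
```

Unconsolidated protocols fall back to runtime dispatch, which is one known way an Elixir upgrade can show up as a broad slowdown.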
I'm not sure where I should open an issue, I hope this repo is okay.
My app is storing quite a lot of data to a Postgres DB using Postgrex directly (without Ecto). The app accumulates a 'buffer' of 1000 rows in GenServer state and then does something like this, where `pid` is the one I got from `Postgrex.start_link`. It was working just fine on Elixir 1.4.5 for ages, but after upgrading to 1.5.3, without any other changes, I started to get this error. The error happens randomly: it can be right after a deploy of the app with Elixir 1.5.3, or 3 hours later. Postgres logs show nothing; from the Postgres point of view it seems like my app just suddenly disappeared. The connection is direct, no pg_bouncer in between.
`Postgrex` is started with a custom timeout of `60_000`; no other special options (except hostname, pass, etc.) are provided to the `start_link` function. The Postgrex version is 0.13.3 and db_connection is 1.1.2. Bumping the versions to 0.13.5 and 1.1.3 did not help.
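For reference, such a connection setup looks roughly like this (a sketch only: the hostname, credentials, and database name are placeholders, not the real config):

```elixir
# Hypothetical Postgrex connection with a raised per-query timeout.
{:ok, pid} =
  Postgrex.start_link(
    hostname: "localhost",
    username: "app",
    password: "secret",
    database: "app_db",
    # per-query timeout in milliseconds (Postgrex defaults to 15_000)
    timeout: 60_000
  )
```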
I know my issue is quite vague, any help would be appreciated. Thanks for your time.
UPD: after this error happens, the app seems to be running okay and `remote_console` is running fine, but it does not respond to any external commands (start, stop, restart); only `kill -9` can stop the app.