elixir-ecto / db_connection

Database connection behaviour
http://hexdocs.pm/db_connection/DBConnection.html

ConnectionError tcp send: close #144

Closed gaynetdinov closed 6 years ago

gaynetdinov commented 6 years ago

I'm not sure where I should open an issue, I hope this repo is okay.

My app stores quite a lot of data in a Postgres DB using Postgrex directly (without Ecto). The app accumulates a 'buffer' of 1000 rows in GenServer state and then does something like this:

{:ok, _} = Postgrex.transaction pid, fn(conn) ->
  for {record, type} <- buffer do
    Postgrex.query!(
      conn,
      "INSERT INTO my_table(column_1, column_2, column_3) VALUES($1, $2, $3)",
      [record.val_1, record.val_2, type]
    )
  end
end

where `pid` comes from Postgrex.start_link. It had been working just fine on Elixir 1.4.5 for ages, but after upgrading to 1.5.3, without any other changes, I started to get this error:

(EXIT) an exception was raised: 
** (DBConnection.ConnectionError) tcp send: closed
(db_connection) lib/db_connection.ex:610: DBConnection.prepare_execute!/4  
(postgrex) lib/postgrex.ex:167: Postgrex.query!/4

This error happens randomly; it can occur right after deploying the app with Elixir 1.5.3, or 3 hours later. Postgres logs show nothing; from Postgres's point of view it seems like my app just went away suddenly. The connection is direct, no pg_bouncer in between. Postgrex is started with a custom timeout of 60_000; no other special options (except hostname, password, etc.) are passed to the start_link function.

Postgrex version is 0.13.3 and db_connection is 1.1.2. Bumping versions to 0.13.5 and 1.1.3 did not help.

I know my issue is quite vague, any help would be appreciated. Thanks for your time.

UPD: after this error happens the app seems to be running okay and remote_console works fine, but it does not respond to any external commands (start, stop, restart); only kill -9 can stop the app.

josevalim commented 6 years ago

Two thoughts that come to my mind:

  1. Have you by any chance changed the machine you are running on? In any case, try upgrading OTP and see if it fixes it.
  2. Also, try increasing the timeout value when calling Postgrex.transaction. It may actually be a timeout error hiding as something else. :)

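For reference, a minimal sketch of suggestion 2, assuming the default DBConnection `:timeout` (15_000 ms) is what is being hit; the two-minute value below is an arbitrary assumption, not a recommendation:

```elixir
# Pass a larger :timeout option to Postgrex.transaction/3 (value is an assumption).
opts = [timeout: :timer.minutes(2)]

# The transaction call would then look like this (requires a running Postgrex `pid`):
#
#   {:ok, _} = Postgrex.transaction(pid, fn(conn) ->
#     # ... queries ...
#   end, opts)

IO.inspect(opts)
```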
gaynetdinov commented 6 years ago

Ok, thanks for the tips. I'm currently on OTP 20.2 and I didn't know that even 20.3 is still constantly getting fixes and backports. For now I had to roll back to 1.4.5; once I update to 1.5.3 with a fresh OTP, I'll update the issue.

josevalim commented 6 years ago

Thanks @gaynetdinov! I will go ahead and close this issue for now, as it seems to be really related to the one I linked earlier. Feel free to comment here and open up a new issue if you have more information.

gaynetdinov commented 5 years ago

So I think you were right about a hidden timeout somewhere. It turns out the workers which store data have 20 seconds to do so; after 20 seconds these workers kill themselves with `{:stop, :normal, state}`. But after migrating to Elixir 1.5.3 and Erlang 20.3.8.9, the store time for some reason got ~10x slower: with Elixir 1.4.5 and Erlang 20.2 the store time was 2-3 seconds (sometimes 4s, sometimes 500ms), but with Elixir 1.5.3 and Erlang 20.3.8.9 it is now 35 seconds.
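To measure the store time I wrap the call in `:timer.tc/1`, roughly like the sketch below (the actual Postgrex call is stubbed with a sleep here, since this sketch has no database attached):

```elixir
# Wrap the store operation in :timer.tc/1 to get its duration in microseconds.
{micros, :ok} =
  :timer.tc(fn ->
    # In the real app this is the transaction:
    #   {:ok, _} = Postgrex.transaction(pid, fn(conn) -> ... end)
    Process.sleep(10)
    :ok
  end)

IO.puts("store took #{div(micros, 1000)} ms")
```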

[screenshot: store-time graph, 2018-09-24 10:58]

Do you have any idea why this could happen?

josevalim commented 5 years ago

Can you provide more info? What is being stored? Where? Did you only change the elixir/erlang versions or deps too?

gaynetdinov commented 5 years ago

Ok, I'll try to cover as much as I can.

This issue happened when I upgraded to Elixir 1.5.3. Back then I did simple inserts, as mentioned in the original message above:

{:ok, _} = Postgrex.transaction pid, fn(conn) ->
  for {record, type} <- buffer do
    Postgrex.query!(
      conn,
      "INSERT INTO my_table(column_1, column_2, column_3) VALUES($1, $2, $3)",
      [record.val_1, record.val_2, type]
    )
  end
end

where buffer is a list of items to store (1k items).

Since then I have slightly modified this code to speed up the storing process; currently I do multi-row inserts: I build a huge SQL statement like `INSERT INTO my_table(column_1, column_2) VALUES ($1, $2), ($3, $4), ..., ($199, $200);`, so I insert in chunks of 200 items. But the issue is there for both approaches of inserting the data.
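The placeholder building is pure string work, so it can be sketched in isolation; the module and function names below are made up for illustration, and only the table/column names and chunk size come from my description above:

```elixir
defmodule MultiInsert do
  @chunk_size 200

  # Builds {sql, params} for one chunk of {val_1, val_2} tuples, numbering
  # placeholders ($1, $2), ($3, $4), ... to match the flattened params list.
  def build(chunk) do
    placeholders =
      chunk
      |> Enum.with_index()
      |> Enum.map(fn {_row, i} -> "($#{i * 2 + 1}, $#{i * 2 + 2})" end)
      |> Enum.join(", ")

    params = Enum.flat_map(chunk, fn {v1, v2} -> [v1, v2] end)

    {"INSERT INTO my_table (column_1, column_2) VALUES " <> placeholders, params}
  end

  # Splits the 1000-item buffer into chunks of 200.
  def chunks(buffer), do: Enum.chunk_every(buffer, @chunk_size)
end

IO.inspect(MultiInsert.build([{1, 2}, {3, 4}]))
```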

I used to have only one worker inserting data, but recently I changed that, so now I have 10 workers. They are registered in a Registry, and before storing I do Registry.lookup to get the PIDs, then Enum.random/1 to choose a worker PID. But still, the issue was there with one worker and it's there with 10 workers.

So in other words the issue was there when I had one worker which did 1000 simple inserts in one transaction. The issue is still the same when I have 10 of such workers which do multi-row inserts in one transaction.

I only upgraded elixir from 1.4.5 to 1.5.3 and erlang from 20.2 to 20.3.8.9. No dependencies changed. The database is postgres 10.3. The OS is 4.14.7-gentoo. The issue is reproducible from two different machines.

Also, it's worth noting that my first attempt at upgrading Elixir was without an Erlang version change: I upgraded only Elixir from 1.4.5 to 1.5.3 and had this issue. Then, as you recommended, I upgraded both Elixir to 1.5.3 and Erlang to 20.3.8.9 and had the issue again. Today I upgraded only Erlang to 20.3.8.9 without touching Elixir, and I don't have the issue anymore, so I believe I've narrowed it down to the Elixir version only.

As for the architecture, there is a GenStage worker which saves data to the DB via a bunch of GenServer workers.
It looks roughly like this:


defmodule MyDataConsumer do
  use GenStage

  def handle_events(events, _from, state) do
    Enum.each events, fn(event) ->
      store_event(event)
    end

    {:noreply, [], state}
  end

  def handle_info(:last_event, state) do
    Process.send_after(self(), :die, :timer.seconds(20))

    {:noreply, [], state}
  end

  def handle_info(:die, state) do
    {:stop, :normal, state}
  end

  defp store_event(event) do
    {pid, _id} = Registry.lookup(:my_store_workers, "foo") |> Enum.random()
    :stored = StoreWorker.store(pid, event)
  end
end

defmodule StoreWorker do
  use GenServer

  def start_link() do
    {:ok, pid} = Postgrex.start_link(<<some credentials>>)

    GenServer.start_link(__MODULE__, {pid, []}, some_opts)
  end

  def store(pid, event) do
    :stored = GenServer.call(pid, {:store_data, event}, some_big_timeout)
  end

  def handle_call({:store_data, event}, _from, {pid, buffer}) do
    buffer = [event | buffer]

    if length(buffer) == 1000 do
      insert_data(buffer, pid)
      {:reply, :stored, {pid, []}}
    else
      {:reply, :stored, {pid, buffer}}
    end
  end

  defp insert_data(buffer, pid) do
    insert_stmt = "INSERT INTO my_table (c1, c2) VALUES "

    {:ok, _} = Postgrex.transaction(pid, fn(conn) ->
      chunks = Enum.chunk(buffer, 200, 200, [])

      Enum.each chunks, fn(chunk) ->
        # building huge SQL with placeholders: ($1, $2), ($3, $4), ..., ($199, $200)
        values_placeholders =
          chunk
          |> Enum.with_index()
          |> Enum.map(fn {_event, i} -> "($#{i * 2 + 1}, $#{i * 2 + 2})" end)

        values =
          Enum.flat_map chunk, fn(event) ->
            [event.val_1, event.val_2]
          end

        sql = insert_stmt <> Enum.join(values_placeholders, ", ")
        Postgrex.query!(conn, sql, values)
      end
    end)
  end
end

So the GenStage is consuming events and storing them. After a while, the producer of this consumer sends a last_event message to it, indicating that the producer has just sent the last chunk of events. After that, the GenStage has 20 seconds to store these last events. My understanding is that after upgrading to Elixir 1.5.3 the DB inserts got much, much slower for some reason, so the GenStage terminates itself in the middle of storing data. Although I don't really know if that's possible: can the GenStage be in the middle of `:stored = StoreWorker.store(pid, event)` (which is effectively a GenServer.call) and, during that time, have 'time' to process another message from its mailbox, run the handle_info(:die) callback, and terminate itself in the middle of storing data?
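A small self-contained experiment suggests this interleaving cannot happen: a GenServer handles one message at a time, so a :die message queued in the mailbox waits until the callback currently blocked in GenServer.call returns (module names below are made up for illustration):

```elixir
defmodule SlowServer do
  use GenServer
  def init(state), do: {:ok, state}

  def handle_call(:slow, _from, state) do
    Process.sleep(100)   # simulates a slow DB insert
    {:reply, :done, state}
  end
end

defmodule Caller do
  use GenServer
  def init(slow), do: {:ok, {slow, nil}}

  def handle_call(:run, _from, {slow, _tag}) do
    send(self(), :die)                   # :die lands in the mailbox now...
    reply = GenServer.call(slow, :slow)  # ...but this callback keeps running
    {:reply, reply, {slow, :called}}
  end

  def handle_info(:die, {slow, tag}) do
    # tag is :called, proving :die was handled only after the slow call finished
    {:noreply, {slow, {:died_after, tag}}}
  end
end

{:ok, slow} = GenServer.start_link(SlowServer, :ok)
{:ok, caller} = GenServer.start_link(Caller, slow)

:done = GenServer.call(caller, :run)
Process.sleep(20)
IO.inspect(:sys.get_state(caller))
```

So a plain `{:stop, :normal, state}` from handle_info(:die) should not interrupt a store that is already in flight; it would only fire between events.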

josevalim commented 5 years ago

So the suspicion is that the Elixir upgrade is the one causing the slowdown? Can you isolate this further? I would love to take a look at it.

gaynetdinov commented 5 years ago

Yeah, I'll try to come up with something, as I understand that my previous comment is full of various details but is still vague and not really helpful.

josevalim commented 5 years ago

Oh, the comment is definitely helpful. We are ready for the next step, which is to try to make this reproducible. :)

gaynetdinov commented 5 years ago

Glad to inform you that this mystery is resolved. Upgrading distillery from 1.2.2 to 1.5 (which has Elixir 1.5 support, according to the changelog) fixed the issue. So distillery without Elixir 1.5 support seemingly generated something subtly wrong in the release, which caused this misbehaviour.

Thanks for your time and support!

josevalim commented 5 years ago

@gaynetdinov oh, I even know what it is. Distillery was hardcoding the path for protocol consolidation and Elixir v1.5 changed it. On recent Distillery versions it uses the proper API to retrieve the consolidated paths, so the issue disappears. :+1: