Connection closure error inconsistency

danturn commented 5 years ago

We have a autocomplete text box on our website... as you type it sends ajax requests to the backend to lookup records in the DB.

the autocomplete library aborts http requests and as such we initially got this error whenever that happened: [error] Postgrex.Protocol (#PID<0.3243.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.3748.0> exited

This made sense as the client pid was the request process... that had been aborted so the error message was explainable

To get rid of the noise in our logs (as this was didnt appear to be an error unless im misunderstanding) we wrapped the db lookup in a task (started async no link) with the task supervisor....

this greatly improved occurrences of the error but it hasn't eliminated it completely... it seems to occur infrequently and unpredictably and i can't work out why also the client pid in the error is not the pid of the task, im not entirely sure what it is the pid of.

if we make the psql query just sleep for 10 seconds and abort it in the middle this does not occur so it seems to be an edge case where if the connection is aborted at a specific point in the connections lifecycle this issue occurs.

I'd love to know if this is expected? and if so, is there any way we can avoid logging errors regularly in our logs?

thanks!

Dan

josevalim commented 5 years ago

I can confirm the error is definitely expected but I can't say why it still happens when using a task.

Just to rule out other possibilities:

can those errors be happening during deployments?
can those errors be happening in other routes and not the task? Basically, if any client terminates the connection (imagine someone on a 3G connection that terminates) during the query, then you will see this

If the options above are not the case, can you please post some code samples of the task usage from within the autocomplete?

josevalim commented 5 years ago

@fishcakez perhaps we should decrease the log level on this one to warn or info?

danturn commented 5 years ago

Hi Jose,

Can confirm that i can repro this locally so it definitely happens outside of deployments and definitely on that single route as noone else can connect to the vm i was using, in addition I've poked around in the db_connection code and added some logging to confirm it was from where we were expecting

this is the controller code:

defmodule PlatformWeb.Candidate.SchoolsController do         
  use PlatformWeb, :controller                                           
  alias Platform.Schools     

  def search(conn, %{"term" => term}) do                                                 
    schools = Schools.search(term)    
    json(conn, schools)    
  end                                                                  
end

and the schools context code:

  def search(term) do
    term = normalise(term)
    sql = ~s|select * from search_schools('#{term}')|

    task =
      Task.Supervisor.async_nolink(:school_search_supervisor, fn ->
        SQL.query!(Platform.Repo, sql, []) 
      end)

    %Postgrex.Result{rows: rows} = Task.await(task, 1000)

    Enum.map(rows, &search_result/1)
  end

i can see if i can create a vanilla project to repro it with the js library etc. we have repro'd it just using curl and cancelling the request via a script too

josevalim commented 5 years ago

A curl example is perfect fine by me. So if you can isolate it in an app alongside the commands, I can take a further look. :D

danturn commented 5 years ago

cool, i'll have a play, thanks!

danturn commented 5 years ago

Hi Jose,

In investigating this i discovered it was in fact a different query in a plug, apologies for raising this! i'm closing it now

thanks!

Dan

danturn commented 5 years ago

Sorry, forgot to say we still have the issue that we cant suppress this logging, it would be great if this was a warning/info message

josevalim commented 5 years ago

So let's see what @fishcakez says about decreasing the severity on this one. :)

fishcakez commented 5 years ago

Sure we can decrease it to warn or info as there would be a crash elsewhere that should be logged. If not, likely shutdown exit signals propagating to process using connection. In that case it's not really an error.

josevalim commented 5 years ago

Given the above, I would go with debug. --

José Valimwww.plataformatec.com.br http://www.plataformatec.com.br/Founder and Director of R&D

fishcakez commented 5 years ago

Debug might be too low because it's possible we executed a query but never logged it due to race condition with exit signal. I think info is good.

lukebakken commented 3 years ago

in addition I've poked around in the db_connection code and added some logging to confirm it was from where we were expecting

@danturn I realize it has been a while, but if you have time would you mind providing more information on what you did to confirm the source of this log message in your environment? I've added some logging to track this down myself. I'd like to know what others have done in this scenario.

I've tried reproducing the client ... exited error using a couple test applications without success.

In this project I try to exit early during a transaction. The sasl and other log messages tell exactly what happened -

https://github.com/lukebakken/ecto-test/blob/main/testapp/lib/testapp/cli.ex#L18

With this project I put toxiproxy in between the browser and web server to close connections early. Still no error.

https://github.com/lukebakken/phoenix-test

elixir-ecto / db_connection

Connection closure error inconsistency #197