florinpatrascu / bolt_sips

Neo4j driver for Elixir

Timeout sometimes after a lot of queries #46

Closed yoelfme closed 6 years ago

yoelfme commented 6 years ago

Hello. Over the last few weeks I've been working on a project that uses Bolt Sips to connect to Neo4j. Today I deployed the service to production and I'm experiencing some issues: I'm receiving a lot of requests and running almost 5 Neo4j queries per request, but sometimes those queries time out and my requests then respond with a 500 status code.

This is the error:

7/10/2018 11:52:29 AM 16:52:29.337 [error] #PID<0.2632.2> running UsersAPIWeb.Endpoint terminated
7/10/2018 11:52:29 AM Server: people-service:4000 (http)
7/10/2018 11:52:29 AM Request: POST /api/bots/MTCenter/users
7/10/2018 11:52:29 AM ** (exit) exited in: :gen_server.call(:bolt_sips_pool, {:checkout, #Reference<0.2148655786.3957063682.261483>, true}, 5000)
7/10/2018 11:52:29 AM     ** (EXIT) time out
7/10/2018 11:52:29 AM 16:52:29.366 request_id=6qd7pdbvcenep2evgp0dic4uic8vnb0j [info] Updating whatsapp status of user: 7228977818 with: %{"input" => "7228977818", "status" => "valid", "wa_id" => "5217228977818"}
7/10/2018 11:52:29 AM 16:52:29.429 [error] #PID<0.2705.2> running UsersAPIWeb.Endpoint terminated
7/10/2018 11:52:29 AM Server: people:4000 (http)
7/10/2018 11:52:29 AM Request: POST /api/bots/coppel/users/update-last-interaction
7/10/2018 11:52:29 AM ** (exit) exited in: :gen_server.call(:bolt_sips_pool, {:checkout, #Reference<0.2148655786.3957063682.261495>, true}, 5000)
7/10/2018 11:52:29 AM     ** (EXIT) time out

Environment:

My Bolt Sips configuration is:

# Set configuration for Neo4j with Bolt
config :bolt_sips, Bolt,
  hostname: "${NEO4J_HOST}",
  port: 7687,
  basic_auth: [username: "${NEO4J_USER}", password: "${NEO4J_PASSWORD}"],
  pool_size: 10,
  max_overflow: 5,
  retry_linear_backoff: [delay: 150, factor: 2, tries: 3]

My application supervisor has these children:

children = [
  # Start Neo4j connection
  worker(Bolt.Sips, [Application.get_env(:bolt_sips, Bolt)]),
  # Start the endpoint when the application starts
  supervisor(UsersAPIWeb.Endpoint, [])
]

It would be great to get your feedback about this issue.

Thanks for your help

florinpatrascu commented 6 years ago

Hi @yoelfme,

The 500 error is returned by the controller as a result of the timeout you see above.

That timeout happens because db_connection is trying to check out a new socket, but it can only wait 5_000 ms (5 seconds!); after that the call exits. Could the connection be blocking while trying to handshake with the db? Do you monitor the db? If so, did you notice any anomalies? I'd reluctantly suggest increasing the :pool_timeout option in the config, to 7_000, say. But that will only put more pressure on the pool if the db, the network, etc. is at fault. Increasing the pool size may also help if you have lots of requests per second. I'd start with the latter.

However, you didn't specify the OTP version. It could be that you're experiencing this: db_connection/issues/127.

Let me know how it goes, but I believe a slow handshake is at play, unless it's the issue above.
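
For reference, the two config changes suggested in this reply might look roughly like the fragment below. This is only a sketch. It assumes the `:pool_timeout` option named above is read from the same keyword list as the rest of the Bolt Sips config, which is not confirmed in this thread. The values 20 and 7_000 are illustrative, not recommendations.

```elixir
# Sketch only: the configuration from the issue with the two suggested changes.
config :bolt_sips, Bolt,
  hostname: "${NEO4J_HOST}",
  port: 7687,
  basic_auth: [username: "${NEO4J_USER}", password: "${NEO4J_PASSWORD}"],
  # Raised from 10; 20 is an illustrative value, tune it against real load.
  pool_size: 20,
  max_overflow: 5,
  # Assumed placement of the :pool_timeout option mentioned above.
  # 7_000 ms replaces the 5_000 ms checkout limit seen in the error.
  pool_timeout: 7_000,
  retry_linear_backoff: [delay: 150, factor: 2, tries: 3]
```

As noted above, a longer checkout timeout only hides the symptom if the db or the network is slow, so it may be worth changing `pool_size` first and watching whether the checkout timeouts go away.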