florinpatrascu / bolt_sips

Neo4j driver for Elixir
Apache License 2.0

Timeout on all queries after 15 minutes of inactivity #67

Closed brotheroftux closed 5 years ago

brotheroftux commented 5 years ago

Thanks for your hard work on this driver.

I never ran into this issue whilst working in a development environment, since there was little downtime between subsequent queries. However, after testing out the current build on a production server, I noticed that all the query execution attempts after ~15 minutes of inactivity produce a timeout exception.

My configuration for the driver is as follows:

config :bolt_sips, Bolt,
  url: "bolt://<host>:8124",
  basic_auth: [username: "neo4j", password: "***"]

I use the driver API as it is recommended in the documentation, i.e.

Bolt.Sips.conn()
|> Bolt.Sips.query!( ... )

The Neo4j server is a Docker image with default configuration.

Any ideas on what causes the trouble? Thank you.

florinpatrascu commented 5 years ago

Hi @brotheroftux - thank you for your kind comments. This is weird. I have apps running on heroku for days, and they never encounter this situation. I wonder if the docker container is terminated somehow or if it is refusing the connections after some "idling" period of time. @dominique-vassard, have you encountered anything like this recently?

@brotheroftux - can you post the errors you see, from bolt.sips?

But I am curious, is the config above the exact one you're using? Obviously not referring to the password parameter ;)

Because if it is, then maybe your pool size is way too small?!

Try using something like this instead:

config :bolt_sips, Bolt,
  url: "bolt://<host>:8124",
  basic_auth: [username: "neo4j", password: "***"],
  pool_size: 50,
  max_overflow: 2,
  #queue_interval: 500,
  #queue_target: 1500,
  retry_linear_backoff: [delay: 150, factor: 3, tries: 4]

florinpatrascu commented 5 years ago

also make sure the docker version doesn't require :ssl!

brotheroftux commented 5 years ago

@florinpatrascu Thanks for your quick response.

This is the stacktrace:

** (exit) an exception was raised:
    ** (Bolt.Sips.Exception) timeout
        (bolt_sips) lib/bolt_sips/query.ex:59: Bolt.Sips.Query.query!/3
        (backend) lib/moderation/validators.ex:33: Moderation.Validators.exists_check/2
        (backend) lib/moderation/validators.ex:17: anonymous fn/2 in Moderation.Validators.will_create_new_entities?/3
        (elixir) lib/enum.ex:2934: Enum.filter_list/2
        (backend) lib/moderation/validators.ex:16: Moderation.Validators.will_create_new_entities?/3
        (elixir) lib/enum.ex:1940: Enum."-reduce/3-lists^foldl/2-0-"/3
        (backend) lib/moderation_web/controllers/warning_controller.ex:24: ModerationWeb.WarningController.warnings/2
        (backend) lib/moderation_web/controllers/warning_controller.ex:1: ModerationWeb.WarningController.action/2
        (backend) lib/moderation_web/controllers/warning_controller.ex:1: ModerationWeb.WarningController.phoenix_controller_pipeline/2
        (backend) lib/moderation_web/endpoint.ex:1: ModerationWeb.Endpoint.instrument/4
        (phoenix) lib/phoenix/router.ex:275: Phoenix.Router.__call__/1
        (backend) lib/moderation_web/endpoint.ex:1: ModerationWeb.Endpoint.plug_builder_call/2
        (backend) lib/plug/debugger.ex:122: ModerationWeb.Endpoint."call (overridable 3)"/2
        (backend) lib/moderation_web/endpoint.ex:1: ModerationWeb.Endpoint.call/2
        (phoenix) lib/phoenix/endpoint/cowboy2_handler.ex:33: Phoenix.Endpoint.Cowboy2Handler.init/2
        (cowboy) c:/Users/afc20/Documents/projects/marketplace-backend/deps/cowboy/src/cowboy_handler.erl:41: :cowboy_handler.execute/2
        (cowboy) c:/Users/afc20/Documents/projects/marketplace-backend/deps/cowboy/src/cowboy_stream_h.erl:296: :cowboy_stream_h.execute/3
        (cowboy) c:/Users/afc20/Documents/projects/marketplace-backend/deps/cowboy/src/cowboy_stream_h.erl:274: :cowboy_stream_h.request_process/3
        (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

As you can see, this is quite inconclusive to say the least.

I copy-pasted your config (the pooling options etc.), will give it some time now to see if this fixes anything.

I am 100% sure the container is not terminated. The issue is fixed by just restarting the Phoenix application (which is not really convenient, as you'd guess). Also, Neo4j Browser experiences no issues I reckon (it just runs queries fine after hours of idling).

I don't think SSL is the problem either — Neo4j Browser has its connection set up without SSL and, remember, the backend app works fine, well, for the first ~15 minutes anyway.

brotheroftux commented 5 years ago

Hilarious. Then it started working again. [image]

Config changes had no effect. My local instance of the API server still fails.

florinpatrascu commented 5 years ago

Ugh that’s strange.

florinpatrascu commented 5 years ago

Btw what Erlang version is used with the app? Probably not the case but I’ll leave this here, for reference: db_connection/issues/127

We’ll make sure the next version has more verbose logging.

Also, when you get the timeouts, can you open an iex session to the prod env and check if you can run a simple Cypher query? And can you please check if the Neo4j server logs are showing anything suspicious?!
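For reference, the iex check could look like the minimal sketch below. `IdleCheck` is a hypothetical helper, not part of bolt_sips; it takes the query function as an argument so that it only assumes the function returns `{:ok, _}` or `{:error, _}`, which is what the non-raising `Bolt.Sips.query/2` is expected to do.

```elixir
defmodule IdleCheck do
  # Hypothetical helper, not part of bolt_sips.
  # Runs a trivial Cypher statement through the given 1-arity function and
  # returns the outcome together with the elapsed time in milliseconds.
  def ping(run_query) when is_function(run_query, 1) do
    {micros, result} = :timer.tc(fn -> run_query.("RETURN 1 AS n") end)
    ms = div(micros, 1000)

    case result do
      {:ok, _rows} -> {:ok, ms}
      {:error, reason} -> {:error, reason, ms}
    end
  end
end

# In an iex session on the affected node, assuming :bolt_sips is already
# started with the application's config:
#
#   IdleCheck.ping(fn cypher -> Bolt.Sips.query(Bolt.Sips.conn(), cypher) end)
```

Running it once right after a restart and again after the idle period should show whether the failure is specific to idle connections, and how long the call waits before giving up.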

Sorry for replying with more questions, but there’s not much in that timeout error that we can work with.

I’ll try to reproduce the error locally, but so far I haven’t been able to.

florinpatrascu commented 5 years ago

Oh and we also have this option:

config :bolt_sips,
  log: true,
  log_hex: false

But make sure the log level is set to :debug, i.e.:

config :logger, :console,
  level: :debug,
  format: "$date $time [$level] $metadata$message\n"

This will definitely show you what’s happening between the app and the server but it could be extremely verbose, not a recommended setting for production.

dominique-vassard commented 5 years ago

Hi, I haven't encountered this issue recently, but I don't have any project that idles that long, at least with the current bolt_sips version.

I'm curious, so I will test on my laptop (with and without docker) and see if I can reproduce the issue.

dominique-vassard commented 5 years ago

Seems to work fine with OTP 21 here

brotheroftux commented 5 years ago

Confirmed: the issue was our reverse proxy server. Sorry for the trouble.

Thanks for your responses, @florinpatrascu @dominique-vassard
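The thread does not say what was changed on the proxy. If a proxy between the app and Neo4j drops idle connections and its timeout cannot be raised, one possible workaround is to send a cheap query periodically. The sketch below is an assumption, not something from bolt_sips: `Neo4jKeepAlive`, its `:ping` and `:interval` options, and the 5 minute default are all hypothetical.

```elixir
defmodule Neo4jKeepAlive do
  # Hypothetical module, not part of bolt_sips.
  # Calls the given zero-arity :ping function every :interval milliseconds
  # so that the connection behind it does not sit idle for long.
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts)
  end

  @impl true
  def init(opts) do
    state = %{
      ping: Keyword.fetch!(opts, :ping),
      # Assumed default; pick a value below the proxy's idle timeout.
      interval: Keyword.get(opts, :interval, :timer.minutes(5))
    }

    schedule(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:ping, state) do
    # The result is ignored; only the traffic matters here.
    _ = state.ping.()
    schedule(state.interval)
    {:noreply, state}
  end

  defp schedule(interval), do: Process.send_after(self(), :ping, interval)
end

# Possible use, assuming :bolt_sips is started:
#
#   Neo4jKeepAlive.start_link(
#     ping: fn -> Bolt.Sips.query(Bolt.Sips.conn(), "RETURN 1") end
#   )
```

Note that each tick exercises only one connection from the pool, so with a large `pool_size` some connections may still go idle; fixing the proxy's idle timeout is the more reliable option.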

florinpatrascu commented 5 years ago

Thank you, Daniel! The case and the feedback are important to us. Please feel free to share any feedback or suggestions, we’ll be happy to help!