florinpatrascu / bolt_sips

Neo4j driver for Elixir
Apache License 2.0

Connection pool creates many more connections than specified. #41

Closed · MachinesAreUs closed this issue 5 years ago

MachinesAreUs commented 6 years ago

My team and I found that bolt_sips exhibits two apparently incorrect behaviors regarding connection pooling:

  1. It opens many more TCP connections than specified by the pool size configuration (see the config sketch after this list). Although the number of processes in the pool is correct, the number of sockets at the OS level grows almost indefinitely, bounded only by the number of file descriptors the process is allowed to open.
  2. When that limit is reached, query execution becomes erratic and unpredictable. You may get a response if you wait long enough, or after a while you may get an error that makes no sense to the client application. The timeout parameter seems to apply only to the connection between the driver and the Neo4j server, not to the client application's requests.
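
For context, the pool in the 0.x releases is configured along these lines; the key names (pool_size, max_overflow, timeout) follow the 0.x README, so treat this as a sketch rather than the exact config used here:

config :bolt_sips, Bolt,
  hostname: "localhost",
  port: 7687,
  pool_size: 5,      # worker processes in the pool
  max_overflow: 2,   # extra workers allowed under load
  timeout: 15_000    # driver-to-server timeout, in ms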

You can try it yourself by starting this minimal application. Just change the query/queries you want to execute in BoltSipsLoad.
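
The linked repository isn't reproduced here, but a minimal sketch of what such a load_test/3 could look like, assuming the Bolt.Sips 0.x API (Bolt.Sips.conn/0 and Bolt.Sips.query!/2) and a placeholder query:

defmodule BoltSipsLoad do
  # Hypothetical reconstruction of the load generator described above.
  @query "MATCH (n) RETURN count(n)"

  # Run `iterations` rounds; each round spawns `concurrency` processes,
  # each executing one query, then sleeps `interval_ms` before the next round.
  def load_test(iterations, concurrency, interval_ms) do
    for _ <- 1..iterations do
      for _ <- 1..concurrency do
        spawn(fn -> Bolt.Sips.query!(Bolt.Sips.conn(), @query) end)
      end

      Process.sleep(interval_ms)
    end

    :ok
  end
end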

After cloning and compiling the dependencies, try:

$ iex -S mix
iex(1)> BoltSipsLoad.load_test(5, 1, 500)

This repeats 5 iterations: each launches 1 process to execute a query, then waits 500 ms before the next iteration. In another terminal you can check the number of open sockets to the Neo4j server, substituting $pid with the OS process id of your BEAM.
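
One way to capture that pid beforehand (alternatively, call :os.getpid() from inside the running iex session):

$ pid=$(pgrep -f "iex -S mix")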

$ lsof -nP -i4TCP  | awk -v pid=$pid '$2 == pid {print $0}' | grep 7687 | wc -l
5

5 sockets to the Bolt port, as expected.

Now something more interesting. Let's run 10 iterations, launching 100 processes in each one.

iex(2)> BoltSipsLoad.load_test(10, 100, 500)
$ lsof -nP -i4TCP  | awk -v pid=$pid '$2 == pid {print $0}' | grep 7687 | wc -l
470

What?!

This is confirmed in the Observer application. Look at all those tcp_inet ports:

[screenshot: Observer showing a long list of tcp_inet ports]

And each one of them is a connection to the Neo4j Bolt port.

[screenshot: port details showing the connection to the Neo4j Bolt port]

Unfortunately this is causing trouble in a production system that was just handed to me. Increasing the limit on the number of file descriptors the process can open just moves the problem somewhere else, because the Neo4j server can't handle thousands of connections without running into performance problems. In any case, there shouldn't be that many open connections; that's what the pool is for, isn't it?
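
For reference, the stopgap mentioned above amounts to raising the per-process descriptor limit in the shell before starting the VM, e.g.:

$ ulimit -n 65536
$ iex -S mix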

It would be great to get your confirmation/feedback about this issue.

MachinesAreUs commented 6 years ago

Environment:

florinpatrascu commented 6 years ago

Hmm. Thank you for the detailed report, and I am sorry for the troubles. I’ll have to find the time to dedicate to this issue, but as a quick step forward, can you please try the driver code from the master branch?
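
For anyone following along: pointing the dependency at the master branch uses Mix's standard git-dependency syntax, roughly:

defp deps do
  [
    {:bolt_sips, github: "florinpatrascu/bolt_sips", branch: "master"}
  ]
end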

MachinesAreUs commented 6 years ago

For some reason, when I tried to update to 0.5 (and failed, because it isn't on hex.pm and I hadn't read about the dependency on db_connection/master), I assumed we were already using the db_connection pooling. Wrong. I've just switched to master and it appears to work fine on a first round.

I'll keep you informed.

Thanks for the hint!

florinpatrascu commented 6 years ago

This is good news, thanks a lot for confirming the master branch! Please keep me posted, as we'll refactor the driver and add the bolt+routing protocol in addition to the bolt-only one.
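
(For context: bolt+routing is the URI scheme Neo4j used for cluster routing before the neo4j:// scheme was introduced. In a routing-aware driver one would expect to select it via the connection URL, along these purely hypothetical lines:)

config :bolt_sips, Bolt,
  url: "bolt+routing://neo4j-cluster.example.com:7687"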

florinpatrascu commented 6 years ago

I’ll still look at the issue you reported!

MachinesAreUs commented 6 years ago

Well, we ran some tests and deployed our application using the version on master. It's been more than 12 hours since then, and everything appears to be working fine 👍

florinpatrascu commented 6 years ago

W⦿‿⦿t!