Support for bolt+routing protocol

AO-StreetArt commented 6 years ago

Supporting a Neo4j Causal Cluster (https://neo4j.com/docs/operations-manual/3.3/clustering/causal-clustering/introduction/), support for retrieving routing tables using bolt+routing is needed (https://neo4j.com/docs/developer-manual/3.3/drivers/client-applications/#_routing_drivers_bolt_routing) (https://neo4j.com/docs/developer-manual/3.3/terminology/#term-bolt-routing-protocol).

I was going to implement cluster support as a part of the higher-level driver I built on top of this library (https://github.com/AO-StreetArt/NeoCpp). However, after some research, the architecture of Neo4j clusters makes the routing tables accessible from a Core Server, which should be utilized in order to bring cluster support in line with official drivers.

As far as I can tell, the best way to use them is retrieving those tables using bolt, and then using them with a load balancing strategy (example https://github.com/neo4j/neo4j-python-driver/blob/1.6/neo4j/v1/routing.py).

cleishm commented 6 years ago

Hi @AO-StreetArt,

As you've noted, the bolt+routing in neo4j is based on:

Connecting to a "core" server using BOLT
Sending a query (stored proc call) and retrieving the result table, which contains host+port, role, etc, for other neo4j servers in the cluster.
Disconnecting from the "core" server.
Establishing connections to appropriate servers from the list, based on the queries presented.
Updating/maintaining the list of known neo4j-servers whenever connections fail, etc.
???

Thus far, neo4j-client has just focussed on implementing the bolt protocol and handing everything needed for establishing a connection to a server, sending it queries and obtaining results, leaving the higher level details, like determining server addresses, to the developer consuming this library.

However, there's no reason support for this "higher-layer" of interaction couldn't be developed in this codebase and made available as an additional API. I just haven't had the time or motivation to do so myself :) Happy to discuss and help guide anyone interested in implementing this.

Cheers, Chris

AO-StreetArt commented 6 years ago

Thanks for the reply @cleishm , I'd be happy to implement this, would love your thoughts.

I think it makes sense to allow the consumer to handle the higher-level details, with one caveat. The actual communication with the core server to retrieve the result table. To me, it makes sense conceptually that libneo4j abstracts away the BOLT protocol, with other implementation details left to the consumer.

With that in mind, basically the changes to libneo4j would entail supporting access to the return results from the query "CALL dbms.cluster.routing.getRoutingTable({context})", as well as figuring out how to access the routing context to pass in (still need to figure out exactly where that comes from).

I don't think that libneo4j should really do much beyond that. What do you think?

cleishm commented 6 years ago

I had a couple of more requests for this lately. Shall we discuss some more and consider some work on it?

AO-StreetArt commented 6 years ago

Sounds great. Are you aiming to enable full routing support within the command line client, or just provide for retrieving the routing table?

We can get on IRC/Slack to discuss at some point if you'd like.

cleishm commented 6 years ago

Hey @AO-StreetArt: Full support, whatever that means ;-) Retrieving the routing table can already be done: it's just a cypher query to a stored proc (which is what makes it Neo4j specific, rather than anything to do with the bolt protocol).

AO-StreetArt commented 6 years ago

Sorry for the delay, I took some time today to think about this and pulled together a preliminary design. A few notes before I dump what I'm thinking:

I've mostly been working with c++11, so my C API design may not be perfect. Please comment where you think it could be improved.
This is the design for library support for the routing protocol. Of course, once this is done, the command line client will also need to be extended to use the new functionality.
I tried to base the design on the Python driver (https://github.com/neo4j/neo4j-python-driver/blob/1.7/neo4j/v1/routing.py)

Connection Pool responsible for maintaining neo4j_connection's
Support bolt+routing, should same connection pool support bolt as well?
End user can utilize neo4j_connection directly, or make calls to neo4j_connection_pool to retrieve connections for each transaction.
Pool is likely array of structs to store data, functions to access (inline with current API)
Pool should be thread-safe (what synchronization primitives to use?), or are clients expected to synchronize calls to retrieve connections?
Configurable routing strategies with round-robin default

New Function Pointers
- neo4j_routing_strategy - define functions which select a connection from a list -- Function needs to be passed neo4j_connection_pool to access pool state
New structs
- neo4j_connection_pool
New methods
- int neo4j_connection_pool_init(neo4j_connection_pool *pool)
- neo4j_connection neo4j_retrieve_connection(neo4j_connection_pool pool, neo4j_routing_strategy *strategy)
- int neo4j_release_connection(neo4j_connection_pool pool, neo4j_connection connection)
- int neo4j_release_connection_pool(neo4j_connection_pool *pool)
- int neo4j_round_robin_routing_strategy(neo4j_connection_pool *pool)

neo4j_retrieve_connection logic
- Update Routing Table -- Retrieve routing table from Neo4j Cluster Manager node -- Compare to existing connections, add/remove as is necessary -- May involve releasing existing connections, not creating new ones
- Assemble list of available connections, only those currently not in use
- Call neo4j_routing_strategy to retrieve connection info
- Generate neo4j_connection, if not existing, store in array of active connections, return
neo4j_release_connection logic
- make connection available to be returned on another request
- do not release actual connection itself
neo4j_release_connection_pool logic
- Release all active connections

cleishm commented 6 years ago

Hi @AO-StreetArt,

Thanks for making this initial foray!

I think this sounds like a good initial scope. My only comment is that I'd prefer not to expose it using a name like "connection pool", as that feels like an implementation detail. I'd probably just call it a "driver", from when you can start a session (i.e. obtain a connection) with a server.

We can also avoid the neo4j_release_connection method by overloading neo4j_close so that it returns to a pool rather than actually closing. I use OO constructs in many parts of the codebase for doing this (e.g. neo4j_close_results), and it wouldn't be hard to do the same here.

By doing that, we'd make it such that it wouldn't matter if the connection was obtained via a driver interface, or using neo4j_connect(...) directly - usage after that point would be the same.

Cheers, Chris

AO-StreetArt commented 6 years ago

Sorry for the delay, I was on vacation and have just had the time to start playing with this. I think your suggestions sound good and I'll implement them both.

Thus far, I've got a Causal Cluster up and running, and I'm playing with the code required to pull down and parse the routing table. In the course of pulling this together, I've come across another subtle difference we'll need to account for: routing read vs write transactions.

In the Causal Cluster, only the Core servers can handle write queries. Read-replicas (likely to be many more of these in the cluster) are only capable of executing read queries. This means that, when we look at the routing table, we're not just looking at a list of addresses, but a list of addresses and their corresponding role. For example, here's the output of the getRoutingTable procedure in my Causal Cluster (from the Neo4j Browser):

[{ "addresses": [ "localhost:7687" ], "role": "WRITE" }, { "addresses": [ "localhost:7687" ], "role": "READ" }, { "addresses": [ "localhost:7687", "localhost:7687", "localhost:7687" ], "role": "ROUTE" }]

I think the correct design is to have two methods for retrieving connections, one to retrieve a read-only connection and another to retrieve a write connection. So, we'd replace neo4j_retrieve_connection() with neo4j_retrieve_write_connection() and neo4j_retrieve_read_connection(). What are your thoughts?

cleishm commented 6 years ago

Hey!

Before you get too far ahead, it'd be good to share your code and get some feedback. I'd hate to ask to rework a lot after you'd written a significant amount. Do you have a branch you're working to? Can you do some incremental steps?

As for the specific question, I agree you'll need to differentiate the type of connection you need. However, rather than embed that in the function name, I'd follow the previous statement: make it a "type" argument. In this case, I'd use an int and add a couple of #defines that bitmask together, so you can retrieve any connection with a type that matches that mask.

cleishm commented 6 years ago

Here's the API I'd suggest:

neo4j_driver_t *driver = neo4j_driver("bolt+routing://....", config, flags);
neo4j_connection_t *connection = neo4j_session(driver, NEO4J_WRITE_SESSION);
neo4j_result_stream_t *results =
            neo4j_run(connection, "RETURN 'hello world'", neo4j_null);

(obviously with error handling and cleanup in addition)

AO-StreetArt commented 6 years ago

I have a fork that I'm working on, but not much coding done there yet. I'm still trying to get a feel for the required steps by playing with an actual cluster.

You can find my docker-compose.yml file to start a Causal Cluster here: https://gist.github.com/AO-StreetArt/8dd744b9acdf3a9458006ed50abc899e

You can find a basic script which uses libneo4j-client to view the routing table here: https://gist.github.com/AO-StreetArt/047a2b254e5e1f0f3ca685fe1a03dbfc

I like the API you outlined, and I'd be happy to do some incremental work and let you review.

cleishm / libneo4j-client

Support for bolt+routing protocol #26