JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org
Other
5.32k stars 1.17k forks source link

Add possibility to select `scylla` or `cql` dependencies in JanusGraph Server #3580

Open porunov opened 1 year ago

porunov commented 1 year ago

ScyllaDB and DataStax Java Drivers cannot be together in JanusGraph Server because they have the same classpath. Thus, the only possibility to include Scylla drivers into JanusGraph Server right now is by manually replacing DataStax Java Drivers with Scylla Java Drivers. It will be much more convenient is we add possibility to switch between cql and scylla drivers in JanusGraph Server using either environment variables or some configurations. I don't know which approach is better but I would imagine something like below.

We ship all incompatible dependencies between janusgraph-cql and janusgraph-scylla in separate folders libsScylla and libsCQL. We add an additional option to janusgraph-server.sh to include dependencies from libsScylla if there is an environment variable exists like the following: JANUSGRAPH_SCYLLA_DRIVER_ENABLED=true. Otherwise we include dependencies from libsCQL by default.

To make the feature described above we will most likely need to change janusgraph-dist flow of building JanusGraph distributions by changing the flow to something like below:

If we change the process to be like above then it should be easy for any user to switch between CQL and Scylla driver implementations for JanusGraph Server quite easy by providing an environment variable. As Docker supports passing environment variables as well it should be easy to switch drivers in JanusGraph Docker distributions as well.

The initial problem to this issue was described by @FlorianHockmann here.

This issue is a continuation of ScyllaDB Driver integration PR: #3578

cdegroc commented 1 year ago

ScyllaDB and DataStax Java Drivers cannot be together in JanusGraph Server because they have the same classpath.

Could shading help set two different classpaths for the two java-driver libraries at build time?

porunov commented 1 year ago

ScyllaDB and DataStax Java Drivers cannot be together in JanusGraph Server because they have the same classpath.

Could shading help set two different classpaths for the two java-driver libraries at build time?

This was @farodin91 's question as well. Shading would help having 2 conflicting libraries together, but I don't understand how will janusgraph-scylla work with Shading because it extends janusgraph-cql module. I guess I'm just missing the picture here, but if anyone wants to give it a try I will happily review that.

farodin91 commented 1 year ago

We probably have to shade both and build a combat layer.

cdegroc commented 1 year ago

This was \@farodin91 's question as well. Shading would help having 2 conflicting libraries together, but I don't understand how will janusgraph-scylla work with Shading because it extends janusgraph-cql module. I guess I'm just missing the picture here, but if anyone wants to give it a try I will happily review that.

I agree. Shading solves the class name conflict (PR #3740 shows a strict way to do it), but just shading the java-driver won't work, because, as you said, janusgraph-scylla and scylla-hadoop-util barely contain any code and rely on janusgraph-cql and cassandra-hadoop-util classes instead.

What we could do, on top of shading, is "duplicate" the code of janusgraph-cql/cassandra-hadoop-util for scylla, making both janusgraph-cql/janusgraph-scylla fully-independent backend implementations.

I believe there are multiple ways to achieve this, for instance:

  1. We could move the CQL classes to a shared library (this might match what @farodin91 meant by "compat layer") and have janusgraph-cql/janusgraph-scylla implement subclasses providing cql/scylla shaded classes.
  2. We could duplicate classes from janusgraph-cql/cassandra-hadoop-util to janusgraph-scylla/scylla-hadoop-util under different classpaths and have them use their shaded java-driver respectively (maven plugins could be used to avoid actually duplicating code).

I think these (rather complex?) options could work. However, I'm not exactly sure which side-effects relocating java-driver classes could have. The driver is dynamically loading some classes specified through the reference.conf file and I think it assumes that classes live in com.datastax.oss.driver... in its code.

porunov commented 1 year ago

This was @farodin91 's question as well. Shading would help having 2 conflicting libraries together, but I don't understand how will janusgraph-scylla work with Shading because it extends janusgraph-cql module. I guess I'm just missing the picture here, but if anyone wants to give it a try I will happily review that.

I agree. Shading solves the class name conflict (PR #3740 shows a strict way to do it), but just shading the java-driver won't work, because, as you said, janusgraph-scylla and scylla-hadoop-util barely contain any code and rely on janusgraph-cql and cassandra-hadoop-util classes instead.

What we could do, on top of shading, is "duplicate" the code of janusgraph-cql/cassandra-hadoop-util for scylla, making both janusgraph-cql/janusgraph-scylla fully-independent backend implementations.

I believe there are multiple ways to achieve this, for instance:

  1. We could move the CQL classes to a shared library (this might match what @farodin91 meant by "compat layer") and have janusgraph-cql/janusgraph-scylla implement subclasses providing cql/scylla shaded classes.
  2. We could duplicate classes from janusgraph-cql/cassandra-hadoop-util to janusgraph-scylla/scylla-hadoop-util under different classpaths and have them use their shaded java-driver respectively (maven plugins could be used to avoid actually duplicating code).

I think these (rather complex?) options could work. However, I'm not exactly sure which side-effects relocating java-driver classes could have. The driver is dynamically loading some classes specified through the reference.conf file and I think it assumes that classes live in com.datastax.oss.driver... in its code.

Thank you for investigating it @cdegroc ! With shading the datastax library, as you mentioned, it's not clear what side effects will it provide. Potentially side effects are not that critical and we could document that every datastax classes are shaded in JanusGraph, thus, use must provide correct shaded classes when specifying them in reference.conf file. That said, it should be verified that side effects are not critical. Another point about shading, I don't know how will it affect transitive dependencies because usually scylla-driver is several versions behind cql-driver. This is probably quite a rare case, but potentially scylla-driver and cql-driver might potentially depend on the conflicting versions of the same library. Otherwise your suggested solutions sounds great to me. I think the solution 1 (compat layer) could be a great solution, but I imagine it will be quite a refactoring work to be done in janusgraph-cql. I might be wrong, but it seems solution 2 could be easier than solution 1. I'm not sure how these Shading solutions compare to each other. I think if we try to compare these solutions then I would think about the next.

Shading solution: makes it possible to include janusgraph-cql and janusgraph-scylla together under the same classpath which provides possibility to use different JanusGraph instances to different cql or scylla backends in the same process (+). Don't require using environment variable to switch to CQL driver for JanusGraph server (+). Doesn't allow users to use later versions of CQL / Scylla drivers by defining the newer version of the driver (-). Low probability, but potential transitive dependency problems between cql and scylla drivers which may not allow to update one or another driver by it's own in JanusGraph (-). Necessity to use datastax shaded classes in the datastax driver configuration file (-, but to be verified).

Two directories with CQL and Scylla conflicting libraries for the JanusGraph Server: Ability to easily define a newer version of one or another library than those provided by JanusGraph (+). Ability to update cql or scylla libraries separately without worrying about transitive dependencies conflicts between them (+). Ability to use datastax normal classes in the datastax driver configuration file (+). Inability to use both janusgraph-cql and janusgraph-scylla under the same classpath (-). Requirement to use an environment variable to enable CQL driver for JanusGraph server (-).

So, looks like all solutions have their cons and pros, but I would say that pros in Shading solutions are more valuable than those in Libs copy / paste solution. I would like to see shading solution to be implemented (probably 2 solution because it feels easier, but 1 is great as well). That said, I imagine that Shading approach is harder to be implemented than Copy / Paste Libs approach (where we would need to change the distribution packing flow in janusgraph-dist/pom.xml ).

porunov commented 1 year ago

After #4050 is merged, next steps we can do in our ci-publish-official.yml, and ci-publish-commit.yml is change the flow to do something like:

  1. Make distribution with CQL mvn clean install -Pjanusgraph-release -DskipTests=true --batch-mode --also-make -Dgpg.skip=true. Move it somewhere.
  2. Make distribution with Scylla mvn clean install -Pjanusgraph-release -Puse-scylla -DskipTests=true --batch-mode --also-make -Dgpg.skip=true. Move it somewhere.
  3. Unzip both distributions.
  4. Make file names diff between libs folder in CQL distribution and Scylla distribution.
  5. Copy any files with matching names into the new (combined) distribution folder libs.
  6. Copy any files which don't match into cqlLibs folder and scyllaLibs folder.
  7. Change janusgraph-server.sh file to include not only libs folder but 2 folders (libs and cqlLibs). However, if the flag JANUSGRAPH_USE_SCYLLA=true is set then the using folders will be libs and scyllaLibs.

It shouldn't be hard to make a script which is doing everything above. However, it looks like when I run mvn clean install -Pjanusgraph-release -DskipTests=true --batch-mode --also-make -Dgpg.skip=true it also builds a Docker image. @farodin91 as you were adding that logic, maybe you could suggest here. Would it be possible to move Docker image building logic to a separate command? I.e. I want first to build janusgraph-....zip and janusgraph-full....zip archives, do some manipulations with those archives and only then run Docker building image which should use those manipulated archives as a source. Do you think how to modify Docker image building flow to move it to a separate command? Maybe to something like mvn clean install -Pbuild-docker-image -DskipTests=true --batch-mode --also-make -Dgpg.skip=true ?