JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org

"499: The traversal source [g] for alias [g] is not configured on the server." - Requires a better solution #3644

Closed Givemeurcookies closed 1 year ago

Givemeurcookies commented 1 year ago

Introduction

This is part bug report, part feature request. It concerns the issue where JanusGraph doesn't initialize properly and throws the 499: The traversal source [g] for alias [g] is not configured on the server. error, something I've encountered several times across many projects and teams over the last few years.

From what I've seen, this most often happens when JanusGraph can't connect to the configured storage backend, but I believe it's meant to be a general error message for any configuration error that prevents JanusGraph from creating a traversal source.

The error

The first sign that this error will occur shows up in the logs when the server starts:

[timestamp] WARN  com.datastax.oss.driver.internal.core.control.ControlConnection - [JanusGraph Session] Error connecting to Node(endPoint=dev-scylla-client.scylla.svc/<ip>, hostId=null, hashCode=64f43f49), trying next node (ConnectionInitException: [JanusGraph Session|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (io.netty.channel.StacklessClosedChannelException))
[timestamp] WARN  org.apache.tinkerpop.gremlin.server.util.DefaultGraphManager - Graph [graph] configured at [/etc/opt/janusgraph/janusgraph.properties] could not be instantiated and will not be available in Gremlin Server.  GraphFactory message: GraphFactory could not instantiate this Graph implementation [class org.janusgraph.core.JanusGraphFactory]

Or, in human-friendly form: Error connecting to Node(<node info>) and Graph [graph] configured at [/etc/opt/janusgraph/janusgraph.properties] could not be instantiated and will not be available in Gremlin Server.

Follow-up errors such as Could not create GremlinScriptEngine for gremlin-groovy, groovy.lang.MissingPropertyException: No such property: graph for class: Script1, and Could not initialize gremlin-groovy GremlinScriptEngine as init script could not be evaluated will also occur.

Outside of the logs, you won't actually see the error until you send your first request to JanusGraph, at which point you get something like:

WARN  org.apache.tinkerpop.gremlin.server.handler.OpSelectorHandler - The traversal source [g] for alias [g] is not configured on the server.
org.apache.tinkerpop.gremlin.server.op.OpProcessorException: The traversal source [g] for alias [g] is not configured on the server.
    at org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor.validateTraversalSourceAlias(TraversalOpProcessor.java:124)
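
For illustration, here's a minimal sketch of what that first request can look like from the client's side, using the TinkerPop Java driver (the host, port, and submitted traversal are assumptions, not taken from the report):

```java
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class FirstRequest {
    public static void main(String[] args) {
        // Assumes a JanusGraph/Gremlin Server listening on localhost:8182.
        Cluster cluster = Cluster.build("localhost").port(8182).create();
        Client client = cluster.connect();
        try {
            // The connection itself succeeds even though graph instantiation failed at startup;
            // only this first traversal comes back with
            // "The traversal source [g] for alias [g] is not configured on the server."
            System.out.println(client.submit("g.V().limit(1)").all().get());
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            cluster.close();
        }
    }
}
```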

Why it's an issue

There are many other threads related to frustration around this topic.

So there seems to be a common theme: people are confused by the error message, wonder how to fix it, and run into it frequently. To be honest, this "bug" has been the largest headache I've had with JanusGraph, and it took me a long time to understand what it actually meant and how to fix it. Most of the time the fix was just to restart or delete the Docker container and try again. I once tried to set up the traversal source afterwards, but of course that only worked until I restarted the image. I was also very confused about why the issue was sometimes persistent when starting up Docker but went away if I deleted the JanusGraph volume and tried again.

Who it's affecting

First-timers using Docker

It's especially a problem for users who just want to try JanusGraph for the first time using docker or docker-compose, since this error will sometimes happen if the storage backend is slow to start (which can happen in a variety of situations). This might also be why some people have had issues on Apple silicon (we weren't able to use JanusGraph for development on those machines a few years ago, back when none of us had in-depth knowledge of how to run it), since JanusGraph might start up much faster on those machines than the storage backend does.

Kubernetes users

It can also be considered a breaking issue when hosting on Kubernetes. If a cluster running a lot of services, databases, etc. is shut down and then started up again, you can end up with a thundering herd problem that very easily blows past the default 30-second timeout JanusGraph is configured with for waiting on the storage backend. If this happens, the pod won't shut down, since JanusGraph doesn't consider the error fatal. I understand that JanusGraph can technically work without a storage backend and that the traversal source can be configured afterwards, but this is absolutely not desirable in all situations.

Error reporters

Having the front-facing error be The traversal source [g] for alias [g] is not configured on the server. will just cause people to report that error on GitHub instead of the actual root cause of the failure, because the root cause is buried in the very first part of the log. In my example above, that root cause is actually Error connecting to Node(endPoint=dev-scylla-client.scylla.svc/<ip>, hostId=null, hashCode=64f43f49), trying next node and Graph [graph] configured at [/etc/opt/janusgraph/janusgraph.properties] could not be instantiated and will not be available in Gremlin Server.

How to reproduce

This error is a bit of a "heisenbug" since it sometimes occurs on startup and other times doesn't. That is what makes it so annoying and introduces a lot of distrust in JanusGraph; I've had questions along the lines of "how do we fix this if it happens in production?", which is difficult to give guarantees about. The precondition, especially in Docker environments, is that the storage backend takes a long time to e.g. start up. I don't have a specific recipe for reproducing it, but it's so common that I assume most JanusGraph users have encountered it.

Suggested fixes

As I hinted at when writing about Kubernetes, I believe the reason that

[timestamp] WARN  org.apache.tinkerpop.gremlin.server.util.DefaultGraphManager - Graph [graph] configured at [/etc/opt/janusgraph/janusgraph.properties] could not be instantiated and will not be available in Gremlin Server.  GraphFactory message: GraphFactory could not instantiate this Graph implementation [class org.janusgraph.core.JanusGraphFactory]

is considered a WARN and not FATAL is probably the modular nature of JanusGraph and the fact that it can be configured after initialisation. There are probably other errors as well that can cause this error to occur later on and are likewise only considered WARN.

A new flag that makes failing to initialise cause a FATAL error and shut the instance down

My suggestion is to do what some languages do and add a "STRICT" flag that shuts the server down on "hard errors", such as a failure to initialise based on the server's configuration. I believe the error is fine on its own (i.e. if you configure the traversal source afterwards), but it's not something that should appear because of an initialisation error. An initialisation error should shut down the instance; that way you know right away that something is wrong. How many of us don't just run systemctl status <service> to naively assert that their service is running? The industry standard is to fail if the database can't receive requests upon initialisation.
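
To illustrate the idea, here is a purely hypothetical sketch; neither this class, the strict setting, nor the wiring exists in JanusGraph today. It shows the kind of check such a mode could run once the graph manager has been set up:

```java
import org.apache.tinkerpop.gremlin.server.GraphManager;
import java.util.Set;

// Hypothetical sketch only: nothing here is part of JanusGraph or TinkerPop.
public final class StrictStartupCheck {

    // Verify that every graph named in the server config was actually instantiated.
    // With strict mode enabled, a missing graph terminates the process instead of
    // leaving the server up without a usable traversal source [g].
    public static void verify(GraphManager graphManager, Set<String> configuredGraphNames, boolean strict) {
        for (String name : configuredGraphNames) {
            if (graphManager.getGraph(name) == null) {
                System.err.println("Graph [" + name + "] could not be instantiated");
                if (strict) {
                    System.exit(1); // fail fast so orchestrators (k8s, systemd) can restart the instance
                }
            }
        }
    }
}
```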

Having the instance fail will make at least me much more relaxed about running JanusGraph in production. If I have a configuration error, it fails. If a dependency can't be reached (due to network configuration or similar) or is down, it fails. Then I will know, and it can be restarted, either automatically by e.g. Kubernetes or manually. By having the JanusGraph instance stay up with a configuration error, an issue can go unnoticed for a long time before it's fixed.

In addition, if JanusGraph shuts down on a failure to initialise, you can also ensure that all initial Groovy scripts have been run before JanusGraph starts accepting outside requests. Those scripts are often used to e.g. set up indexing, which can be critical for operation.
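
For context, this is roughly the kind of schema/index setup such an init script performs, sketched here in Java against the JanusGraph management API (the property-key and index names and the properties-file path are placeholders; real init scripts typically make the same calls from Groovy):

```java
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.JanusGraphManagement;

public class InitIndexSketch {
    public static void main(String[] args) {
        // Open the graph from a properties file (placeholder path).
        JanusGraph graph = JanusGraphFactory.open("/etc/opt/janusgraph/janusgraph.properties");
        JanusGraphManagement mgmt = graph.openManagement();
        // Only create the index if it doesn't exist yet, so the script can be re-run safely.
        if (mgmt.getGraphIndex("byName") == null) {
            PropertyKey name = mgmt.containsPropertyKey("name")
                    ? mgmt.getPropertyKey("name")
                    : mgmt.makePropertyKey("name").dataType(String.class).make();
            mgmt.buildIndex("byName", Vertex.class).addKey(name).buildCompositeIndex();
        }
        mgmt.commit();
        graph.close();
    }
}
```

If the instance refuses to start until scripts like this have completed, clients can rely on the indexes being in place as soon as the server answers requests.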

Longer timeout as default

An additional option is to make a longer timeout the default, so that this issue becomes a rarer occurrence, especially in the Docker builds. I believe this should be done in addition to adding a new flag.

FlorianHockmann commented 1 year ago

Thanks for this detailed write-up; I definitely agree with you here. This problem led to a poor user experience, as it wasn't easy to understand what the problem was when users tried to execute a traversal and only got back The traversal source [g] for alias [g] is not configured on the server., even though the server had apparently started successfully.

I just wanted to check where we could make a change to simply let the startup fail if the configured graph could not be initialized, and I noticed that your error message comes from TinkerPop's DefaultGraphManager. JanusGraph, however, also has its own JanusGraphManager, which has been the default since version 0.6.1. I think this graph manager already behaves the way you suggest here. I at least just started a JanusGraph Docker container with an invalid config (a CQL backend without any Cassandra node running) and with this graph manager, and it let the startup fail completely:

> docker run --rm --env JANUS_PROPS_TEMPLATE=cql --env gremlinserver.graphManager="org.janusgraph.graphdb.management.JanusGraphManager" -it docker.io/janusgraph/janusgraph:latest
[...]
15:29:27 WARN  com.datastax.oss.driver.internal.core.ContactPoints - Ignoring invalid contact point cassandra:9042 (unknown host cassandra)
15:29:28 WARN  com.datastax.oss.driver.internal.core.control.ControlConnection - [JanusGraph Session] Error connecting to Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=24738449), trying next node (ConnectionInitException: [JanusGraph Session|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (io.netty.channel.StacklessClosedChannelException))
15:29:28 ERROR org.apache.tinkerpop.gremlin.server.util.ServerGremlinExecutor - Could not invoke constructor on class org.janusgraph.graphdb.management.JanusGraphManager (defined by the 'graphManager' setting) with one argument of class Settings
15:29:28 ERROR org.janusgraph.graphdb.server.JanusGraphServer - JanusGraph Server was unable to start and will now begin shutdown
[...]

Unfortunately, the Docker image still uses TinkerPop's DefaultGraphManager instead of our own due to JanusGraph/janusgraph-docker#113. I think resolving that issue will also resolve this one.

Longer timeout as default

That sounds reasonable to me. Can you please create an issue in the janusgraph-docker repo for this? If you want to contribute, then you can of course also create a PR to change the default timeout.

farodin91 commented 1 year ago

@FlorianHockmann We could change the behavior to always use the JanusGraphManager if the TinkerPop default is configured. We are already doing this in the case of configured graphs.

FlorianHockmann commented 1 year ago

We could change the behavior to always use the JanusGraphManager if the TinkerPop default is configured. We are already doing this in the case of configured graphs.

Don't you think it would be enough if we changed the default everywhere to use the JanusGraphManager? If a user then explicitly configures TinkerPop's DefaultGraphManager, they might really want that one and not ours. (Although I don't see a good reason for that right now.) The Docker image is probably the only place where our manager isn't the default yet, but it's also easy to change it there.

FlorianHockmann commented 1 year ago

@Givemeurcookies: Could you please check the latest Docker image? The startup should now fail in the scenario that you described here.

Givemeurcookies commented 1 year ago

I've been running the image with the environment variable gremlinserver.graphManager: org.janusgraph.graphdb.management.JanusGraphManager for a while now in Kubernetes, and it seems to have fixed the issue; the instance crashes when expected.

I can also confirm the latest docker image doesn't require this flag. Thanks for the fast fix!