Closed Givemeurcookies closed 1 year ago
Thanks for this detailed write-up and I definitely agree with you here. This problem led to poor user experience as it wasn't easy to understand what the problem was when users tried to execute a traversal and only got back "The traversal source [g] for alias [g] is not configured on the server.", although the server had apparently started successfully.
I just wanted to check where we could make a change to simply let the startup fail if the configured graph could not be initialized and noticed that your error message comes from TinkerPop's DefaultGraphManager. JanusGraph however also has its own JanusGraphManager, which is the default since version 0.6.1. I think this graph manager already behaves like you suggest here.
I at least just started a JanusGraph Docker container with an invalid config (a CQL backend without any Cassandra node running) and with this graph manager, and the startup failed completely:
> docker run --rm --env JANUS_PROPS_TEMPLATE=cql --env gremlinserver.graphManager="org.janusgraph.graphdb.management.JanusGraphManager" -it docker.io/janusgraph/janusgraph:latest
[...]
15:29:27 WARN com.datastax.oss.driver.internal.core.ContactPoints - Ignoring invalid contact point cassandra:9042 (unknown host cassandra)
15:29:28 WARN com.datastax.oss.driver.internal.core.control.ControlConnection - [JanusGraph Session] Error connecting to Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=24738449), trying next node (ConnectionInitException: [JanusGraph Session|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (io.netty.channel.StacklessClosedChannelException))
15:29:28 ERROR org.apache.tinkerpop.gremlin.server.util.ServerGremlinExecutor - Could not invoke constructor on class org.janusgraph.graphdb.management.JanusGraphManager (defined by the 'graphManager' setting) with one argument of class Settings
15:29:28 ERROR org.janusgraph.graphdb.server.JanusGraphServer - JanusGraph Server was unable to start and will now begin shutdown
[...]
Unfortunately, the Docker image still uses TinkerPop's DefaultGraphManager instead of our own due to: JanusGraph/janusgraph-docker#113. I think resolving that issue will also resolve this one.
Longer timeout as default
That sounds reasonable to me. Can you please create an issue in the janusgraph-docker repo for this? If you want to contribute, then you can of course also create a PR to change the default timeout.
@FlorianHockmann We could change the behavior to always use the janusgraph manager, if the tinkerpop default is configured. We are already doing this in case of configured graphs.
We could change the behavior to always use the janusgraph manager, if the tinkerpop default is configured. We are already doing this in case of configured graphs.
Don't you think that it would be enough if we changed the default everywhere to use the JanusGraphManager? If a user then explicitly configures the TinkerPop DefaultGraphManager, then they might really want that and not ours. (Although I don't see a good reason for that right now.) The Docker image is probably the only place where our manager isn't the default yet, but it's also easy to change it there.
@Givemeurcookies: Could you please check the latest Docker image? The startup should now fail in the scenario that you described here.
I've been running the image with the environment variable gremlinserver.graphManager: org.janusgraph.graphdb.management.JanusGraphManager for a while now in Kubernetes and it seems to have fixed the issue and crashes when expected.
I can also confirm the latest docker image doesn't require this flag. Thanks for the fast fix!
Introduction
This is a bit of a bug report and a feature request. The issue is that Janusgraph doesn't initialize properly and throws the
499: The traversal source [g] for alias [g] is not configured on the server.
error, which is something that I've encountered several times over many projects and teams the last few years. From what I've seen this most often happens when Janusgraph can't connect to the configured storage backend. But I believe it's supposed to be a general error message for when there's a configuration error that prevents Janusgraph from being able to create a traversal source.
The error
The initial way you see that this error will occur is in the first logs when starting the server:
Or in human friendly format
Error connecting to Node(<node info>)
and
Graph [graph] configured at [/etc/opt/janusgraph/janusgraph.properties] could not be instantiated and will not be available in Gremlin Server.
Follow-up errors such as
Could not create GremlinScriptEngine for gremlin-groovy
groovy.lang.MissingPropertyException: No such property: graph for class: Script1
Could not initialize gremlin-groovy GremlinScriptEngine as init script could not be evaluated
will also occur.
The error won't actually be seen outside of the logs until you send your first request to Janusgraph, where you will get an error such as:
Why it's an issue
There are many other threads related to frustration around the topic such as:
So there seems to be a common theme: people are confused by the error message and wonder how to fix it, and it happens frequently. To be honest, this "bug" has been the largest headache I've had with Janusgraph, and it took me a long time to understand what it actually meant and how to fix it. Most of the time the fix was just to restart/delete docker and try again. I once tried to set up the traversal source after startup, but it of course only worked until I restarted the image. I was also very confused why the issue was sometimes persistent when starting up docker but went away if I deleted the Janusgraph volume and tried again.
Who it's affecting
First timers using Docker
It's especially a problem for users who just want to try Janusgraph for the first time using docker or docker-compose, since this error will sometimes happen if the storage backend is slow to start (which can happen in a variety of situations). This might also be why some people have had issues on Apple silicon (we weren't able to use Janusgraph for development on those machines a few years ago, when none of us had in-depth knowledge of how to run it), since Janusgraph might start up much faster on those machines compared to the storage backend.
Kubernetes users
It can also be considered a breaking issue when hosting on Kubernetes: if a cluster is shut down and then started up with a lot of services, databases etc., you can end up with a thundering herd problem, which can very easily exceed the default 30 sec timeout for waiting for the storage backend that Janusgraph is configured with. If this happens, the pod won't shut down since Janusgraph doesn't consider the error fatal. I understand that Janusgraph can technically work without a storage backend and that the traversal source can be configured afterwards, but this is absolutely not desirable in all situations.
Error reporters
Having the front-facing error be
The traversal source [g] for alias [g] is not configured on the server.
will just cause people to report that error on GitHub instead of the actual root cause of the failure. This is because the root cause is buried in the first parts of the log, which in my above example is actually
Error connecting to Node(endPoint=dev-scylla-client.scylla.svc/<ip>, hostId=null, hashCode=64f43f49), trying next node
and
Graph [graph] configured at [/etc/opt/janusgraph/janusgraph.properties] could not be instantiated and will not be available in Gremlin Server.
How to reproduce
This error is a bit of a "heisenbug" since it sometimes occurs when starting up and other times not. This is what makes it very annoying and introduces a lot of distrust in Janusgraph. I've had questions along the lines of "how do we fix this if it happens in production", which is difficult to give guarantees about, especially in docker environments. The precondition is a storage backend that takes a long time to, e.g., start up. I don't have a specific example of how to reproduce it, but it's so common that I assume most users of Janusgraph have encountered it.
Suggested fixes
As I hinted at when writing about Kubernetes, I believe the reason why this initialisation failure is considered a WARN and not FATAL is probably the modular nature of Janusgraph and the fact that it can be configured after initialisation. There are probably other errors as well that can cause this error to occur later on, which are also considered WARN.
A new flag that makes failing to initialise cause a FATAL error and shut the instance down
My suggestion is to do something some languages do, which is to have a "STRICT" flag that shuts down the server on "hard errors", such as a failure to initialise based on the configuration of the server. I believe the error is fine on its own (e.g. if you configure the traversal source afterwards), but it's not something that should appear because of an initialisation error. An initialisation error should shut down the instance; that way you know something is wrong right away. How many people don't just run
systemctl status <service>
to naively assert that their service is running? The industry standard is to fail if the database can't receive requests upon initialisation.
Having the instance fail will make at least me much more relaxed about running Janusgraph in production. If I get a configuration error, it fails. If a dependency can't be reached (due to network configuration or similar) or is down, it fails. Then I will know and it can be restarted, either automatically by e.g. k8s or manually. With the Janusgraph instance staying up despite a configuration error, an issue can go unnoticed for a long time before it's fixed.
In addition, if Janusgraph shuts down on failure to initialise, you can also be sure that all initial groovy scripts have been run before Janusgraph accepts outside requests. Often those scripts are used to e.g. set up indexing, which can be critical for operation.
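Until something like a strict flag exists, the behaviour can be approximated from outside the server. The following is only a minimal sketch in Python and not part of JanusGraph: it scans server log lines for two of the messages quoted in this issue and yields a non-zero exit code if either appears. The names FATAL_MARKERS, find_init_failures and exit_code are made up for this example, and treating these messages as fatal is an assumption of the sketch.

```python
from typing import Iterable, List

# Substrings copied from the log messages quoted in this issue.
# Treating them as fatal is an assumption of this sketch, not JanusGraph behaviour.
FATAL_MARKERS = (
    "could not be instantiated and will not be available in Gremlin Server",
    "Could not initialize gremlin-groovy GremlinScriptEngine",
)


def find_init_failures(lines: Iterable[str]) -> List[str]:
    """Return every log line that contains one of the FATAL_MARKERS."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in FATAL_MARKERS)
    ]


def exit_code(lines: Iterable[str]) -> int:
    """Return 1 if the log shows a failed graph initialisation, else 0."""
    return 1 if find_init_failures(lines) else 0
```

A wrapper script or a Kubernetes probe could pass the lines of the server log to exit_code and restart the pod on a non-zero result. Where the log is written depends on the deployment and is not assumed here.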
Longer timeout as default
An additional option is to have a long(er) timeout as default to make this issue a rarer occurrence, especially on the docker builds. I believe this should be done in addition to adding a new flag.
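As an illustration of waiting longer for the storage backend, here is a minimal Python sketch that polls a TCP port until it accepts connections or a deadline passes. It is not how JanusGraph or the Docker image implements its wait. The function name wait_for_port and the 300 second default are assumptions chosen for the example, and an open port does not guarantee that the backend is ready to serve queries.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 300.0,
                  interval_s: float = 2.0) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout_s elapses.

    Returns True as soon as one connection attempt succeeds, False if the
    deadline would be exceeded before the next attempt.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # interval_s doubles as the per-attempt connect timeout.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Covers refused connections, timeouts and unknown hosts.
            if time.monotonic() + interval_s > deadline:
                return False
            time.sleep(interval_s)
```

For the contact point seen in the log earlier in this thread, an entrypoint could call wait_for_port("cassandra", 9042) and only start the server if it returns True, exiting with an error otherwise.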