JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org

gremlin_python.driver.protocol.GremlinServerError: 500: Could not start new transaction #3159

Open homealim2012 opened 2 years ago

homealim2012 commented 2 years ago

When a slow query times out, gremlin-console displays: Evaluation exceeded the configured 'evaluationTimeout' threshold of 30000 ms or evaluation was otherwise cancelled directly for request [g.V().count()]: null - try increasing the timeout with the :remote command. The Python client displays: gremlin_python.driver.protocol.GremlinServerError: 500: Could not start new transaction. After that, almost all subsequent operations fail with the same error, saying that a new transaction cannot be started. This effectively brings down the graph database, even though the trigger is an ordinary user operation. How can I change the configuration so that one user's operations do not affect other users, or change how the server handles this exception?

li-boxuan commented 2 years ago

Almost all subsequent operations will show that a new transaction cannot be started

Can you elaborate on this, and ideally give an example to reproduce?

Enprogames commented 11 months ago

I've been dealing with this issue too. It mainly occurs after I start a huge operation that fails. The database then reports that it is unable to roll back the transaction, and from that point on it cannot perform any new operations, hence your error.

I used the code below to trigger this:

import os
from tqdm import tqdm
from dotenv import load_dotenv

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __, GraphTraversalSource
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

load_dotenv()

GRAPH_DB_HOST = os.getenv("GRAPH_DB_HOST")
GRAPH_DB_USER = os.getenv("GRAPH_DB_USER")
GRAPH_DB_PASSWORD = os.getenv("GRAPH_DB_PASSWORD")
GREMLIN_SEVER_PORT = os.getenv("GREMLIN_SEVER_PORT")
GRAPH_DB_URL = f"ws://{GRAPH_DB_HOST}:{GREMLIN_SEVER_PORT}/gremlin"

g: GraphTraversalSource = traversal().withRemote(
    DriverRemoteConnection(GRAPH_DB_URL, 'g',
                           username=GRAPH_DB_USER,
                           password=GRAPH_DB_PASSWORD)
)

# clear database
print("Clearing vertices")
if g.V().count().next() > 0:
    g.V().drop().iterate()
    print("All vertices cleared.")

vertices_to_add = 2_000_000
chunk_size = 100

chunk_count = 0
my_traversal = g
progress_bar = tqdm(total=(vertices_to_add // chunk_size) + 1)
progress_bar.set_description("Adding vertices")

# add vertices in chunks of size chunk_size
for i in range(vertices_to_add):
    my_traversal = my_traversal.addV('person').property('id', i)

    chunk_count += 1

    if chunk_count == chunk_size:
        my_traversal.iterate()
        my_traversal = g
        chunk_count = 0
        progress_bar.update(1)

# add any remaining vertices
if chunk_count > 0:
    my_traversal.iterate()

progress_bar.close()

# clear database
print("Clearing vertices")
if g.V().count().next() > 0:
    g.V().drop().iterate()
    print("All vertices cleared.")

First, it creates 2M vertices. Then, it tries to delete all of them. If deleting the vertices times out or fails, and the database fails to roll back the transaction, it seems to leave the database in a corrupted state.

One potential solution I found is to increase the stack memory size, as mentioned in this Stack Overflow post. Another is to increase the evaluationTimeout by adding these lines:

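# TraversalStrategy is not imported in the script above
from gremlin_python.process.traversal import TraversalStrategy

# evaluationTimeout is in milliseconds; 3600000000 ms is 1000 hours
g = g.with_('evaluationTimeout', 3600000000)
g = g.withStrategies(*[TraversalStrategy(
    'OptionsStrategy', {'evaluationTimeout': 3600000000},
    'org.apache.tinkerpop.gremlin.process.traversal.strategy.decoration.OptionsStrategy'
)])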
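
Since evaluationTimeout is given in milliseconds, a small helper can make the intended duration easier to read. This is only a sketch: timeout_ms and ONE_HOUR_MS are hypothetical names that do not come from gremlin_python, and one hour is an arbitrary example value, not a recommendation.

```python
from datetime import timedelta


def timeout_ms(**duration) -> int:
    """Convert timedelta keyword arguments (hours=, minutes=, ...) to whole milliseconds."""
    return int(timedelta(**duration).total_seconds() * 1000)


# Hypothetical example value: a one-hour timeout instead of a hand-typed literal.
ONE_HOUR_MS = timeout_ms(hours=1)

# Applied to the traversal source from the script above (needs a live server):
# g = g.with_('evaluationTimeout', ONE_HOUR_MS)
```

Written this way, the literal used above corresponds to timeout_ms(hours=1000).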

I also tried the steps in the failure and recovery section of the documentation, but I could not get this to work:

recovery = JanusGraphFactory.startTransactionRecovery(graph, startTime, TimeUnit.MILLISECONDS);

I believe this is the intended mechanism for recovering from failed transactions.

My own solution is to just be very careful about not creating massive transactions.
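
As a minimal sketch of that approach, the single g.V().drop() from the script above could be split into many small traversals. This assumes a sessionless remote connection, where each submitted traversal is committed on its own; drop_vertices_in_batches, batch_size, and max_batches are hypothetical names, and the default batch size is an arbitrary starting point rather than a tested value.

```python
def drop_vertices_in_batches(g, batch_size=10_000, max_batches=None):
    """Drop all vertices using many small traversals instead of one g.V().drop().

    `g` is a GraphTraversalSource such as the one built in the script above.
    Returns the number of drop traversals that were submitted.
    """
    batches = 0
    # limit(1).count() only checks whether any vertex is left, which avoids
    # a full g.V().count() scan on a large graph.
    while g.V().limit(1).count().next() > 0:
        # Each iterate() submits one small traversal to the server.
        g.V().limit(batch_size).drop().iterate()
        batches += 1
        # Optional safety valve so the loop cannot run forever.
        if max_batches is not None and batches >= max_batches:
            break
    return batches
```

In the script above, a call such as drop_vertices_in_batches(g) would take the place of g.V().drop().iterate(). Whether this avoids the failed rollback depends on the storage backend and server settings, so the batch size likely needs tuning.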