alltyme / appengine-ndb-experiment

Automatically exported from code.google.com/p/appengine-ndb-experiment

Parallel ndb transactions corrupt global datastore connection #209

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
When running asynchronous transactions in parallel, ndb corrupts the 
request-global google.appengine.api.datastore _thread_local.connection 
variable.  This doesn't affect ndb calls, but it kills old db calls (and App 
Engine APIs that depend on the old db, e.g. taskqueue) either because they 
attempt to use a finished transaction or because the connection has an 
ndb-specific adapter.  ndb attempts to capture and restore the global 
connection in ndb.Context.transaction, but this is ineffective: when 
transactions run in parallel, all but the first capture the previous 
transaction's connection instead, and "restore" that at the end.

The code snippet below demonstrates the issue when run from an interactive 
console on a production server (1.7.2).  It fails intermittently when the 
/admin/interactive/execute handler attempts to generate the XSRF token (for 
which it needs to read an _AhAdminXsrfToken_ old db model).  The code never 
fails with a single transaction, and fails more often as you add more 
transactions.  When it fails, you can see that it logs a 
TransactionalConnection prior to crashing.

A workaround is to manually clear the connection prior to executing any old db 
code.

import logging
from google.appengine.api import datastore
from google.appengine.ext import ndb

class TestNdbModel(ndb.Model):
  pass

@ndb.tasklet
def txn():
  yield TestNdbModel().put_async()

@ndb.synctasklet
def manytxn():
  # add more transactions to taste
  yield ndb.transaction_async(txn), ndb.transaction_async(txn)

manytxn()
logging.debug(datastore._thread_local.connection)
# Uncomment to work around the issue
# datastore._SetConnection(None)

Original issue reported on code.google.com by pi...@knowlabs.com on 21 Sep 2012 at 5:47
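
[Editor's note] The race described above can be reproduced without App Engine at all. Below is a minimal plain-Python sketch (all names hypothetical, not the real ndb internals) of the save-on-entry/restore-on-exit pattern: the second of two interleaved transactions saves the first one's temporary connection and later "restores" that instead of the original.

```python
class State:
    connection = "original"

state = State()

def fake_transaction(name, log):
    saved = state.connection         # save what we think is the caller's connection
    state.connection = name          # install our transactional connection
    yield                            # simulate an async datastore RPC
    state.connection = saved         # "restore" -- may be another txn's connection
    log.append((name, saved))

log = []
t1 = fake_transaction("txn1-conn", log)
t2 = fake_transaction("txn2-conn", log)
next(t1)   # txn1 saves "original", installs "txn1-conn"
next(t2)   # txn2 saves "txn1-conn" (!) -- the race
for t in (t1, t2):
    try:
        next(t)
    except StopIteration:
        pass
print(state.connection)   # -> txn1-conn, not "original": the global is corrupted
```

Interleaving more transactions makes the final "restored" value depend on scheduling order, which matches the intermittent failures reported above.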

GoogleCodeExporter commented 9 years ago
Thanks for catching this! I think the solution will have to be to save/restore 
the old datastore's connection around every yield in the transaction() call in 
context.py. What do you think?

Original comment by guido@google.com on 21 Sep 2012 at 5:58
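
[Editor's note] A toy sketch of that idea (plain Python, hypothetical names, not the actual context.py change): instead of saving once on entry and restoring once on exit, the transaction puts the outer connection back before every yield and reinstalls its own connection after resuming, so code that runs while it is suspended never observes a transactional connection.

```python
class State:
    connection = "original"

state = State()

def fixed_transaction(name):
    outer = state.connection      # the caller's (old db) connection
    state.connection = name       # install our transactional connection
    # Around every yield: expose the outer connection while suspended...
    state.connection = outer
    yield                         # simulated async RPC; other txns may run now
    state.connection = name       # ...and reinstall ours once resumed
    # (transactional work would happen here)
    state.connection = outer      # final restore

t1 = fixed_transaction("txn1-conn")
t2 = fixed_transaction("txn2-conn")
next(t1)
next(t2)                          # t2 now correctly captures "original"
for t in (t1, t2):
    try:
        next(t)
    except StopIteration:
        pass
print(state.connection)           # -> original: no corruption
```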

GoogleCodeExporter commented 9 years ago
Yes, I think that ought to do it.  Looking forward to the fix in 1.7.3.  Thanks 
Guido!

Original comment by pi...@knowlabs.com on 21 Sep 2012 at 6:04

GoogleCodeExporter commented 9 years ago
Looking into this. (But issue 210 has priority.)

Original comment by guido@google.com on 25 Sep 2012 at 12:31

GoogleCodeExporter commented 9 years ago
This issue was closed by revision 49857afed4df.

Original comment by guido@google.com on 25 Sep 2012 at 5:49

GoogleCodeExporter commented 9 years ago
My app still encounters the same error when multiple async transactions are 
run in parallel (python2.7 runtime, GAE version 1.7.3).  The error happens 
when 60 async transactions are run in parallel, though it does not occur 
every time.  Is there a limit on how many async transactions can be run in 
parallel?

The error was:

suspended generator _get_tasklet(context.py:266) raised BadRequestError(The referenced transaction has expired or is no longer valid.)
W 2012-12-13 10:00:52.967 suspended generator get(context.py:667) raised 
BadRequestError(The referenced transaction has expired or is no longer valid.)
W 2012-12-13 10:00:52.967 suspended generator get(context.py:667) raised 
BadRequestError(The referenced transaction has expired or is no longer valid.)

Original comment by zhongt...@gmail.com on 14 Dec 2012 at 2:08

GoogleCodeExporter commented 9 years ago
We've occasionally seen this error as well under high transactional load.  I 
don't know whether it has the same root cause, though -- I haven't been able 
to narrow down the circumstances in which it occurs.  Anecdotally, in one 
case removing logging statements from the code made the error stop occurring, 
so there may well be some ndb vs. old db code interference issues left.

Original comment by pi...@knowlabs.com on 14 Dec 2012 at 2:24

GoogleCodeExporter commented 9 years ago
We (repcore-prod) are also seeing this "The referenced transaction has expired 
or is no longer valid" error fairly frequently in our logs (post-1.7.4).

Is there something we can do to alleviate it?  Reduce the number of RPCs we 
issue in parallel?

Original comment by jcoll...@vendasta.com on 20 Dec 2012 at 2:51

GoogleCodeExporter commented 9 years ago
We've learned that datastore transactions have a limited lifetime:  you get 15s 
for free, then the transaction times out after 15 consecutive seconds of 
idleness (i.e. minimum life is 30 seconds), and is always killed after 60 
seconds.  So while there's no nominal upper bound on the number of parallel 
transactions, in practice CPU contention and ndb's unfair scheduling algorithm 
will prevent you from going too high.  I posted a monkeypatch on Stack Overflow 
for recovering from expired transactions which can help a bit:  
http://stackoverflow.com/a/14268276/1952074

Original comment by pi...@knowlabs.com on 10 Jan 2013 at 10:20
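
[Editor's note] The linked monkeypatch targets ndb internals; a generic sketch of the same idea (hypothetical exception and helper names, not the actual patch) is a wrapper that re-runs the whole transaction function a bounded number of times when it fails with a transaction-expiry error.

```python
class TransactionExpired(Exception):
    """Stand-in for BadRequestError('The referenced transaction has expired...')."""

def retry_on_expiry(fn, attempts=3):
    """Run fn(), retrying if the transaction expired; give up after `attempts` tries."""
    for i in range(attempts):
        try:
            return fn()
        except TransactionExpired:
            if i == attempts - 1:
                raise  # out of attempts: surface the error to the caller

# Usage: a flaky "transaction" that expires twice, then succeeds.
calls = {"n": 0}

def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransactionExpired()
    return "committed"

result = retry_on_expiry(flaky_txn)
print(result, calls["n"])   # -> committed 3
```

Note that re-running is only safe if the transaction function is idempotent, since the expired attempt may or may not have done partial work before failing.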