GoogleCloudPlatform / endpoints-proto-datastore


Memory leak using query_method with large data sets #122

Closed bmclean closed 9 years ago

bmclean commented 9 years ago

This issue is regarding the query_method decorator and large data sets (> 1000 items).

We are experiencing high memory usage when performing queries on large data sets and limiting the results. The memory seems to be allocated for the entire set, not just the results returned. For example, a query for the first 100 entities of a data set of 2000 causes an extra 5 MB of RAM to be used on each request, and this memory is never freed. However, if the last 100 entities of the 2000 are requested, no additional memory is used. Likewise, if we set produce_cursors to False in the query options used in fetch_page, no nextPageToken is returned and the behaviour is identical to requesting the last 100 entities (no additional memory used).

As a fix, we've added the following option to the query_method decorator in model.py (as of the Nov 26, 2014 commit ed91623ae07b9ff2341d420e48ed940f40552828):

Line 1435:

def query_method(cls,
                 query_fields=(),
                 collection_fields=None,
                 limit_default=QUERY_LIMIT_DEFAULT,
                 limit_max=QUERY_LIMIT_MAX,
                 user_required=False,
                 use_projection=False,
                 produce_cursors=True,
                 **kwargs):

Line 1600:

query_options['produce_cursors'] = produce_cursors
items, next_cursor, more_results = query.fetch_page(
    request_limit, **query_options)
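
For reference, here is a minimal standalone sketch of what the patched decorator ends up calling, using a placeholder model rather than this project's code: with produce_cursors=False, fetch_page generates no cursor, so no nextPageToken is returned.

    from google.appengine.ext import ndb

    class Item(ndb.Model):
        # Placeholder model, only here to make the fetch_page call concrete.
        name = ndb.StringProperty()

    # No cursor is generated, so the API response carries no nextPageToken,
    # matching the behaviour described above.
    items, next_cursor, more = Item.query().fetch_page(
        100, produce_cursors=False)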

We then call the query_method with produce_cursors=False. Is there a better solution? Or could this fix be considered for implementation?

dhermes commented 9 years ago

@bmclean Thanks for reporting this.

Are you setting limit_default? It seems strange that this would be occurring. Also it appears you have requests coming in with limit set from the query.

Can you post some sample code and a sample query / payload?

bmclean commented 9 years ago

Yes, we have limit_default set to 100. Here is an example of our call:

    @Transaction.query_method(
        query_fields=('accountId', 'belowVersion', 'limit',),
        path='TransactionGetOlder/{accountId}/{belowVersion}/{limit}',
        user_required=True, name='getOlderTransactions',
        limit_default=100, limit_max=100,
        produce_cursors=True)
    def TransactionGetOlder(self, query):
        return query.order(-Transaction.entityVersion)

(@dhermes Updated to eliminate overflow)

Here is the query:

GET /_ah/api/libraAPI/v1/TransactionGetOlder/E8F5E3BD-808B-11E4-9CA3-80E6500B2C9A/0/100

Here is an example payload: https://gist.github.com/bmclean/d762d1022dd988469f68#file-payload

dhermes commented 9 years ago

@bmclean Thanks for this. I was curious if you were using a reverse query. By using query.order(-Transaction.entityVersion) you are making it so that the "first 100" are really the "last 100".

I'm not 100% sure about the reason for the large memory use. ndb uses two levels of caching, in-memory and memcache (in addition to persisting to the datastore). I think (and am checking now) that asking for the beginning of a reversed query is what is causing the issue.

For large datasets, using paging with limits is really not possible. Instead, cursors are the way to enable paging.
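
For what it's worth, here is a minimal sketch of cursor-based paging with plain ndb; the model and function names are placeholders rather than anything from this project.

    from google.appengine.ext import ndb

    class Item(ndb.Model):
        name = ndb.StringProperty()

    def get_page(page_size=100, websafe_cursor=None):
        # Resume from the cursor returned with the previous page instead of
        # re-walking the result set from the beginning.
        start = ndb.Cursor(urlsafe=websafe_cursor) if websafe_cursor else None
        items, next_cursor, more = Item.query().fetch_page(
            page_size, start_cursor=start)
        next_token = next_cursor.urlsafe() if (more and next_cursor) else None
        return items, next_token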

dhermes commented 9 years ago

OK, I tested the cache (locally) and I still don't understand why 5 MB is used.

from google.appengine.ext import ndb

class A(ndb.Model):
  a = ndb.StringProperty()

# 100+ entities already stored for the A model

ctx = ndb.get_context()
print 'Cache 1:', ctx._cache
A(a='zap').put()
print 'Cache 2:', ctx._cache
results, cursor, more = A.query().fetch_page(2)
print 'Cache 3:', ctx._cache
results, cursor, more = A.query().order(-A.a).fetch_page(2)
print 'Cache 4:', ctx._cache

So the above is printing the in-memory cache after a few operations.

The results are as follows (I edited the datastore IDs for viewing):

Cache 1: {}
Cache 2: {Key('A', 1): A(key=Key('A', 1), a=u'zap')}
Cache 3: {Key('A', 1): A(key=Key('A', 1), a=u'zap'),
          Key('A', 2): A(key=Key('A', 2), a=u'46'),
          Key('A', 3): A(key=Key('A', 3), a=u'40')}
Cache 4: {Key('A', 1): A(key=Key('A', 1), a=u'zap'),
          Key('A', 2): A(key=Key('A', 2), a=u'46'),
          Key('A', 4): A(key=Key('A', 4), a=u'zap'),
          Key('A', 5): A(key=Key('A', 5), a=u'zap'),
          Key('A', 3): A(key=Key('A', 3), a=u'40')}

dhermes commented 9 years ago

It took me a while to find, but here is a reference that might help: http://stackoverflow.com/a/3566878/1068170

The fact that endpoints-proto-datastore doesn't use offset makes it a little different though.

bmclean commented 9 years ago

We just tried removing the reverse query and it didn't make a difference. One other thing we did try was to turn off both the in-memory cache and the memcache in our Transaction class:

_use_cache = False
_use_memcache = False

and we still see the memory jump up with each query.
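
(An aside: the same effect can also be toggled per request through ndb's context policies instead of the class attributes; a minimal sketch, in case it is easier to flip on and off while testing.)

    from google.appengine.ext import ndb

    # Disable both the in-context cache and memcache for everything fetched
    # through this context; equivalent in effect to the class attributes above.
    ctx = ndb.get_context()
    ctx.set_cache_policy(False)
    ctx.set_memcache_policy(False)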

dhermes commented 9 years ago

@bmclean Thanks for digging. If you do the query outside of endpoints-proto-datastore do you have the same issues? I'm still unclear if the issue is

  1. endpoints-proto-datastore directly
  2. endpoints-proto-datastore calling ndb incorrectly
  3. ndb doing something wrong / unexpected

bmclean commented 9 years ago

I executed the following 20 times using the remote_api_stub:

 Transaction.query(ancestor=ndb.Key('Account', account_id)).fetch_page(100, start_cursor=None)

The memory of the python process started at 104.8 MB and ended at 108.1 MB.

When I make the request 20 times using the Google API explorer the memory of the python process started at 106.5 MB and ended at 199.8 MB.
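
(For anyone reproducing this, a rough stdlib way to sample process memory between runs, assuming a Linux or macOS development machine rather than the App Engine sandbox; note that ru_maxrss is reported in kilobytes on Linux and bytes on macOS.)

    import resource

    # Peak resident set size of the current process so far.
    print 'peak RSS:', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss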

dhermes commented 9 years ago

Wow, thanks for this! I am totally fine with accepting your fix, though:

  1. I don't fully understand why it works / what is broken without it.
  2. I'd like to run it by someone who actually (still) works at Google.

Do you want to send a pull request?

dhermes commented 9 years ago

# Works just fine
q = Transaction.query(ancestor=ndb.Key('Account', account_id))
result, next_cursor, more = q.fetch_page(100, start_cursor=None)

# Happens in endpoints-proto-datastore
# query_info.cursor is None if not set from the request
items, next_cursor, more_results = query.fetch_page(
    request_limit, start_cursor=query_info.cursor,
    projection=['field1', 'field2'])

A possible culprit may be

query_info = request_entity._endpoints_query_info
query_info.SetQuery()

since these _EndpointsQueryInfo objects will also hold a (potentially large) entity. It's unclear to me, though, why not having produce_cursors set would allow them to be GC'ed.

Can you add

import gc
potential = [obj for obj in gc.get_objects()
             if 'endpoints_proto_datastore' in type(obj).__module__]

into your code to see where the 5 MB bump comes from when running under remote_api_stub?
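
(A slightly expanded version of that snippet, tallying survivors by type so that repeated requests show what accumulates; a sketch I have not run against this app.)

    import collections
    import gc

    def count_epd_objects():
        # Tally live endpoints_proto_datastore objects by type; call once per
        # request and diff successive counters to see what is being retained.
        return collections.Counter(
            type(obj).__name__ for obj in gc.get_objects()
            if 'endpoints_proto_datastore' in getattr(type(obj), '__module__', ''))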

dhermes commented 9 years ago

Some more potentially helpful links:

  * https://code.google.com/p/googleappengine/issues/detail?id=9610
  * http://stackoverflow.com/questions/12095259/ndb-not-clearing-memory-during-a-long-request
  * Memory profiler: http://stackoverflow.com/a/110826/1068170

bmclean commented 9 years ago

We realized that running the query from the remote_api_stub really isn't the same thing. So we added the test query to one of our webapp2 controllers:

q = Transaction.query(ancestor=ndb.Key('Account', account_id))
result, next_cursor, more = q.fetch_page(100, start_cursor=None)

and refreshed the page 20 times. This resulted in the memory climbing to 185 MB. However, if we change the query to a fetch (instead of fetch_page):

q = Transaction.query(ancestor=ndb.Key('Account', account_id))
result = q.fetch(100)

the memory doesn't climb at all.

dhermes commented 9 years ago

I read through all of https://code.google.com/p/googleappengine/issues/detail?id=9610 and I'm pretty sure this is what we have encountered.

ISTM the bug is clearly in ndb. However, I am unclear how to address this within endpoints-proto-datastore. Your fix is nice, but it will remove the ability to page. I suppose for sophisticated users such as yourself, patching the library and using db instead of ndb in query_method would give you a way to eliminate the memory leak while still enabling paging.
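
(To illustrate that last suggestion, a rough sketch of cursor paging with the older db API; the model and field names are placeholders, and whether db actually sidesteps this leak has not been verified here.)

    from google.appengine.ext import db

    class TransactionDb(db.Model):
        # Placeholder db model mirroring the field used for ordering above.
        entityVersion = db.IntegerProperty()

    def fetch_older_page(cursor=None, page_size=100):
        q = TransactionDb.all().order('-entityVersion')
        if cursor:
            # Resume where the previous page left off.
            q.with_cursor(start_cursor=cursor)
        items = q.fetch(page_size)
        # After a fetch, q.cursor() is a websafe string usable as the next
        # page token.
        return items, q.cursor()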

bmclean commented 9 years ago

We are defaulting produce_cursors to True in model.py so the default paging behaviour is unchanged. We then set produce_cursors=False in our API method decorators to explicitly disable it.

Obviously this is a workaround for an ndb bug, not an issue in endpoints-proto-datastore. Based on this behaviour, we're going to avoid fetch_page calls with cursors in our web application as well.

dhermes commented 9 years ago

How do you mean "defaulting produce_cursors to True in model.py"? Like as a module global or defined on the classes themselves?

Also, if you don't call fetch_page, how do you get a cursor back from a query?
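
(For reference, ndb can also hand back a cursor without fetch_page by way of its query iterator; a minimal sketch with a placeholder model, untested against the leak discussed in this issue.)

    from google.appengine.ext import ndb

    class Item(ndb.Model):
        name = ndb.StringProperty()

    def page_without_fetch_page(page_size=100):
        # Iterate with produce_cursors=True, then ask the iterator where it
        # stopped; cursor_after() can serve as the next request's start_cursor.
        it = Item.query().iter(limit=page_size, produce_cursors=True)
        items = list(it)
        next_cursor = it.cursor_after() if items else None
        return items, next_cursor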

bmclean commented 9 years ago

We added an argument called produce_cursors to the query_method decorator, with a default value of True.

Line 1435 of model.py:

def query_method(cls,
                 query_fields=(),
                 collection_fields=None,
                 limit_default=QUERY_LIMIT_DEFAULT,
                 limit_max=QUERY_LIMIT_MAX,
                 user_required=False,
                 use_projection=False,
                 produce_cursors=True,
                 **kwargs):

We then use the decorator like this:

@Transaction.query_method(
        query_fields=('accountId', 'belowVersion', 'limit',),
        path='TransactionGetOlder/{accountId}/{belowVersion}/{limit}',
        user_required=True, name='getOlderTransactions',
        limit_default=100, limit_max=100,
        produce_cursors=False)
def TransactionGetOlder(self, query):
    return query.order(-Transaction.entityVersion)

We don't use a cursor for this call. We sort by an attribute called entityVersion, which we pass in with the request, and then use an inequality filter to return a subset of the results. This is likely not as efficient as a cursor, but it works well.
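
(For completeness, a minimal sketch of that version-based paging pattern; the model is trimmed to the one field involved, and the comparison direction is an assumption based on the description above.)

    from google.appengine.ext import ndb

    class Transaction(ndb.Model):
        # Trimmed stand-in for the real model; only the paging field is shown.
        entityVersion = ndb.IntegerProperty()

    def get_older_transactions(account_id, below_version, limit=100):
        # Page by inequality on entityVersion instead of a datastore cursor.
        q = Transaction.query(ancestor=ndb.Key('Account', account_id))
        q = q.filter(Transaction.entityVersion < below_version)
        return q.order(-Transaction.entityVersion).fetch(limit)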

dhermes commented 9 years ago

I see.

bmclean commented 9 years ago

If you think our change is useful to the project or you want to see the changes, I can generate a pull request. Just let me know. We certainly appreciate all of the help today @dhermes!

dhermes commented 9 years ago

@bmclean I think I understand your change; I just don't understand why it fixes the ndb bug, so let's hold off.

I'm closing this out.

/cc @pcostell @Alfus https://code.google.com/p/googleappengine/issues/detail?id=9610 is still out in the wild.

dhermes commented 9 years ago

@josemontesdeoca FYI

This is another report and gives some sample code for the memory leak in https://code.google.com/p/googleappengine/issues/detail?id=9610

In particular, the peaks of 185 MB / 199.8 MB of RAM stuck out in my memory after your comment about a peak of 300 MB.

josemontesdeoca commented 9 years ago

@dhermes Thanks for the notice! I'll definitely take a closer look.