No change *whatsoever* has been made to the query use of the cache in GAE
plugin v2.x (using DataNucleus v3) relative to GAE plugin v1.x (using
DataNucleus v1). No idea where DataNucleus 2 comes into this since it was never
used by any GAE plugin. The L2 cache is obviously also used on getObjectById
calls, its primary use-case, giving significant benefit.
Obviously the alternate route you propose could check whether the query has no
ordering/result clause (anything else needed?) and, if so, do a keys-only
query, try the returned keys in the cache, and go back to the datastore only
for those not present. That would be doing 2 datastore ops relative to the
current 1, though it *may* be quicker in some cases. You could contribute such
a thing, since it is seemingly a priority for your project.
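As a rough sketch of that route against the raw low-level datastore and
memcache APIs (illustrative only: the kind name is taken from later in this
thread, and plain datastore Keys stand in here for whatever key scheme the
plugin's L2 cache actually uses):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.Query;
    import com.google.appengine.api.memcache.MemcacheService;
    import com.google.appengine.api.memcache.MemcacheServiceFactory;

    public class KeysOnlyCacheRoute {
        static Map<Key, Entity> fetchMisses() {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

            // Datastore op 1: keys-only query.
            List<Key> keys = new ArrayList<Key>();
            for (Entity e : ds.prepare(new Query("PlanCompletion").setKeysOnly()).asIterable()) {
                keys.add(e.getKey());
            }

            // Try every key in the cache first.
            Map<Key, Object> cached = cache.getAll(keys);

            // Datastore op 2: one batch get for whatever the cache missed.
            List<Key> misses = new ArrayList<Key>();
            for (Key k : keys) {
                if (!cached.containsKey(k)) {
                    misses.add(k);
                }
            }
            return misses.isEmpty() ? Collections.<Key, Entity>emptyMap() : ds.get(misses);
        }
    }

Merging the cached and freshly fetched entities back into one ordered result
is left out of the sketch.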
Original comment by googleco...@yahoo.co.uk
on 1 Dec 2012 at 7:01
Thanks for the quick response, and thank you for the clarification.
First off, sorry about the confusion with DataNucleus versions; I was thinking
the old GAE plugin used DataNucleus v2, when of course it was v1. From what you
say, there must be something else I've done that has caused the increased
number of reads.
I could probably make changes to my application to encourage better cache usage
- such as using the built-in unmanaged relationships - though there are a lot
of occasions where I only want a subset of the children (for example, filtered
by date).
I do think a keys-only fetch could be beneficial for queries when the L2 cache
is employed. While it does result in an extra datastore op, surely the
underlying actions are the same - just split over two calls. I may be wrong,
but I would guess that the full fetch gets the keys from the index first,
before then going ahead and fetching the entities - all we'd be doing is
breaking this into two separate calls in order to check the cache in between.
Perhaps a little more thought is needed, as I know that the "Next" calls (which
I presume are where a lot of the entity fetching happens) happen concurrently
while the cache is being checked and the client is processing records.
I'm not sure that ordering would pose a problem. Wouldn't the keys be fetched
in the defined order, providing the appropriate index is present?
Of course, it would introduce an additional small operation. However, if it
saved at least 1 read for every 7 additional small ops, then it pays off. This
depends on workflow, so I suppose it would ideally be configurable.
I might have a look at implementing it. I'm completely unfamiliar with
DataNucleus, so it will be a steep learning curve, but I'd like to contribute
this if I can.
Thanks again for the helpful response.
Original comment by mar...@longhome.co.uk
on 1 Dec 2012 at 6:20
Firstly, you're right about ordering not being important here; I was thinking
of something else.
Secondly, the easiest way to try it out is to add a query extension to enable a
"keys-only" query, and then update the code to check for the extension (so by
default people won't see this, and the default can be changed in the future if
it gets to the point of being universally applicable). You only really need to
know the code under com.google.appengine.datanucleus.query to do this. In
DatastoreQuery, look at DatastoreManager.QUERYEXT_EXCLUDE_FROM_TXN for an
example of a query extension, and do the same for yours. Then under "normal
query" you enable keysOnly when that extension is set, and finally, when your
extension is set, you create an alternative to StreamingQueryResult that takes
in the query result.
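As a usage sketch, such an extension would be switched on per query through the
standard JDO extension mechanism. The extension name and the filter below are
invented for illustration; only the QUERYEXT_EXCLUDE_FROM_TXN naming pattern is
real:

    import java.util.List;
    import javax.jdo.PersistenceManager;
    import javax.jdo.Query;

    public class KeysOnlyQueryUsage {
        // Hypothetical name, following the DatastoreManager QUERYEXT_* pattern.
        static final String QUERYEXT_KEYS_ONLY_WITH_CACHE =
                "datanucleus.appengine.query.keysOnlyWithCache";

        @SuppressWarnings("unchecked")
        static List<PlanCompletion> run(PersistenceManager pm, Object ownerKey) {
            Query q = pm.newQuery(PlanCompletion.class, "owner == :owner");
            q.addExtension(QUERYEXT_KEYS_ONLY_WITH_CACHE, "true");
            return (List<PlanCompletion>) q.execute(ownerKey);
        }
    }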
As you saw, the iteration process is complicated since it currently only loads
objects when next is called on the query result iterator, hence it caters for
large query results. A simple alternative would be to load all query results up
front: first get all objects present in the cache, then for all those not
present do a single call to ds.get(keys) and convert each one-by-one to a Pojo
(EntityUtils.entityToPojo). Obviously loading everything is "non-optimum" but
gives a chance to see if the idea warrants further work. The method
"ExecutionContext.getObjectsFromCache(ids)" is only present in DataNucleus core
3.1.4-SNAPSHOT and later (not in earlier versions). Once you have all objects
you just return a query result wrapping them.
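A sketch of that eager shape, with caveats: the return type of
getObjectsFromCache and the ExecutionContext package location are assumptions
for 3.1, and the full parameter list of EntityUtils.entityToPojo isn't quoted
in the thread, so a placeholder stands in for it:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.Key;
    import org.datanucleus.store.ExecutionContext; // package location assumed for 3.1

    public class EagerCachedResult {
        static List<Object> resolve(ExecutionContext ec, DatastoreService ds,
                                    List<Object> ids, List<Key> keys) {
            List<Object> pojos = new ArrayList<Object>();
            List<Key> misses = new ArrayList<Key>();

            // 1. Whatever the L2 cache already holds (core 3.1.4-SNAPSHOT+);
            //    assumed here to return a map from id to cached pojo.
            Map<Object, Object> cached = ec.getObjectsFromCache(ids);
            for (int i = 0; i < ids.size(); i++) {
                Object pojo = cached.get(ids.get(i));
                if (pojo != null) {
                    pojos.add(pojo);
                } else {
                    misses.add(keys.get(i));
                }
            }

            // 2. A single batch get for the misses, converted one-by-one.
            Map<Key, Entity> fetched = ds.get(misses);
            for (Key k : misses) {
                pojos.add(toPojo(ec, fetched.get(k)));
            }
            return pojos; // wrap in a query result instead of StreamingQueryResult
        }

        // Placeholder: delegate to EntityUtils.entityToPojo(...) in the plugin.
        private static Object toPojo(ExecutionContext ec, Entity entity) {
            throw new UnsupportedOperationException("EntityUtils.entityToPojo goes here");
        }
    }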
Original comment by googleco...@yahoo.co.uk
on 2 Dec 2012 at 9:04
Thanks for the pointers. I've been having a look around and what you say makes
sense. I see that the asynchronous fetches are handled by the datastore async
API, so that is all dealt with.
I thought I'd just try implementing this in my application to start with, to
see what benefit I get, and I seem to have hit a problem (and possibly found
the reason that my reads have jumped up). I do an async keys-only datastore
query in my code, and then loop through the results calling
pm.getObjectById(PlanCompletion.class, e.getKey().getId()), which should take
advantage of the cache if the object is in it. The code seems to be calling
memcache.Get(), but then datastore_v3.Get() immediately after for each entity.
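Reconstructed from that description (filters omitted; PlanCompletion is the
application's entity class), the pattern is roughly:

    import javax.jdo.PersistenceManager;
    import com.google.appengine.api.datastore.AsyncDatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.Query;

    public class KeysOnlyThenGetById {
        static void load(PersistenceManager pm) {
            AsyncDatastoreService ads = DatastoreServiceFactory.getAsyncDatastoreService();
            for (Entity e : ads.prepare(new Query("PlanCompletion").setKeysOnly()).asIterable()) {
                // Expected to come from the L2 cache, but the log below shows
                // each memcache.Get() followed by a datastore_v3.Get().
                PlanCompletion pc = pm.getObjectById(PlanCompletion.class, e.getKey().getId());
                // ... use pc ...
            }
        }
    }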
The log for each item looks like this:
0:1354487700.542000 org.datanucleus.ObjectManagerImpl getObjectFromLevel1Cache: Object with id "com.rundouble.rpc.PlanCompletion:1191412" not found in Level 1 cache [cache size = 3]

0:1354487700.600000 org.datanucleus.ObjectManagerImpl putObjectIntoLevel1Cache: Object "com.rundouble.rpc.PlanCompletion@1ad4feb" (id="com.rundouble.rpc.PlanCompletion:1191412") added to Level 1 cache (loadedFlags="[NNNNNNNNNYNNNNNNNNNNNNNNNNN]")

0:1354487700.600000 org.datanucleus.state.JDOStateManager wrapSCOField: Object "com.rundouble.rpc.PlanCompletion@1ad4feb" (id="com.rundouble.rpc.PlanCompletion:1191412") is having the value in field "completed" replaced by a SCO wrapper

0:1354487700.601000 org.datanucleus.ObjectManagerImpl getObjectFromLevel2Cache: Object with id="com.rundouble.rpc.PlanCompletion:1191412" taken from Level 2 cache (fields="[0, 1, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 17, 16, 19, 18, 21, 23, 22, 25, 24, 26]", version="") - represented as "com.rundouble.rpc.PlanCompletion@1ad4feb"

0:1354487700.601000 org.datanucleus.state.LifeCycleState changeState: Object "com.rundouble.rpc.PlanCompletion@1ad4feb" (id="com.rundouble.rpc.PlanCompletion:1191412") has a lifecycle change : "P_CLEAN"->"P_NONTRANS"

0:1354487700.601000 org.datanucleus.ObjectManagerImpl evictFromTransaction: Object "com.rundouble.rpc.PlanCompletion@1ad4feb" (id="com.rundouble.rpc.PlanCompletion:1191412") being evicted from transactional cache

0:1354487700.601000 org.datanucleus.ObjectManagerImpl evictFromTransaction: Object "com.rundouble.rpc.PlanCompletion@1ad4feb" (id="com.rundouble.rpc.PlanCompletion:1191412") is not transactional

0:1354487700.601000 org.datanucleus.store.connection.ConnectionManagerImpl allocateConnection: Connection found in the pool : com.google.appengine.datanucleus.DatastoreConnectionFactoryImpl$DatastoreManagedConnection@cc3e8 for key=org.datanucleus.ObjectManagerImpl@96e599 in factory=ConnectionFactory:nontx[com.google.appengine.datanucleus.DatastoreConnectionFactoryImpl@11da53]

0:1354487700.601000 com.google.appengine.datanucleus.EntityUtils getEntityFromDatastore: Getting entity of kind PlanCompletion with key PlanCompletion(1191412)
I was wondering if it was being thrown out because of a transaction or
something; I did open a transaction earlier in the HTTP request. So I've tried
both creating a new PersistenceManager and calling
pm.currentTransaction().setNontransactionalRead(true).
I'm baffled why this keeps going to the datastore. Could this be a bug?
I'm using DataNucleus 3.1.1 with the appengine plugin 2.1.1 (as shipped with
the GAE API 1.7.3). My jdoconfig.xml is:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<jdoconfig xmlns="http://java.sun.com/xml/ns/jdo/jdoconfig"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="http://java.sun.com/xml/ns/jdo/jdoconfig">
  <persistence-manager-factory name="transactions-optional">
    <property name="javax.jdo.PersistenceManagerFactoryClass" value="org.datanucleus.api.jdo.JDOPersistenceManagerFactory"/>
    <property name="javax.jdo.option.ConnectionURL" value="appengine"/>
    <property name="javax.jdo.option.NontransactionalRead" value="true"/>
    <property name="javax.jdo.option.NontransactionalWrite" value="true"/>
    <property name="javax.jdo.option.RetainValues" value="true"/>
    <property name="datanucleus.appengine.singletonPMFForName" value="true"/>
    <property name="datanucleus.appengine.autoCreateDatastoreTxns" value="false"/>
    <property name="datanucleus.cache.level2.type" value="jcache"/>
    <property name="datanucleus.cache.level2.cacheName" value="jdocache"/>
    <property name="datanucleus.appengine.datastoreEnableXGTransactions" value="true"/>
  </persistence-manager-factory>
</jdoconfig>
I've been stuck all evening trying to figure this out, but I'm at a loss. I
don't really want to try implementing this in the plugin until I get a good
idea how much difference it will make.
Thanks
Original comment by mar...@longhome.co.uk
on 2 Dec 2012 at 11:29
Suggest you familiarise yourself with the JDO spec; getObjectById(id) is the
same as getObjectById(id, true), which requires the implementation to validate
the existence of that object in the datastore (regardless of whether it came
from some cache). You can remove that validation check by calling
getObjectById(id, false) or by setting
"datanucleus.findObject.validateWhenCached" to false.
Original comment by googleco...@yahoo.co.uk
on 3 Dec 2012 at 9:44
D'oh! The worst thing is I actually read that a few months back. Sorry for the
confusion.
Indications are that it doesn't make a huge difference in query time: the
keys-only version seems to be marginally faster when the cache is hot - I've
yet to do tests with a cold cache.
Cost-wise, there is a huge benefit when the cache is hot, but then that was
calculable and I think we knew that. This particular test, which fetches about
100 entities, goes from a cost of 9450 to 1650 micropennies, which could be a
big deal to someone whose biggest cost area is datastore ops and who gets a
high rate of memcache hits.
Original comment by mar...@longhome.co.uk
on 3 Dec 2012 at 10:17
Original issue reported on code.google.com by
mar...@longhome.co.uk
on 30 Nov 2012 at 9:53