deanhiller / playorm

ORM for nosql with SCALABLE SQL
https://github.com/deanhiller/playorm/wiki

Too many single reads before batch update - performance penalty #112

Open · snazy opened this issue 11 years ago

snazy commented 11 years ago

When persisting entities, PlayORM executes a series of "multiget_slice" requests against Cassandra, one per key.

Each request consists of two IP packets to the server and two IP packets back from the server - four packets for a single key.

If a lot of rows are being persisted, this results in a big penalty from the large number of server round trips.

Please consider checking key existence for multiple keys at once instead of checking each key's existence one after another.

deanhiller commented 11 years ago

Are you sure this is not an Astyanax ticket? mgr.flush eventually calls CassandraSession.sendChangesImpl, which calls persist, remove, persisteIndex, removeIndex, and removecolumn. Those methods do nothing except build up a mutation, which is executed on line 137 of CassandraSession and is only a write - there should be no reads whatsoever. Can you put a breakpoint at line 137, step over it, and see if that line is causing the reads? If so, I believe that is an Astyanax bug and needs to be filed with them. (They fixed the other bug I reported, though it sometimes takes them a while; this one should be high priority, I would think.)

deanhiller commented 11 years ago

oh, are you talking about when persist is called or when flush is called? persist should neither write to nor read from the server at all. If you are talking about persist, that is definitely a PlayORM bug. If you are talking about flush, I suspect Astyanax, since I see no reads taking place.

snazy commented 11 years ago

It happens when flush is called. I'll retry tomorrow and debug into the code. I just noticed that performance degraded when the code ran against a remote cluster instead of the local machine (which naturally has nearly no latency).

snazy commented 11 years ago

It is PlayORM.

The stack trace looks like this (from top to bottom):

  1. com.alvazan.orm.layer0.base.BaseEntityManagerImpl#put(Object)
  2. com.alvazan.orm.layer0.base.BaseEntityManagerImpl#putImpl(Object,boolean,MetaClass)
  3. com.alvazan.orm.layer0.base.BaseEntityManagerImpl#find(Class,Object)

The read is triggered from within put(Object).
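To illustrate the effect (a sketch, not PlayORM source - MyEntity and the loop are hypothetical): every put() walks down to find(), so persisting N entities costs N serialized single-key reads before the one batched write.

```java
// Sketch of the observed behavior -- not PlayORM source; MyEntity is hypothetical.
for (MyEntity e : entities) {
    mgr.put(e);  // internally reaches BaseEntityManagerImpl#find -> one multiget_slice per key
}
mgr.flush();     // only here does the single batched write happen
```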

deanhiller commented 11 years ago

so basically on the put, not on the flush. Hmmmm, I had not completely thought that one through. You are just inserting entities, correct (not updating)? There is a put(Object, boolean) overload where put(entity, true) should not do a read. I am thinking we should reverse this so that the default behavior of put is never to read, while put(entity, false) incurs a read ONLY if needed (i.e. if the entity was already read, it would not incur another read).
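If that overload behaves as described above, the workaround today would look something like this (a sketch; only safe when you know the rows are new):

```java
// Assuming put(Object, boolean) as described above: true = known insert, skip the read.
for (MyEntity e : newEntities) {
    mgr.put(e, true);  // no existence read; do NOT use this when the row might already exist
}
mgr.flush();
```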

ps. this is all for indexing... to properly update indexes on an update, we have to run one "delete index value column" mutation and one "add index value column" mutation. If anyone does an update without reading the entity first, the index will only add the new value and not delete the old one.
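In other words, the old value has to be known before the index can be corrected. A hypothetical sketch of that maintenance step (the Index type and method names are illustrative, not PlayORM's API):

```java
// Hypothetical sketch of index maintenance on update -- illustrative names only.
interface Index {
    void removeColumn(Object value, Object rowKey);
    void addColumn(Object value, Object rowKey);
}

static void updateIndexedField(Index idx, Object rowKey, Object oldValue, Object newValue) {
    idx.removeColumn(oldValue, rowKey); // needs the OLD value -- this is why a read is required
    idx.addColumn(newValue, rowKey);    // write the entry for the new value
}
```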

snazy commented 11 years ago

I just call put(Object) (not put(Object, boolean)) because I do not know whether the row existed before. There are no indexes on these entities.

The "needRead==true" is caused, because I just used the @NoSqlId annotation without any parameters. This implies that the "NoSqlId.usegenerator" is true (the default). It's nice, that the ID is auto-generated by default. But the find-call doesn't make sense, if the ID is set and there are no indexes to update.

I've changed my @NoSqlId annotations to use usegenerator=false and the round trips are gone.
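For anyone hitting the same issue, the workaround looks roughly like this (annotation and parameter as named in this thread; the entity class and import paths are illustrative - verify them against the PlayORM API):

```java
import com.alvazan.orm.api.base.anno.NoSqlEntity; // import paths assumed -- verify
import com.alvazan.orm.api.base.anno.NoSqlId;

@NoSqlEntity
public class MyEntity {
    @NoSqlId(usegenerator = false) // application assigns the id -> no existence read on put()
    private String id;
}
```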

deanhiller commented 11 years ago

hmmm, this if statement might be wrong in the put method. Let me talk out loud and together we can probably work this out. The idea is:

  1. If the entity is NOT a NoSqlProxy, we know you did not read it from noSQL already, so you created the entity yourself - and that might be an insert AND might be an update... we don't know which.

That said, a simple if(!(entity instanceof NoSqlProxy)) check breaks a lot of tests that expect an autogenerated id (the id is still null at put time, so there is nothing to read); in other words, IF the id is autogenerated AND null, we know this must be an insert, not an update. So:

  1. The id is allowed to be null IF autogen=true.
  2. If autogen=false and the id is null, the insert will fail anyway, so we don't care about this case.
  3. If autogen=false and the id is not null, AND the entity is not a NoSqlProxy, we need a read to see whether it is an update or an insert (see the sketch below).
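Condensing those cases into one check, my reading is something like the sketch below (not the actual PlayORM code; NoSqlProxy marks instances that were read from noSQL):

```java
// A sketch only -- condenses the cases above; not PlayORM's implementation.
static boolean needsReadBeforePut(Object entity, Object id) {
    if (entity instanceof NoSqlProxy)
        return false;  // already read from noSQL: definitely an update
    if (id == null)
        return false;  // autogen => fresh insert; non-autogen => the insert fails anyway
    return true;       // unmanaged object with an id set: insert or update? must read
}
```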

A decent test case to step through is TestManyToOne.testIndexWithToOne; remove the "idField.isAutoGen() && id != null" check and that test breaks.

There is another way, though. If you do a batch read first, passing in, say, 20 entity ids, it reads in parallel rather than in series, which is much faster. A list is returned where N of the 20 entities exist and M do not (N+M = 20); as you iterate over the 20, you grab the ones that were read, create new ones for the rest, and then do a batch put - avoiding all the serialized reads that cause the penalty.
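Sketched out, that pattern would look roughly like this (the findAll/KeyValue usage is an assumption based on this thread and the wiki - verify the exact signatures against the PlayORM API; MyEntity and its setter are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batch-read-then-batch-put pattern -- verify API details.
List<MyEntity> upsertAll(NoSqlEntityManager mgr, List<String> ids) {
    List<KeyValue<MyEntity>> found = mgr.findAll(MyEntity.class, ids); // one parallel read
    List<MyEntity> result = new ArrayList<MyEntity>();
    for (KeyValue<MyEntity> kv : found) {
        MyEntity e = kv.getValue();          // assumed null when the row does not exist
        if (e == null) {
            e = new MyEntity();              // hypothetical entity class
            e.setId((String) kv.getKey());   // hypothetical setter
        }
        result.add(e);
        mgr.put(e);                          // queues the mutation only
    }
    mgr.flush();                             // single batched write for everything
    return result;
}
```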

deanhiller commented 11 years ago

oh, and let me know what you think before I proceed with anything here.

snazy commented 11 years ago

The point is what you say: "that might be an insert AND might be an update...we don't know which"

That's the reason why JPA/Hibernate have both a persist() and a merge() method. merge() allows you to persist the state of an unmanaged (detached) object to the database.
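For comparison, in plain JPA that distinction looks like this (standard javax.persistence; MyEntity is a hypothetical entity class):

```java
import javax.persistence.EntityManager;

// Standard JPA semantics -- shown only for the analogy drawn above.
void persistVsMerge(EntityManager em, MyEntity fresh, MyEntity detached) {
    em.persist(fresh);                      // attaches a NEW instance; INSERT at flush
    MyEntity managed = em.merge(detached);  // returns a managed copy of the detached
                                            // state; INSERT or UPDATE as appropriate
}
```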

- usegenerator==false
  - ID in object set
    - instanceof NoSqlProxy -> must be an update
    - ! instanceof NoSqlProxy -> seems to be an insert - but might be an update...
  - null ID in object set
    - instanceof NoSqlProxy -> must fail
    - ! instanceof NoSqlProxy -> must fail
- usegenerator==true
  - ID in object set
    - instanceof NoSqlProxy -> must be an update
    - ! instanceof NoSqlProxy -> seems to be an insert - but might be an update...
  - null ID in object set
    - instanceof NoSqlProxy -> must fail
    - ! instanceof NoSqlProxy -> generate ID, insert

I think the current implementation is failsafe - that's generally fine :-)

Regarding the two "seems to be" cases - maybe it would be sufficient to make an object "managed" manually, by passing in the "unmanaged" object and returning the (possibly already cached) "managed" object (managed = instanceof NoSqlProxy). This would require a new method in NoSqlEntityManager, like public T manage(T unmanagedObject).
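Usage might look like this (manage() is the proposed, not-yet-existing method; the setter is hypothetical):

```java
// Hypothetical -- sketches the proposed NoSqlEntityManager.manage(T):
MyEntity managed = mgr.manage(unmanagedEntity); // returns the NoSqlProxy-backed instance
managed.setName("new value");                   // hypothetical setter
mgr.put(managed);                               // known update now: no extra existence read
mgr.flush();
```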

snazy commented 11 years ago

I think in most use cases, people read objects from the entity manager, modify them, and write them back using put(). The use case of providing an unmanaged object to execute an update should be rare - so a manage() method should be fine.

But that method should take care of the relations in the object being managed.

And the put() method should persist new relation objects, too (currently it does not, I think).

deanhiller commented 11 years ago

I am slammed on the business side right now (a good thing), so I may not get to this for a bit, but it sounds like you could really use it? If so, I will merge the pull request you need immediately.
thanks, Dean

deanhiller commented 11 years ago

oh, the cascade option of Hibernate has always been a highly debated topic, as it has been very bug-prone and hard to make work right. Some models have infinite span, so you can't exactly cascade everything, and telling it how and where to cut the model is always an issue, especially when you go to change something. That should be a separate ticket, and we really need to think hard on that one. A case in point: a User has a List<Tshirt> bought and a List<Tshirt> designed, while a Tshirt has a private User designer and a private List<User> buyers - the cascade could go on and on. Anyways, open another ticket on this one. I think maybe it is a higher layer that I will personally avoid a lot... we avoided Hibernate cascades and it got rid of a lot of issues we were having; there are so many corner cases, and worse yet, developers new to Hibernate really had a hard time.
