neonknight closed this issue 12 years ago
That feature in pycassa derives from a very nice Python feature. Since ColumnFamily.get_range returns an iterator, it is basically syntactic sugar for:
    rows = mycolfam.get_range()                 # fetch the first batch of rows
    while len(rows) > 0:
        # do something with each row in the batch,
        # then fetch the next batch, starting at the last key seen
        rows = mycolfam.get_range(start=rows.keys()[-1])
In that example, get_range probably uses the default _rowcount of 100 per batch and repeats the call for as long as rows keep coming back.
To have this kind of logic (even without the syntactic sugar), Cassandra::Simple would probably have to use something like Tie::Hash::DxHash so that get_range returns its keys in order and they can be iterated properly (pycassa uses Python's native OrderedDict for this).
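For illustration, the paging behind that kind of iterator could be sketched roughly like this (a Python sketch; fetch_batch is a hypothetical helper standing in for a single, non-paging range query, and none of this is pycassa's actual implementation):

```python
def iter_all_rows(fetch_batch, batch_size=100):
    """Yield every (key, columns) pair by repeatedly fetching key-ordered batches.

    fetch_batch(start_key, count) is a hypothetical helper returning an
    ordered list of (key, columns) pairs, starting at start_key (inclusive).
    """
    start_key = ''
    skip_first = False            # after the first batch, the start key was already yielded
    while True:
        batch = fetch_batch(start_key, batch_size)
        if skip_first:
            batch = batch[1:]     # drop the row that was already yielded last time
        if not batch:
            return                # no more rows
        for key, columns in batch:
            yield key, columns
        start_key = key           # the last key seen becomes the next start
        skip_first = True
```

pycassa's get_range hides this kind of loop behind a generator, which is what makes iterating over it look like a plain for loop.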
I'll look into this to see if the extra functionality is worth the added complexity (but I think it is, since this is important functionality).
I'm trying to loop over all rows in a column family. This performs well and is very scalable in pycassa:
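Something along these lines, shown here only as a minimal sketch with placeholder keyspace, server and column family names:

```python
import pycassa

# placeholder connection details for the sketch
pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')

processed = 0
# get_range() returns an iterator that transparently pages through all rows
for key, columns in cf.get_range():
    processed += 1    # do something with key / columns here
```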
My test data set of ~200000 rows is processed in 20s (10000 rows/s) on a weak development virtual machine.
However, I haven't found a working solution in cassandra-simple. Using something like
will start loading all tokens from the database into memory. This works for a _rowcount of up to 10000; a bigger _rowcount results in enormous request times. I'm talking about several hours for 200000 rows.
Is there a scalable way in cassandra-simple that I missed? Or is this impossible at the moment?