fmgoncalves / p5-cassandra-simple

Cassandra::Simple Perl Module - Easy to use, Perl oriented client interface to Apache Cassandra.
http://fmgoncalves.github.com/p5-cassandra-simple

loop over all rows of a column family #10

Closed neonknight closed 12 years ago

neonknight commented 12 years ago

I'm trying to loop over all rows in a column family. In pycassa this performs well and scales well:

for row in mycolfam.get_range():

My test data set of ~200000 rows is processed in about 20 s (10000 rows/s) on a weak development virtual machine.

However, I haven't found a working solution in cassandra-simple. Using something like

$conn->get_range('mycolfam', {'row_count'=>$maxrows});

will start loading all tokens from the database into memory. This works for a row_count of up to 10000. A larger row_count makes the request take an enormous amount of time to finish: I'm talking about several hours for 200000 rows.

Is there a scalable way in cassandra-simple that I missed? Or is this impossible at the moment?

fmgoncalves commented 12 years ago

That feature in pycassa derives from a very nice Python feature. Since ColumnFamily.get_range implements an iterator, that is basically syntactic sugar for

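rows = mycolfam.get_range()
while len(rows) > 0:
  # do something with this batch of rows
  # fetch the next batch, starting at the last row key seen
  # (the start key is inclusive, so that row comes back again and must be skipped)
  rows = mycolfam.get_range(start=rows.keys()[-1])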

In that example, get_range probably uses the default row_count of 100 and repeats the request for as long as more rows come back.
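
The paging idea described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not pycassa's or Cassandra::Simple's actual implementation: `FAKE_ROWS`, `fetch_page` and `iter_all_rows` are hypothetical names, the "column family" is an in-memory dict, and rows are ordered by key for simplicity (a real cluster may order rows by token instead). The point is that only one page of rows is held in memory at a time, and that the inclusive start key has to be skipped on every page after the first.

```python
from bisect import bisect_left
from collections import OrderedDict

# Hypothetical stand-in for a column family: 250 rows, keys in sorted order.
FAKE_ROWS = OrderedDict(("key%04d" % i, {"col": i}) for i in range(250))
FAKE_KEYS = list(FAKE_ROWS)


def fetch_page(start="", row_count=100):
    """Hypothetical range query: up to row_count rows with key >= start."""
    first = bisect_left(FAKE_KEYS, start)
    page_keys = FAKE_KEYS[first:first + row_count]
    return OrderedDict((k, FAKE_ROWS[k]) for k in page_keys)


def iter_all_rows(page_size=100):
    """Yield (key, columns) for every row, one page in memory at a time."""
    if page_size < 2:
        # With an inclusive start key, a page of 1 row could never advance.
        raise ValueError("page_size must be at least 2")
    start = ""
    first_page = True
    while True:
        page = fetch_page(start=start, row_count=page_size)
        items = list(page.items())
        if not first_page:
            # The start key is inclusive, so the first row of this page
            # was already yielded as the last row of the previous page.
            items = items[1:]
        for key, columns in items:
            yield key, columns
        if len(page) < page_size:
            return  # a short page means there is nothing left to fetch
        start = next(reversed(page))  # last key of this page (ordered dict)
        first_page = False


if __name__ == "__main__":
    total = sum(1 for _ in iter_all_rows(page_size=100))
    print(total)
```

The ordered mapping matters here: the loop relies on knowing which key came last in a page, which is exactly what an unordered hash cannot tell you.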

To have this kind of logic (without the syntactic sugar), Cassandra::Simple would probably have to use something like Tie::Hash::DxHash so that get_range returns its keys in order and they can be iterated properly (pycassa uses Python's native OrderedDict). I'll look into this to see if the extra functionality is worth the added complexity (but I think it is, since this is important functionality).