CompanyBook / massive_record

HBase ruby client

Add support for Scanners with a prefix #89

Closed nkeyes closed 12 years ago

nkeyes commented 12 years ago

I have implemented support for prefixed scanners here. I have also updated the Thrift dependency from 0.6.0 to 0.8.0 because of a bug in 0.6.0.

I believe that the behavior of the :starts_with key is incorrect. In HBase, when you specify just the start key, the expected behavior is to get every record from that key to the end of the table, but massive_record uses it as a filter on the returned rows. Not only is this very inefficient for large row sets, but HBase Thrift already provides a scanner option for just such a use case: scannerOpenWithPrefix.

vincentp commented 12 years ago

I agree this can be misleading. I will merge your pull request and add a bit of documentation. Thanks a lot :)

vincentp commented 12 years ago

I merged a bit too fast. Actually, what you are trying to do here can already be done with the 'offset' option: it starts scanning the table at a given key and continues to the end of the table. I agree that it should use scannerOpenWithPrefix instead of scannerOpen, though. So could you update that one instead of creating a new option, and re-open the pull request?

nkeyes commented 12 years ago

I think maybe I didn't effectively communicate the intent of my pull request. I wanted a function that returns all records whose key begins with a given value; the rough equivalent of select * from table where id like 'prefix%'; in MySQL. Massive_record actually already provides this functionality with the :starts_with key, but does it very inefficiently: it opens a scanner that reads every record from the table starting from the id passed to :starts_with, and then, in Ruby, tests each id with a regex to see if it starts with the :starts_with value. This will not scale. For example, if you have a table with 200,000,000 records, and your :starts_with value matches the record at 100,000,001, but only 1,000 records actually have an id that begins with :starts_with, Ruby runs the regex test on 100,000,000 records, of which only 1,000 are returned. My :start_prefix implementation offloads that work to HBase, where it belongs, and also doesn't change the behavior of :starts_with for those who have written code that depends on that 'broken' behavior.
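To make the cost difference concrete, here is a toy model in plain Ruby. It is only a sketch of the two behaviors as described in this comment, not massive_record's or HBase's actual code: the table is a sorted array of row keys, and `ROW_KEYS`, `client_side_filter`, and `prefix_scan` are hypothetical names made up for the illustration.

```ruby
# Simulated table: row keys in sorted order, the way HBase stores them.
ROW_KEYS = %w[A1 A2 B1 B2 B3 C1 C2 C3 C4 C5].freeze

# Behavior attributed to :starts_with in the comment above: seek to the
# start key, fetch every remaining row, then filter by prefix on the client.
def client_side_filter(keys, prefix)
  fetched = keys.drop_while { |k| k < prefix }            # start key to end of table
  matched = fetched.select { |k| k.start_with?(prefix) }  # filtering done in Ruby
  { fetched: fetched.size, matched: matched }
end

# Behavior wanted from a prefix scanner: the scan itself ends at the first
# key that no longer carries the prefix, so only matching rows are fetched.
def prefix_scan(keys, prefix)
  matched = keys.drop_while { |k| k < prefix }
                .take_while { |k| k.start_with?(prefix) }
  { fetched: matched.size, matched: matched }
end

client_side_filter(ROW_KEYS, "B")  # => {:fetched=>8, :matched=>["B1", "B2", "B3"]}
prefix_scan(ROW_KEYS, "B")         # => {:fetched=>3, :matched=>["B1", "B2", "B3"]}
```

Both approaches return the same rows; the difference is how many rows have to cross the wire before the answer is known.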

vincentp commented 12 years ago

Ok, I understand, but I don't see why the :starts_with option prevents you from doing so. If your table has the rows A1, B1, B2, C1 (in HBase, rows are ordered by key) and you start scanning your table with :starts_with => "B", it will put a pointer at B1 and scan down to B2. If there are 100,000,000 AX rows, HBase will simply skip them on its side; those rows will never go through Thrift or MassiveRecord.

The regex was there to stop at "C1", as it doesn't start with "B". The :offset option allows you to paginate that data: if you have B1, B2, B3, B4, C1, you can start at :offset => "B3" and still use :starts_with => "B".

So in the end, scannerOpenWithPrefix is only helpful if you want all rows starting with "B"; you can't start at "B3", for example (or you will have to skip rows yourself).
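The combination of :offset and :starts_with described here can be modeled in a few lines of plain Ruby. This is a sketch of the semantics as stated in this comment (seek to the offset, stop at the first key without the prefix), assuming a sorted array stands in for the table; `scan_rows` is a hypothetical helper, not a massive_record method.

```ruby
# Seek to the offset (or to the prefix itself when no offset is given),
# then read rows until the first key that does not start with the prefix.
def scan_rows(keys, starts_with:, offset: nil)
  start_key = offset || starts_with
  keys.drop_while { |k| k < start_key }
      .take_while { |k| k.start_with?(starts_with) }
end

table = %w[A1 B1 B2 B3 B4 C1]

scan_rows(table, starts_with: "B")                # => ["B1", "B2", "B3", "B4"]
scan_rows(table, starts_with: "B", offset: "B3")  # => ["B3", "B4"]
```

In this model the offset only moves the starting point; the prefix still decides where the scan ends, which is what makes pagination within a prefix possible.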

nkeyes commented 12 years ago

In your example, the problem is not the 100,000,000 'AX' records before the 'B' records; it's all the 'C' records after what I am looking for. Yes, :starts_with is effective at skipping rows before what you are looking for, but it still returns everything after it too. For example, let's say the table still has 200,000,000 records, but none that start with 'A'; everything starts with 'B', 'C', 'D', 'E', etc. Further, let's still say that there are only 1,000 'B' records. In this case, :starts_with => 'B' doesn't skip any records at all, because there just happen not to be any 'A' records, so it passes all 200,000,000 records back to Ruby to be evaluated by the regex. :start_prefix => 'B' not only offloads skipping all of the 'A' records to HBase, it also offloads stopping at the 'C' records to HBase.
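One common way to let the storage side stop at the 'C' records is to turn the prefix into a bounded key range: start at the prefix and stop at the smallest key that sorts after every key with that prefix. The sketch below shows that idea in plain Ruby. It is an assumption about how such a bound can be computed, not a statement of what scannerOpenWithPrefix does internally; `prefix_stop_key` and `range_scan` are hypothetical helpers and the sorted array stands in for the table.

```ruby
# Smallest key that sorts after every key beginning with `prefix`:
# drop trailing 0xFF bytes, then increment the last remaining byte.
# Returns nil when no such key exists (the scan must run to the end).
def prefix_stop_key(prefix)
  bytes = prefix.bytes
  bytes.pop while bytes.last == 0xFF
  return nil if bytes.empty?
  bytes[-1] += 1
  bytes.pack("C*")
end

# Range scan over sorted keys: start key inclusive, stop key exclusive.
def range_scan(keys, start_key, stop_key)
  keys.drop_while { |k| k < start_key }
      .take_while { |k| stop_key.nil? || k < stop_key }
end

table = %w[B1 B2 C1 D1 E1]   # no 'A' rows, as in the scenario above
stop  = prefix_stop_key("B") # => "C"
range_scan(table, "B", stop) # => ["B1", "B2"]
```

With a stop key in place, the rows after the prefix are never read, regardless of how many of them the table holds.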