Scan/Get with a where clause

jchailloux commented 10 years ago

Is it possible to think about adding a whereclause parameter to limit the number of rows.

If we have a HBase table that contains status (0/1), I would like to scan or get data where status = 1.

hildrum commented 10 years ago

I think that if we don't push that filtering to the region servers, we should suggest people do that in Streams (via a filter operator or a functor, for example).

But, I have been thinking about how we might allow people to specify filters. Probably we could have a filterClassName parameter and a filterLibrary parameter, and then expect that class name to be in the library, and use that as a filter. But that's not very easily useable, so I was wondering if there is a reasonable way to support some subset of filter functionality more naturally.

jchailloux commented 10 years ago

I had to make it available for a customer. ("HbaseSelect" on my side).

The first think it to be able to scan when with a partial key using a Filter (get in HBASE 0.99 could deal with partial key).

If the HBaseScan have an input port then you might create the Scan like that

                scan=new Scan();
                Filter filter = new PrefixFilter(startRow); //startRow is the partial key
                scan.setFilter(filter);

instead of

        scan = new Scan(startRow,endRow);

jchailloux commented 10 years ago

we could have parameters like that for the Where clause

whereclause : "GREATER" or  "EQUALS" or"LOWER"; // combinaison with equals could be done
wherclausevalue : "1";
whereclauseattribute : "status";

hildrum commented 10 years ago

A scan with a partial row key will be possible as soon as I make Get scan multiple rows or make Scan read from an input tuple.

This sort of where clause is easily supported on the Streams-side, as part of the Streams app. Do you currently have a client who needs to push it server-side?

jchailloux commented 10 years ago

This comes from a customer's use case I'm currently working with.

I have a HBase table that contains alerts. All alerts have a status (0/1 -> enable/disable). We need to all alerts for a specific ID (which is part of the key) and if possible where the status is 1. At this time the ColumnValue filter is not yet implemented (doing that right now).

We can still do the filtering in streams but the size and the number of alerts (Key contains timestamps) could be hudge. If we can reduce the number of rows at the source level it will be better from my point of view.

hildrum commented 10 years ago

How about this as a plan:

Add a parameter filterJar
Add a parameter createFilterClass

The the operator will call the createFilterClass's createFilter method, and get back a filter, which it will then use with the scan.

jchailloux commented 10 years ago

I think that asking to develop java classes is not a good solution.

The think is that the filter itself is one line as I wrote before. Asking to write one doesn't make sens for me. If we think having a full filter that contains all HBASE filters maybe.

In my case I only deal with (null == endRow) to switch between scan(filter) or scan(startRow,endRow).

jchailloux commented 10 years ago

Suggestion 1

Having a switch to deal with this filter in code itself as I did (I guess that 80% of the needs will be there)

Suggestion 2

Having a Abstract class that ask to the developper to implements all filters needed.

For sure the suggestion 2 will be the most powerfull but the SPL development model will be break. As soon as I have to develop in Java I could create my dedicated operator. The more descriptive we are the more SPL complient we will be.

hildrum commented 10 years ago

(1) A scan with a start row and appropriate end row is just as powerful as a scan with the prefix filter, right? Why would you prefer a prefix filter to using a startRow and endRow?

(2) As far as the question about status = 1, are you proposing the operator do that internally, or creating a filter to do it?

jchailloux commented 10 years ago

Example is a phone company.

You could have one contract with multiple phoneNumbers.

1 - you migh not kwow the startRow and the ednRow. Key is : contractID | phoneNumber | timestamp

You may want all data for the "current contract" or "contract | current customerId" or for "contract | phoneNumber | (year,month,day)".

That's why we need a filter. The key is partial. It's a where clause on the key and not on the value of qualifiers.

jchailloux commented 10 years ago

For the point 2 I'm trying to do it at the operator level.I would like to avoid having to write a dedicated filter in Java.

We could think about creating a Java class that contains all filters that can be used (all SQL where clauses). The key filter might be done by using the input tuple.

I'm open to discuss about having a generic Java class that could deal with all possible filters.

I'm expressing a customer whish.

hildrum commented 10 years ago

Okay, but if a prefix filter will work for that schema, so will startRow & and appropriately set endRow. If you just want everything with a contract ID, your start row is contractID, if you want everything with a particular customer, your start row is contractID|current customerID, if you want everything with a particular customer after a particular id, your start row is contractID|current customerID|startTimestamp.

To set the end row, you just increment the last byte on the prefix, then you can do exactly the prefix filter using startRow and endRow. But, having to increment the last byte to get the endRow is too messy, so the operator should do that internally.

So, I think maybe a rowPrefix parameter makes sense. You could then do a scan with:

all table
using startRow
using endRow
using startRow and endRow
using rowPrefix

Does this make sense?

jchailloux commented 10 years ago

I'm ok with a rowPrefix for a basicScan.

What about the same scan that have to use the "rowPrefix" as an attribute from the input tuple ?

This is the way the customer wants to scan.

hildrum commented 10 years ago

(2) I don't see the point of having the HBASEScan or HBASEGet operator do it. You can use a Filter operator downstream of the SPL operator to filter out the cases where status isn't the right value. Then you can use the full expressive power of SPL, and get exactly the filtering conditions you want, and you're not limited by whatever syntax we choose to support. By the time the operator handles it, the data has already been extracted from HBASE and shipped it over the network, so the additional cost of sending it to the spl Filter operator is likely to be small.

If you wanted to do server-side filtering, that might be worth the effort to implement. I think we could support a few simple filters that way. An equality filter might look something like rowFilter: "all:status==1";, and it'd only give back tuples from a row where the value for columnFamily all, columnQualifier status was equal to "1". (Internally, I'd split the string on '==' to get the all:status and the value "1", then I'd split the all:status on the :, to get the columnFamily and columnQualifier.) If someone had two such filters, they could do rowFilter: "...","...";

Your where clause example doesn't seem that extensible. How do you think someone would filter on two different attributes?

jchailloux commented 10 years ago

At this time I can live the downstream process. Because I have few values. If you have millions of values (HBase use case) it make more sense to filter outside streams.

For this use case(filtering on values) I could agree on having a separate Java classe that implements the filter. But, HBase propose simple ColumnValueFiler using GREATER, EQUALS,... that can be used at the constructor level.

BTW The most important whereclause is about the partial key and we agreed about rowPrefix for basic scan.We still have to deal with this "rowFilter" as an attribute.

For the value itself, we can still continue to discuss as we have a work around using functor or filter.

The idea you had of having a Filter class might be interesting for all use cases that have to deal with values instead of the scan for partial key coming from parameter or attribute.

In that case you should have to propose few kinds of filters that can can be used with extrenal parmeters (XML, properties, ....)

BruceGlassford commented 8 years ago

I've got a customer use case that needs this very badly. Many millions of records - need to pull back based on a value in a column value to get the subset I need. RowID prefixes won't work since the data has random prefixes for distribution. Pulling multiple terabytes of data over the network to Streams to filter them out will likely cause the network admins to be rather peeved with me, so the HbaseGet->Filter won't help. Will take a look at the code and see if I can figure out how to add it rather than writing my own operator to do it. Or if either of you has already got it in place, just needing testing PLEASE let me know...

IBMStreams / streamsx.hbase

Scan/Get with a where clause #32