Closed jchailloux closed 5 years ago
I think that if we don't push that filtering to the region servers, we should suggest people do that in Streams (via a filter operator or a functor, for example).
But, I have been thinking about how we might allow people to specify filters. Probably we could have a filterClassName parameter and a filterLibrary parameter, and then expect that class name to be in the library, and use that as a filter. But that's not very easily useable, so I was wondering if there is a reasonable way to support some subset of filter functionality more naturally.
I had to make it available for a customer. ("HbaseSelect" on my side).
The first think it to be able to scan when with a partial key using a Filter (get in HBASE 0.99 could deal with partial key).
If the HBaseScan have an input port then you might create the Scan like that
scan=new Scan();
Filter filter = new PrefixFilter(startRow); //startRow is the partial key
scan.setFilter(filter);
instead of
scan = new Scan(startRow,endRow);
we could have parameters like that for the Where clause
whereclause : "GREATER" or "EQUALS" or"LOWER"; // combinaison with equals could be done
wherclausevalue : "1";
whereclauseattribute : "status";
A scan with a partial row key will be possible as soon as I make Get scan multiple rows or make Scan read from an input tuple.
This sort of where clause is easily supported on the Streams-side, as part of the Streams app. Do you currently have a client who needs to push it server-side?
This comes from a customer's use case I'm currently working with.
I have a HBase table that contains alerts. All alerts have a status (0/1 -> enable/disable). We need to all alerts for a specific ID (which is part of the key) and if possible where the status is 1. At this time the ColumnValue filter is not yet implemented (doing that right now).
We can still do the filtering in streams but the size and the number of alerts (Key contains timestamps) could be hudge. If we can reduce the number of rows at the source level it will be better from my point of view.
How about this as a plan:
filterJar
createFilterClass
The the operator will call the createFilterClass's createFilter method, and get back a filter, which it will then use with the scan.
I think that asking to develop java classes is not a good solution.
The think is that the filter itself is one line as I wrote before. Asking to write one doesn't make sens for me. If we think having a full filter that contains all HBASE filters maybe.
In my case I only deal with (null == endRow) to switch between scan(filter) or scan(startRow,endRow).
Suggestion 1
Having a switch to deal with this filter in code itself as I did (I guess that 80% of the needs will be there)
Suggestion 2
Having a Abstract class that ask to the developper to implements all filters needed.
For sure the suggestion 2 will be the most powerfull but the SPL development model will be break. As soon as I have to develop in Java I could create my dedicated operator. The more descriptive we are the more SPL complient we will be.
(1) A scan with a start row and appropriate end row is just as powerful as a scan with the prefix filter, right? Why would you prefer a prefix filter to using a startRow and endRow?
(2) As far as the question about status = 1, are you proposing the operator do that internally, or creating a filter to do it?
Example is a phone company.
You could have one contract with multiple phoneNumbers.
1 - you migh not kwow the startRow and the ednRow. Key is : contractID | phoneNumber | timestamp
You may want all data for the "current contract" or "contract | current customerId" or for "contract | phoneNumber | (year,month,day)".
That's why we need a filter. The key is partial. It's a where clause on the key and not on the value of qualifiers.
For the point 2 I'm trying to do it at the operator level.I would like to avoid having to write a dedicated filter in Java.
We could think about creating a Java class that contains all filters that can be used (all SQL where clauses). The key filter might be done by using the input tuple.
I'm open to discuss about having a generic Java class that could deal with all possible filters.
I'm expressing a customer whish.
Okay, but if a prefix filter will work for that schema, so will startRow & and appropriately set endRow.
If you just want everything with a contract ID, your start row is contractID
, if you want everything with a particular customer, your start row is contractID|current customerID
, if you want everything with a particular customer after a particular id, your start row is contractID|current customerID|startTimestamp
.
To set the end row, you just increment the last byte on the prefix, then you can do exactly the prefix filter using startRow and endRow. But, having to increment the last byte to get the endRow is too messy, so the operator should do that internally.
So, I think maybe a rowPrefix
parameter makes sense. You could then do a scan with:
Does this make sense?
I'm ok with a rowPrefix for a basicScan.
What about the same scan that have to use the "rowPrefix" as an attribute from the input tuple ?
This is the way the customer wants to scan.
(2) I don't see the point of having the HBASEScan or HBASEGet operator do it. You can use a Filter operator downstream of the SPL operator to filter out the cases where status
isn't the right value. Then you can use the full expressive power of SPL, and get exactly the filtering conditions you want, and you're not limited by whatever syntax we choose to support. By the time the operator handles it, the data has already been extracted from HBASE and shipped it over the network, so the additional cost of sending it to the spl Filter operator is likely to be small.
If you wanted to do server-side filtering, that might be worth the effort to implement. I think we could support a few simple filters that way. An equality filter might look something like
rowFilter: "all:status==1";
, and it'd only give back tuples from a row where the value for columnFamily all
, columnQualifier status
was equal to "1". (Internally, I'd split the string on '==' to get the all:status and the value "1", then I'd split the all:status on the :
, to get the columnFamily and columnQualifier.) If someone had two such filters, they could do rowFilter: "...","...";
Your where clause example doesn't seem that extensible. How do you think someone would filter on two different attributes?
At this time I can live the downstream process. Because I have few values. If you have millions of values (HBase use case) it make more sense to filter outside streams.
For this use case(filtering on values) I could agree on having a separate Java classe that implements the filter. But, HBase propose simple ColumnValueFiler using GREATER, EQUALS,... that can be used at the constructor level.
BTW The most important whereclause is about the partial key and we agreed about rowPrefix for basic scan.We still have to deal with this "rowFilter" as an attribute.
For the value itself, we can still continue to discuss as we have a work around using functor or filter.
The idea you had of having a Filter class might be interesting for all use cases that have to deal with values instead of the scan for partial key coming from parameter or attribute.
In that case you should have to propose few kinds of filters that can can be used with extrenal parmeters (XML, properties, ....)
I've got a customer use case that needs this very badly. Many millions of records - need to pull back based on a value in a column value to get the subset I need. RowID prefixes won't work since the data has random prefixes for distribution. Pulling multiple terabytes of data over the network to Streams to filter them out will likely cause the network admins to be rather peeved with me, so the HbaseGet->Filter won't help. Will take a look at the code and see if I can figure out how to add it rather than writing my own operator to do it. Or if either of you has already got it in place, just needing testing PLEASE let me know...
Is it possible to think about adding a whereclause parameter to limit the number of rows.
If we have a HBase table that contains status (0/1), I would like to scan or get data where status = 1.