ParallelAI / SpyGlass

Cascading and Scalding wrapper for HBase with advanced read features
Apache License 2.0

HBase raw tap #1

Closed rore closed 11 years ago

rore commented 11 years ago

This change adds HBaseRawTap and HBaseRawScheme.

The point of this tap is to remove the need to define the input columns in advance, and to allow more control over how HBase rows are handled in a Cascading (and Scalding) job.

The source tap outputs (rowkey, row) pairs, where row is the actual row object. This makes it possible to collect and manipulate a changing set of columns in the mapper without predefining them. For instance, the first mapper in the pipe can transform the row like this (using Scalding syntax):

hbaseSource.map(('rowkey, 'row) -> ('key, 'field1, 'field2, 'field3))

where the output fields can be a combination of different columns in each row.
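
To make the idea of "a combination of different columns in each row" concrete, here is a minimal, dependency-free sketch of the kind of function that could sit inside that map. It does not use the HBase or Scalding APIs: the `Row` type is a hypothetical stand-in (column name to raw bytes) for the real row object, and the column names `cf:name`, `cf:alias` and `cf:city` are made up for illustration.

```scala
import java.nio.charset.StandardCharsets

object RawRowSketch {
  // Hypothetical stand-in for an HBase row: column name -> raw cell bytes.
  type Row = Map[String, Array[Byte]]

  private def utf8(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)

  // Read a column as a string, or return an empty string if this row lacks it.
  private def col(row: Row, name: String): String =
    row.get(name).map(utf8).getOrElse("")

  // Same shape as ('rowkey, 'row) -> ('key, 'field1, 'field2, 'field3).
  def toFields(rowkey: Array[Byte], row: Row): (String, String, String, String) = {
    val key = utf8(rowkey)
    // field1 comes from whichever of two columns this row happens to have.
    val field1 = row.get("cf:name").orElse(row.get("cf:alias")).map(utf8).getOrElse("")
    val field2 = col(row, "cf:city")
    // field3 depends on the whole row: the number of columns present.
    val field3 = row.size.toString
    (key, field1, field2, field3)
  }
}
```

Because the function receives the whole row, each output field can be derived from a different set of columns per row, which is what a fixed column list cannot express.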

The source tap also supports passing a scan object (Base64-encoded) to fully customize the HBase read.

The sink tap expects a rowkey in the tuple, and writes the other values as columns.
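
As a rough illustration of the Base64 step only, the sketch below encodes an already-serialized payload into a string and decodes it back, using the JDK's `java.util.Base64`. How the scan itself is serialized to bytes is HBase-specific and is not shown; the `ScanStringSketch` object and its method names are hypothetical, and the input bytes stand in for a serialized scan.

```scala
import java.util.Base64

object ScanStringSketch {
  // Encode serialized bytes as a Base64 string, so they can travel in a text-only setting.
  def encode(serialized: Array[Byte]): String =
    Base64.getEncoder.encodeToString(serialized)

  // Recover the original bytes on the reading side.
  def decode(encoded: String): Array[Byte] =
    Base64.getDecoder.decode(encoded)
}
```

The encoded form is plain ASCII, which is why a binary scan definition can be handed to the tap as an ordinary string.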

Please note that I've bumped the CDH versions to the latest (needed for an HBase function that is missing in previous versions). I also updated Scala to the latest version (any problem with that?).

crajah commented 11 years ago

Hi Rotem,

I've pulled your changes in. Could you please add documentation for your code to the README or wiki?

Cheers, --- Chandan

rore commented 11 years ago

Great. Will do. Can you publish a new build with it? (BTW, why not publish to Maven Central?)