ParallelAI / SpyGlass

Cascading and Scalding wrapper for HBase with advanced read features
Apache License 2.0

nulls in hbase #5

Closed koertkuipers closed 11 years ago

koertkuipers commented 11 years ago

I noticed that when writing to HBase, HBasePipeWrapper.toBytesWritable replaces nulls with empty strings, which end up stored in HBase as empty byte arrays. It looks like HBaseScheme would throw an NPE if one tried to store a null.

When reading from HBase, HBaseScheme replaces nulls (which represent missing column values) with empty byte arrays.

HBase is a sparse store after all, so why not benefit from this?

I propose that when writing to HBase we do not replace nulls with empty strings, but instead do not write those cells at all. And when reading from HBase, we put the nulls that HBase emits for missing values into the Cascading tuple instead of replacing them with empty byte arrays. This works well because null is also what Cascading tuples use to represent missing values.
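To make the proposal concrete, here is a minimal sketch of the intended round trip. It does not use the real HBaseScheme or HBase client code: a row is modelled as a plain `Map` from column name to bytes, and the names `SparseNullSketch`, `toCells` and `toTuple` are made up for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the write/read paths; not SpyGlass code.
public class SparseNullSketch {

    // Write side: build the cells for one row, skipping null values
    // entirely instead of storing them as empty byte arrays.
    static Map<String, byte[]> toCells(List<String> columns, List<String> values) {
        Map<String, byte[]> cells = new HashMap<>();
        for (int i = 0; i < columns.size(); i++) {
            String value = values.get(i);
            if (value != null) {
                cells.put(columns.get(i), value.getBytes(StandardCharsets.UTF_8));
            }
        }
        return cells;
    }

    // Read side: a column that is absent from the row becomes null in the
    // tuple, rather than being replaced by an empty byte array.
    static List<byte[]> toTuple(List<String> columns, Map<String, byte[]> cells) {
        List<byte[]> tuple = new ArrayList<>();
        for (String column : columns) {
            tuple.add(cells.get(column)); // Map.get returns null when missing
        }
        return tuple;
    }
}
```

With this behaviour an empty string and a null stay distinguishable after a round trip: the empty string is stored as a zero-length cell, while the null is simply not stored.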

Are there any downsides to this approach? I have a pull request ready if this is considered a good idea.

koertkuipers commented 11 years ago

To be clear, I am talking about nulls for cell values, not for keys.

crajah commented 11 years ago

Hi Koert,

I like your proposal. Please go ahead and implement it. I'll accept your pull request.

Cheers, Chandan


Chandan Rajah http://www.chandanrajah.com

On 5 Aug 2013, at 22:42, koertkuipers notifications@github.com wrote: (quoted copy of the original comment above)