Issue when doing any join except full outer

bfemiano / accumulo-hive-storage-manager

Working commits for Hive connector to Accumulo. This will eventually be checked directly into Accumulo.

Apache License 2.0


Issue when doing any join except full outer #4

Open carlaustin opened 11 years ago

carlaustin commented 11 years ago

Hadoop Version: 1.3

When using the Accumulo storage manager to do simple joins (example below), an IllegalArgumentException is thrown with the message "null columnMapping not allowed". I have worked around this by modifying initAccumuloSerdeParameters to set COLUMN_MAPPINGS and LIST_COLUMN_TYPES on conf to the values in properties. It also appears that type strings can be either colon-separated or comma-separated, so I created a new method in AccumuloHiveUtils that splits the column types string using the pattern :|,
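The splitting step described above could look roughly like the following. This is only a sketch of the idea, not the method that was actually added to AccumuloHiveUtils: the class name ColumnTypeSplitter and the method name splitColumnTypes are hypothetical. It also assumes primitive column types only, since a complex Hive type such as map<string,int> contains a comma of its own and would be split incorrectly by this pattern.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

public class ColumnTypeSplitter {
    // The pattern from the comment above: split on either ':' or ','.
    private static final Pattern SEPARATOR = Pattern.compile(":|,");

    // Hypothetical helper name; assumes primitive type names only.
    public static List<String> splitColumnTypes(String columnTypes) {
        if (columnTypes == null || columnTypes.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(SEPARATOR.split(columnTypes));
    }

    public static void main(String[] args) {
        // Both separator styles yield the same list of type names.
        System.out.println(splitColumnTypes("string:int:double")); // [string, int, double]
        System.out.println(splitColumnTypes("string,int,double")); // [string, int, double]
    }
}
```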

This now enables all types of joins. I'm happy to send the changes your way, but I wonder whether this is a workaround rather than a fix for the root issue. My knowledge of Hive storage managers is very limited, so I can't tell whether I should be doing it differently.

Example join:

SELECT * FROM tablea a JOIN tableb b ON a.id = b.id

bfemiano commented 11 years ago

Both of those properties are already exposed and configurable via AccumuloSerde.java and serdeConstants.java, respectively. The example join you demonstrated should be possible when joining on simple scalar value types (int, double, etc.).

carlaustin commented 11 years ago

I know that they are exposed, but the null column mapping error still occurs when doing the join. After adding a lot of logging, I found that .get(COLUMN_MAPPINGS) returns null, but only when doing a join, even though it should have been set. I debugged into AccumuloSerde.initAccumuloSerdeParameters and it was clear that COLUMN_MAPPINGS was correctly present on the properties object but did not get set on the job conf, hence the null. This means that the join doesn't work. I tried it many times and in many ways with very simple data (two tables with just a couple of columns). FULL OUTER JOIN and any non-join query worked, but any other join throws the error. Note that the joins I did were on strings.
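To make the workaround concrete, here is a minimal sketch of copying the two values from the table properties onto the conf when they are absent there. It is not the actual diff: a plain Map stands in for the Hadoop job conf so the example runs without Hadoop on the classpath, the class and method names are hypothetical, and the key string used for COLUMN_MAPPINGS is an assumption (the real value lives in AccumuloSerde.java). The mapping and type values in main are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ConfFallback {
    // Assumed key string, for illustration only; check AccumuloSerde.java for the real one.
    public static final String COLUMN_MAPPINGS = "accumulo.columns.mapping";
    // Value of Hive's serdeConstants.LIST_COLUMN_TYPES.
    public static final String LIST_COLUMN_TYPES = "columns.types";

    // Copy a table property onto the conf only when the conf lacks it.
    // The Map is a stand-in for the Hadoop job conf.
    public static void copyIfMissing(Properties tableProps, Map<String, String> conf, String key) {
        String value = tableProps.getProperty(key);
        if (value != null && conf.get(key) == null) {
            conf.put(key, value);
        }
    }

    public static void main(String[] args) {
        Properties tableProps = new Properties();
        tableProps.setProperty(COLUMN_MAPPINGS, "rowID,cf|name"); // illustrative value
        tableProps.setProperty(LIST_COLUMN_TYPES, "string:string"); // illustrative value

        Map<String, String> conf = new HashMap<>(); // empty, as observed during the join
        copyIfMissing(tableProps, conf, COLUMN_MAPPINGS);
        copyIfMissing(tableProps, conf, LIST_COLUMN_TYPES);

        System.out.println(conf.get(COLUMN_MAPPINGS) != null); // true
    }
}
```

Copying only when the key is missing leaves any value that is already on the conf untouched. The workaround described in the thread may instead have set the values unconditionally.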

I can send you the diffs I used to make it work for me if you would like.

For info, I was using the Hortonworks HDP 1.3 sandbox with Accumulo installed on top to replicate and debug this issue.

I said Hadoop v1.3 in the OP by mistake; I meant 1.2.

bfemiano commented 10 years ago

I am about to revisit the codebase and will see if I can reproduce this. Many of my initial test cases in the ACLED scripts did inner joins similar to the one you describe as not working, although not necessarily on Strings. I will try to reproduce it on CDH4.5.

Thanks and sorry this took so long.

carlaustin commented 10 years ago

No worries, I actually fixed it myself in my codebase. I've also implemented basic INSERT INTO in my codebase, but this is tied to other code. I could have a look at making it more generic and providing it if you would be interested?

bfemiano commented 10 years ago

Sure. That would be great. I'm going to implement a simple Mutation based output format.

Will you be at the June 12th summit?

carlaustin commented 10 years ago

I won't be at the summit unfortunately.

I've already created an OutputFormat and RecordWriter that write mutations from rows of data serialized in the AccumuloSerde. I'll look into replacing the non-generic bits so I can share it with you.

joshelser commented 10 years ago

For full closure, I've run a few joins so far with success in the code heading towards Hive. I'll try to add some more to exhaust the join types, but I think I have this fixed already.