Closed by ghost 11 years ago
Can you please have a look at the latest code? 0.7.0 is fairly different from 0.6.0 and it looks like we are now doing the right thing:
```scala
def getSplits(context: JobContext): java.util.List[InputSplit] = {
  getInputFormats(context).flatMap { case (channel, format) =>
    val conf = extractChannelConfiguration(context, channel)
```
I can't run 0.7.0, as I have a raft of code which doesn't run on 2.10, and it looks like 0.7.0 only targets 2.10?
That said, I looked at the code, and it doesn't look to be fixed there. The issue is that the InputFormat is being instantiated via ReflectionUtils and passed the un-extracted configuration (ChannelsInputFormat.getInputFormats, line 177 in master). That code is in getInputFormats. In my change, I pass conf out of getInputFormats purely for efficiency, to avoid extracting twice; the real point is that the conf the InputFormat gets instantiated with is the extracted version for that InputFormat, not the common one for ChannelsInputFormat.
Similarly, ChannelRecordReader instantiates the InputFormat with the nullary constructor (line 196), and so TableInputFormat NPEs, if I recall correctly.
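To illustrate the pattern being discussed, here is a minimal self-contained sketch. Everything in it is a hypothetical stand-in, not Scoobi's or Hadoop's actual code: a `Map` plays the role of Hadoop's `Configuration`, `TableLikeFormat` stands in for an InputFormat (like HBase's TableInputFormat) that actually consumes the configuration it is handed, and `newInstance` mimics what Hadoop's `ReflectionUtils.newInstance` does (nullary construction, then `setConf`):

```scala
// Hypothetical stand-in for Hadoop's Configurable; a Map plays the
// role of Configuration.
trait Configurable {
  def setConf(conf: Map[String, String]): Unit
}

// Stand-in for an InputFormat (like TableInputFormat) that consumes
// the configuration it receives in setConf.
class TableLikeFormat extends Configurable {
  var table: String = null
  def setConf(conf: Map[String, String]): Unit =
    // Fails if instantiated with the common, un-extracted configuration,
    // because the per-channel key would be missing there.
    table = conf("hbase.table")
}

// Mimics ReflectionUtils.newInstance: construct via the nullary
// constructor, then hand over the configuration through setConf.
def newInstance[T <: Configurable](clazz: Class[T], conf: Map[String, String]): T = {
  val instance = clazz.getDeclaredConstructor().newInstance()
  instance.setConf(conf)
  instance
}

// The point of the change: instantiate each InputFormat with the
// *extracted* per-channel configuration, not the shared one.
val extractedConf = Map("hbase.table" -> "my_table")
val format = newInstance(classOf[TableLikeFormat], extractedConf)
println(format.table) // my_table
```

A format that ignores `setConf` would work either way, which is why the bug only surfaces with configuration-sensitive formats.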
@rmellgren - on a side note, are you able to share/contribute the code that implements a Scoobi DataSource for TableInputFormat?
And to answer your previous questions, yes, 0.7.0 only targets 2.10.
Thanks, I'm going to take a closer look at this issue.
@blever sure, I have one for DBInputFormat too, though I haven't tested that one yet. How do you want it?
A PR. Probably add a directory such as src/main/scala/com/nicta/scoobi/io/table and/or src/main/scala/com/nicta/scoobi/io/db and place your additions there. Bonus points if you have any Specs too :)
It'd need deps added for HBase, which pulls in a ton of things - should I add that too? Also, sorry, no Specs, though I can look at writing a couple if you have some examples.
That's true and a good point. HBase is kind of separate from the Hadoop project, which means we can't make assumptions about particular JARs being available in certain places, etc., which always complicates launching and running.
If you're happy to contribute we have 2 choices:

1. Add the code to the core Scoobi project; or
2. Create a separate scoobi-hbase project.

The latter may be a better choice and is similar to what was done for the MongoDB support - https://github.com/mongodb/mongo-hadoop/tree/master/scoobi
However, in both cases, I think the directory structure of the new files will be the same. So, if you made a PR targeting it as an addition to the core scoobi project, we can make the call later.
Does that sound reasonable?
Okay, I'll try to get some time later this week to do a new project; maybe prod me if I forget.
Just a PR against the existing Scoobi project would be enough for now. Thanks!
Hi Ross, I have now incorporated your changes into the 0.7.0 codebase (please ignore the commit mess on this issue; the relevant commit is c58f7fc).
cool, thanks!
When an InputFormat is initialized using ReflectionUtils it's passed a Configuration which gets handed to the InputFormat via setConf. For the ChannelInputFormat, that Configuration is not always correct. This works out most of the time because it becomes correct by the time getSplits is called, and so InputFormats which don't bother to consult the Configuration passed in setConf work out just fine.
However, TableInputFormat from HBase is an InputFormat which does care.
The places where initialization happens without a configuration, or with the wrong one, are ChannelInputFormat#getSplits and ChannelRecordReader#originalRR.
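A tiny sketch of why one path works by accident and the other fails. Again, these are hypothetical stand-ins (a `Map` plays the role of Hadoop's `Configuration`; `CachingFormat` stands in for a format like TableInputFormat that caches the configuration from `setConf` and consults it later):

```scala
// Hypothetical stand-in for Hadoop's Configurable.
trait Configurable {
  def setConf(conf: Map[String, String]): Unit
}

// Stand-in for a format that caches the configuration from setConf
// and reads it later, the way TableInputFormat does.
class CachingFormat extends Configurable {
  private var conf: Map[String, String] = null
  def setConf(c: Map[String, String]): Unit = conf = c
  def tableName: String = conf("hbase.table") // NPEs if setConf never ran
}

// Path 1: nullary construction only, so setConf is never called and the
// first use of the cached configuration blows up.
val viaNullary = new CachingFormat
val failed =
  try { viaNullary.tableName; false }
  catch { case _: NullPointerException => true }

// Path 2: construction followed by setConf with the extracted channel
// configuration, so the format finds what it needs.
val viaExtracted = new CachingFormat
viaExtracted.setConf(Map("hbase.table" -> "my_table"))

println(failed)                 // true
println(viaExtracted.tableName) // my_table
```

A format that never reads the cached configuration takes both paths without complaint, which matches the "works out most of the time" behaviour described above.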
I've patched this in a branch; here's the commit:
https://github.com/paytronix/scoobi/commit/da65f8f25aaa11690e3ac8d08d91e74ac36adc6a
For me, this allows HBase TableInputFormat to be used as a Scoobi DataSource.