Closed by ghost 11 years ago
Can you please have a look at the latest code? 0.7.0 is fairly different from 0.6.0 and it looks like we are now doing the right thing:
```scala
def getSplits(context: JobContext): java.util.List[InputSplit] = {
  getInputFormats(context).flatMap { case (channel, format) =>
    val conf = extractChannelConfiguration(context, channel)
```
I can't run 0.7.0, as I have a raft of code which doesn't run on 2.10, and it looks like 0.7.0 only targets 2.10?
That said, I looked at the code, and it doesn't look to be fixed there. The issue is that the InputFormat is being instantiated via ReflectionUtils and passed the un-extracted configuration (ChannelsInputFormat.getInputFormats, line 177 in master). That code is in getInputFormats. In my change, I pass conf out of getInputFormats purely for efficiency, to avoid extracting twice; the real point is that the conf the InputFormat gets instantiated with is the extracted version for that InputFormat, not the common one for ChannelsInputFormat.
Similarly, ChannelRecordReader instantiates the InputFormat with the nullary constructor (line 196), and so TableInputFormat NPEs, if I recall correctly.
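To illustrate the pattern being discussed, here is a minimal self-contained sketch. Everything in it is a hypothetical stand-in, not Scoobi's or Hadoop's actual code: a `Map` plays the role of Hadoop's `Configuration`, `TableLikeFormat` stands in for an InputFormat (like HBase's TableInputFormat) that actually consumes the configuration it is handed, and `newInstance` mimics what Hadoop's `ReflectionUtils.newInstance` does (nullary construction, then `setConf`):

```scala
// Hypothetical stand-in for Hadoop's Configurable; a Map plays the
// role of Configuration.
trait Configurable {
  def setConf(conf: Map[String, String]): Unit
}

// Stand-in for an InputFormat (like TableInputFormat) that consumes
// the configuration it receives in setConf.
class TableLikeFormat extends Configurable {
  var table: String = null
  def setConf(conf: Map[String, String]): Unit =
    // Fails if instantiated with the common, un-extracted configuration,
    // because the per-channel key would be missing there.
    table = conf("hbase.table")
}

// Mimics ReflectionUtils.newInstance: construct via the nullary
// constructor, then hand over the configuration through setConf.
def newInstance[T <: Configurable](clazz: Class[T], conf: Map[String, String]): T = {
  val instance = clazz.getDeclaredConstructor().newInstance()
  instance.setConf(conf)
  instance
}

// The point of the change: instantiate each InputFormat with the
// *extracted* per-channel configuration, not the shared one.
val extractedConf = Map("hbase.table" -> "my_table")
val format = newInstance(classOf[TableLikeFormat], extractedConf)
println(format.table) // my_table
```

A format that ignores `setConf` would work either way, which is why the bug only surfaces with configuration-sensitive formats.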
@rmellgren - on a side note, are you able to share/contribute the code that implements a Scoobi DataSource for TableInputFormat?
And to answer your previous questions, yes, 0.7.0 only targets 2.10.
Thanks, I'm going to take a closer look at this issue.
@blever sure, I have one for DBInputFormat too, though I haven't tested that one yet. How do you want it?
A PR. Probably add a directory such as src/main/scala/com/nicta/scoobi/io/table and/or src/main/scala/com/nicta/scoobi/io/db and place your additions there. Bonus points if you have any Specs too :)
It'd need deps added for HBase, which pulls in a ton of things - should I add that too? Also, sorry, no Specs, though I can look at writing a couple if you have some examples.
That's true and a good point. HBase is kind of separate from the Hadoop project, which means we can't make assumptions about particular JARs being available in certain places, etc., which always complicates launching and running.
If you're happy to contribute we have 2 choices:

1. Add the code to the core Scoobi project; or
2. Create a separate scoobi-hbase project.

The latter may be a better choice and is similar to what was done for the MongoDB support - https://github.com/mongodb/mongo-hadoop/tree/master/scoobi
However, in both cases, I think the directory structure of the new files will be the same. So, if you made a PR targeting it as an addition to the core scoobi project, we can make the call later.
Does that sound reasonable?
Okay, I'll try to get some time later this week to do a new project; maybe prod me if I forget.
Just a PR against the existing Scoobi project would be enough for now. Thanks!
Hi Ross, I have now incorporated your changes into the 0.7.0 codebase (please ignore the commit mess on this issue; the relevant commit is c58f7fc).
cool, thanks!
When an InputFormat is initialized using ReflectionUtils it's passed a Configuration which gets handed to the InputFormat via setConf. For the ChannelInputFormat, that Configuration is not always correct. This works out most of the time because it becomes correct by the time getSplits is called, and so InputFormats which don't bother to consult the Configuration passed in setConf work out just fine.
However, TableInputFormat from HBase is an InputFormat which does care.
The places where initialization happens without a configuration, or with the wrong one, are ChannelInputFormat#getSplits and ChannelRecordReader#originalRR.
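A tiny sketch of why one path works by accident and the other fails. Again, these are hypothetical stand-ins (a `Map` plays the role of Hadoop's `Configuration`; `CachingFormat` stands in for a format like TableInputFormat that caches the configuration from `setConf` and consults it later):

```scala
// Hypothetical stand-in for Hadoop's Configurable.
trait Configurable {
  def setConf(conf: Map[String, String]): Unit
}

// Stand-in for a format that caches the configuration from setConf
// and reads it later, the way TableInputFormat does.
class CachingFormat extends Configurable {
  private var conf: Map[String, String] = null
  def setConf(c: Map[String, String]): Unit = conf = c
  def tableName: String = conf("hbase.table") // NPEs if setConf never ran
}

// Path 1: nullary construction only, so setConf is never called and the
// first use of the cached configuration blows up.
val viaNullary = new CachingFormat
val failed =
  try { viaNullary.tableName; false }
  catch { case _: NullPointerException => true }

// Path 2: construction followed by setConf with the extracted channel
// configuration, so the format finds what it needs.
val viaExtracted = new CachingFormat
viaExtracted.setConf(Map("hbase.table" -> "my_table"))

println(failed)                 // true
println(viaExtracted.tableName) // my_table
```

A format that never reads the cached configuration takes both paths without complaint, which matches the "works out most of the time" behaviour described above.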
I've patched this in a branch; here's the commit:
https://github.com/paytronix/scoobi/commit/da65f8f25aaa11690e3ac8d08d91e74ac36adc6a
For me, this allows HBase TableInputFormat to be used as a Scoobi DataSource.