branky / cascading.hive

Provide support for reading/writing data in Hive native file format in Cascading.

Problems reading from a partitioned Hive table #5

Closed by galarragas 10 years ago

galarragas commented 10 years ago

I am using the HCatTap to read from a partitioned Hive table. The table is partitioned into this pattern of paths:

hdfs://nameservice1/datasets/nowtv/mpp/mpp_order_report/p=<partition>/

where every directory contains a file named MPP-CONSOLIDATED-OrderReport-.osv, giving the following example path:

hdfs://nameservice1/datasets/nowtv/mpp/mpp_order_report/p=20100501/MPP-CONSOLIDATED-OrderReport-.osv

I am getting this error

Caused by: java.io.IOException: Not a file: hdfs://nameservice1/datasets/nowtv/mpp/mpp_order_report/p=20100501
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:212)
    at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:200)
    at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:134)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1106)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1098)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:177)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:995)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:948)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:948)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:922)
    at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:105)
    at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
    at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

Looking at the code, I see that the HCatTap gets the location of every partition and passes it to the MultiSourceTap. However, the actual source files sit one level below, inside the partition directory, so the input format is handed a directory where it expects a file.
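To make the mismatch concrete, here is a minimal sketch of the idea of expanding each partition location into the files it contains before handing them to a source. It is not the code from the pull request and does not use the cascading.hive or Hadoop APIs: `PartitionFileLister` and `expandPartitions` are hypothetical names, and `java.nio.file` on a local filesystem stands in for HDFS purely for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration only; not part of cascading.hive. */
final class PartitionFileLister {

    private PartitionFileLister() {}

    /**
     * Replaces every directory in partitionLocations with the regular files
     * directly inside it. Locations that are not directories are kept as-is.
     * Assumption: data files sit exactly one level below the partition directory.
     */
    static List<Path> expandPartitions(List<Path> partitionLocations) throws IOException {
        List<Path> inputs = new ArrayList<>();
        for (Path location : partitionLocations) {
            if (Files.isDirectory(location)) {
                // Files.list is not recursive; the stream must be closed after use.
                try (Stream<Path> children = Files.list(location)) {
                    inputs.addAll(children
                            .filter(Files::isRegularFile)
                            .sorted()
                            .collect(Collectors.toList()));
                }
            } else {
                inputs.add(location);
            }
        }
        return inputs;
    }
}
```

With a layout like the one above, passing the `p=20100501` directory to `expandPartitions` would return the path of the `.osv` file inside it instead of the directory itself. An equivalent for HDFS would need the Hadoop filesystem API rather than `java.nio.file`.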

branky commented 10 years ago

Yes, this is a known issue. If your fix works well, feel free to send a pull request. Thanks for your contribution!

galarragas commented 10 years ago

So, it looks like I have a fix (I'm testing it locally with a test job). I'm planning to write a proper unit test and open the pull request soon.

galarragas commented 10 years ago

I did the pull request. Let me know if you want me to improve the solution. It is currently working for my use case and for the unit test I added

branky commented 10 years ago

Unit test is perfect! Thank you.

galarragas commented 10 years ago

Thanks a lot. I actually had problems recreating a scenario with multiple partitions because I wasn't able to store the locations of the different partitions in the db. I ended up using a table with a single partition. If you know how to configure the db properly, it would be a nice extension to the test.