LinkedInAttic / Cubert

Fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop
http://linkedin.github.io/Cubert/
Apache License 2.0

Cannot read partitioned avro files #4

Closed jarutis closed 9 years ago

jarutis commented 9 years ago

Hi,

I've tried loading avro files with the following structure:

/path/to/avro/daily/year=2014/month=12/day=05/country=de/de-r-00000.avro

Using the following script:

JOB "job1"
        REDUCERS 50;
        MAP {
                input = LOAD "/path/to/avro" USING AVRO;
        }
...
END

But I get the following error:

[Dependency Analyzer] Program inputs: [/path/to/avro]

Cannot compile cubert script. Exiting.
java.lang.RuntimeException: java.io.IOException: there are no files in /path/to/avro
    at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.exitProgram(DependencyAnalyzer.java:277)
    at com.linkedin.cubert.analyzer.physical.PhysicalPlanWalker.walk(PhysicalPlanWalker.java:75)
    at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.rewrite(DependencyAnalyzer.java:91)
    at com.linkedin.cubert.ScriptExecutor.rewrite(ScriptExecutor.java:319)
    at com.linkedin.cubert.ScriptExecutor.main(ScriptExecutor.java:481)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.io.IOException: there are no files in /path/to/avro
    at com.linkedin.cubert.utils.AvroUtils.getSchema(AvroUtils.java:71)
    at com.linkedin.cubert.io.avro.AvroStorage.getPostCondition(AvroStorage.java:109)
    at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.getPostCondition(DependencyAnalyzer.java:309)
    at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.exitProgram(DependencyAnalyzer.java:262)
    ... 9 more

Everything works perfectly fine if I load the de-r-00000.avro file directly, but not if I point to the directory that contains the partitions.

mvarshney commented 9 years ago

Hi Jonas,

Cubert does not currently support Hive-style partitioned folder organization. If you wish to read all data, you can try:

load "/path/to/avro/daily/year=*/month=*/day=*/country=*" using AVRO;
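
For illustration, here is a minimal local-filesystem sketch of why a path with one `*` wildcard per partition level can succeed where the top-level directory fails. It rests on two assumptions that this thread does not confirm: that the loader only looks for files directly inside each path it is given, and that wildcards are expanded glob-style first. Python's `glob` module stands in for Hadoop path globbing, and `build_layout`, `files_directly_in` and `expand` are hypothetical helpers, not part of Cubert.

```python
import glob
import os
import tempfile


def build_layout(root):
    """Create a local stand-in for the partitioned layout from the question."""
    leaf = os.path.join(
        root, "daily", "year=2014", "month=12", "day=05", "country=de"
    )
    os.makedirs(leaf)
    path = os.path.join(leaf, "de-r-00000.avro")
    with open(path, "wb"):
        pass  # an empty placeholder file is enough for this sketch
    return path


def files_directly_in(directory):
    """List regular files directly inside `directory`, without recursing."""
    return sorted(
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


def expand(pattern):
    """Expand a glob pattern into the sorted list of matching paths."""
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as root:
    build_layout(root)

    # Assumed behaviour: a flat listing of the top-level directory
    # sees only the "daily" subdirectory, so it finds no files.
    top_level = files_directly_in(root)

    # One wildcard per partition level resolves to the leaf directories.
    pattern = os.path.join(
        root, "daily", "year=*", "month=*", "day=*", "country=*"
    )
    leaf_dirs = expand(pattern)
    leaf_files = [name for d in leaf_dirs for name in files_directly_in(d)]

    print(top_level)   # []
    print(leaf_files)  # ['de-r-00000.avro']
```

The same pattern can be narrowed to a single partition by replacing a wildcard with a literal value, for example `year=2014` in place of `year=*`.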

Hope that helps.

Best, -Maneesh

On Sun, Dec 7, 2014 at 7:30 AM, Jonas Jarutis notifications@github.com wrote:


jarutis commented 9 years ago

This works, thanks.