Cascading / lingual

Stand-alone ANSI SQL for Cascading on Apache Hadoop
http://www.cascading.org/lingual/

java.lang.AssertionError: table may be null; expr may not #10

Closed adersberger closed 10 years ago

adersberger commented 10 years ago

I'm trying to set up a simple log-analysis application based on Cascading & Lingual, but I keep running into the following error:

2013-12-28 00:06:31,899 INFO  tap.TapSchema (TapSchema.java:addTapTableFor(125)) - adding table on schema: null, table: LOG_ENTRIES, fields: 'ip', 'time', 'method', 'event', 'status', 'size' | String, String, String, String, String, String, identifier: log_analysis_tcsv.txt
2013-12-28 00:06:31,905 INFO  tap.TapSchema (TapSchema.java:addTapTableFor(125)) - adding table on schema: null, table: results, fields: 'ip', identifier: log_analysis_ips.txt

java.lang.AssertionError: table may be null; expr may not
    at net.hydromatic.optiq.prepare.OptiqPrepareImpl$RelOptTableImpl.<init>(OptiqPrepareImpl.java:806)
    at net.hydromatic.optiq.prepare.OptiqPrepareImpl$RelOptTableImpl.<init>(OptiqPrepareImpl.java:784)
...
    at net.hydromatic.optiq.prepare.OptiqPrepareImpl.prepareSql(OptiqPrepareImpl.java:195)
    at cascading.lingual.flow.SQLPlanner.resolveTails(SQLPlanner.java:111)
    at cascading.flow.planner.FlowPlanner.resolveAssemblyPlanners(FlowPlanner.java:150)

My application code:

Tap inTap = getPlatform().getTap( new SQLTypedTextDelimited( ",", "\"" ), IN_PATH, SinkMode.KEEP );

Tap outTap = getPlatform().getTap( new SQLTypedTextDelimited( new Fields( "ip" ), ",", "\"" ), OUT_PATH, SinkMode.REPLACE );

// define and execute the flow
FlowDef flowDef = FlowDef.flowDef()
    .addSource( "LOG_ENTRIES", inTap )
    .addSink( "results", outTap );

String statement = "SELECT DISTINCT ip FROM LOG_ENTRIES";

SQLPlanner sqlPlanner = new SQLPlanner().setSql( statement );
flowDef.addAssemblyPlanner( sqlPlanner );

getPlatform().getFlowConnector().connect( flowDef ).complete();

fs111 commented 10 years ago

Hi,

  • Can you tell me which platform this is? local or hadoop?
  • Can you include the catalog invocations you used to describe the schema and table?

Thx!

joeposner commented 10 years ago

I suspect you're running with assertions enabled. This is the default in IntelliJ and possibly other IDEs.

Make sure you're not passing the "-ea" flag to your JVM and you should be fine.
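As an aside, there's a standard plain-Java idiom (nothing Lingual-specific) for checking at runtime whether the JVM was started with assertions on — the assignment inside the `assert` only executes when `-ea` is set:

```java
public class AssertionCheck {
    /** Standard idiom: the assignment inside the assert only runs when -ea is set. */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true; // side effect happens only with assertions enabled
        return enabled;
    }

    public static void main(String[] args) {
        System.out.println("assertions " + (assertionsEnabled() ? "enabled" : "disabled"));
    }
}
```

Running this from your IDE launch configuration will tell you whether `-ea` is being passed without digging through the run settings.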


adersberger commented 10 years ago
cascading.flow.planner.PlannerException: could not build flow from assembly: [Index: 2, Size: 2]
    at cascading.flow.planner.FlowPlanner.handleExceptionDuringPlanning(FlowPlanner.java:576)
    at cascading.flow.local.planner.LocalPlanner.buildFlow(LocalPlanner.java:108)
    at cascading.flow.local.planner.LocalPlanner.buildFlow(LocalPlanner.java:40)
    at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
    at TestLogAnalyzer.LogSqlAnalyzer(TestLogAnalyzer.java:101)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at cascading.platform.PlatformRunner.runChild(PlatformRunner.java:295)
    at cascading.platform.PlatformRunner.runChild(PlatformRunner.java:61)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
    at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:202)
    at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:65)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at net.hydromatic.optiq.prepare.OptiqPrepareImpl.getColumnMetaDataList(OptiqPrepareImpl.java:400)
    at net.hydromatic.optiq.prepare.OptiqPrepareImpl.prepare2_(OptiqPrepareImpl.java:357)
    at net.hydromatic.optiq.prepare.OptiqPrepareImpl.prepare_(OptiqPrepareImpl.java:250)
    at net.hydromatic.optiq.prepare.OptiqPrepareImpl.prepareSql(OptiqPrepareImpl.java:195)
    at cascading.lingual.flow.SQLPlanner.resolveTails(SQLPlanner.java:111)
    at cascading.flow.planner.FlowPlanner.resolveAssemblyPlanners(FlowPlanner.java:150)
    at cascading.flow.planner.FlowPlanner.resolveTails(FlowPlanner.java:137)
    at cascading.flow.local.planner.LocalPlanner.buildFlow(LocalPlanner.java:81)
    ... 41 more

When I debug into this, the root cause seems to be in the method getColumnMetaDataList of class OptiqPrepareImpl. In my case the list of origins contains two elements ("IP" and "LOG_ENTRIES"), but the part of the method that fills the columns tries to access a third entry.

cwensel commented 10 years ago

If you can convert this into a unit test and offer up a pull request, we can more easily resolve the issue.

adersberger commented 10 years ago

I created a gist with a JUnit test that reproduces this issue: https://gist.github.com/adersberger/cc11feef4edf82123bc5 Thx a lot!

joeposner commented 10 years ago

The contents of the file you're trying to read don't match the pattern you're configuring SQLTypedTextDelimited with.

You're telling the parser that the file you're trying to read contains comma-delimited text with a header line that describes the field names and contents. But your file is a raw Apache log, which is space-delimited and has no header line. If you look at the example file "lingual-examples/src/resources/main/data/example/employee.tcsv" you can see the format you're telling Lingual to expect.

You've got two options to resolve this: 1) Modify the file with a Cascading flow prior to passing it to Lingual. 2) Pass different parameters to SQLTypedTextDelimited. The constructor SQLTypedTextDelimited( Fields fields, String delimiter, String quote, boolean header, boolean strict, boolean safe ) gives you the most flexibility. Set header=false (since you have no header line), and set strict=false, safe=false for the most tolerant parsing. You'll have to be explicit in the Fields about what data types to expect, since there's no header line carrying that info.

Given weblogs in general, for production use I'd strongly suggest going with (1) and doing a file cleanup first. It's practically inevitable that at some point someone ran a bot against your site, so you likely have a few lines where the URL contains binary data, or some other unexpected variance in the data. When reading text files, Lingual, as a JDBC layer, assumes the text is tightly organized. That won't be the case with a production Apache log.
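To make option (1) concrete, here is a rough standalone sketch of the cleanup step (plain Java, no Lingual involved; the class name and the Common Log Format regex are assumptions for illustration, not anything from Cascading): parse each raw line up front into the comma-delimited ip/time/method/event/status/size layout the LOG_ENTRIES table above expects, and drop lines that don't match rather than letting one bot request poison the whole job.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineCleaner {
    // Common Log Format: host ident authuser [date] "method uri protocol" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) \\S+\" (\\d{3}) (\\S+)");

    /** Returns a comma-delimited line (ip,time,method,event,status,size), or null if unparsable. */
    public static String toCsv(String logLine) {
        Matcher m = CLF.matcher(logLine);
        if (!m.find())
            return null; // skip malformed lines (bot traffic, binary junk) instead of failing
        return String.join(",", m.group(1), m.group(2), m.group(3),
                                m.group(4), m.group(5), m.group(6));
    }

    public static void main(String[] args) {
        System.out.println(toCsv(
            "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /apache_pb.gif HTTP/1.0\" 200 2326"));
        // → 127.0.0.1,10/Oct/2000:13:55:36 -0700,GET,/apache_pb.gif,200,2326
    }
}
```

In a real pipeline the same parse-and-filter logic would sit inside a Cascading flow (e.g. behind a regex-based function) that writes the intermediate delimited file Lingual then reads.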

The error message you're getting is, admittedly, less than clear, but that's somewhat of a chicken-and-egg problem with this issue. Lingual can't give you guidance about what it can't parse, since it's expected to parse the first line of the file (and is failing to) to get the info about what it's working with.


joeposner commented 10 years ago

I should have clarified: I can see from your gist that you have code that does save the file to an intermediate file but I'm not sure that's working as you're expecting. But the error you're getting is one that I recognize from cases where I've set up mismatched files and parsers.

I'll take a closer look at the code and see exactly where the mismatch is occurring.


joeposner commented 10 years ago

There are a couple of things you'll want to change. I've enclosed a variant of your TestLogProcessing class with those changes that produces the results you'd expect.


adersberger commented 10 years ago

Thanks a lot! After associating the table with a schema and typing all Fields, everything works. Two things could make Lingual more robust against this: 1) a proper error/warning message when a table is associated with the root schema, and 2) a validate() method on SQLTypedTextDelimited that checks whether all Fields are typed. I'll try to submit a patch.
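The proposed validate() could amount to something like the following standalone sketch (class and method names are hypothetical, not Lingual API; Lingual's Fields carries names and types together, modeled here as parallel arrays): fail fast, with a field-level message, whenever a declared field is missing its type.

```java
public class FieldValidation {
    /** Hypothetical check: every declared field must carry a concrete type. */
    static void validateTyped(String[] names, Class<?>[] types) {
        if (types == null || types.length != names.length)
            throw new IllegalStateException("all fields must be typed; got "
                + (types == null ? 0 : types.length) + " types for " + names.length + " fields");
        for (int i = 0; i < names.length; i++)
            if (types[i] == null)
                throw new IllegalStateException("field '" + names[i] + "' has no type");
    }

    public static void main(String[] args) {
        validateTyped(new String[]{"ip"}, new Class<?>[]{String.class}); // passes silently
    }
}
```

Called from the tap's constructor (or just before planning), this would surface the misconfiguration as a clear message instead of the downstream AssertionError.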