brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Improve auto-detection errors #494

Closed philrz closed 4 years ago

philrz commented 4 years ago

Autodetection is working as intended in terms of identifying input formats. However, it isn't providing very helpful feedback upon failure. The reason is that it tries each format in turn and returns a generic "malformed input" error if none succeeds, discarding the more specific error from each reader.

Here's an example of how autodetection occludes the useful error message:

19:03 ~/work/looky/zq(no-dupe-record-fields)
$ cat !$
cat ~/tmp/stuff.zson
#0:record[foo:record[bar:string,bar:string]]
0:[["1";"2";]]
19:03 ~/work/looky/zq(no-dupe-record-fields)
$ zq ~/tmp/stuff.zson 
malformed input
19:03 ~/work/looky/zq(no-dupe-record-fields)
$ zq -i zng ~/tmp/stuff.zson 
line 1: duplicate fields in record type

A possible solution here would be:

Other options:

These are non-exclusive: a good solution might combine elements of more than one of these. And there might of course be other ways to improve this.

philrz commented 4 years ago

Verified in zq commit 407b5f6. Using an invalid input file, we now see the error output for every input format attempted during auto-detect.

# cat stuff 
#0:record[foo:record[bar:string,bar:string]]
0:[["1";"2";]]

# zq stuff
stuff: format detection error
    tzng: line 1: duplicate fields in record type
    zeek: line 2: bad types/fields definition in zeek header
    ndjson: line 1: Unknown value type
    zjson: line 1: invalid character '#' looking for beginning of value
    zng: zng descriptor out of range

Thanks @henridf!

henridf commented 4 years ago

For a future reader coming back across this issue: I went with only this part of the solution proposed in the issue description.

The autodetector should keep each reader's error, and upon failure of all readers, if any error is a non-Syntax error, that one should be used.

I did not attempt to make the distinction between syntax and semantic errors, figuring that this was already a step forward, and after some soaking time we can revisit and improve this if necessary.