Improve auto-detection errors

philrz commented 4 years ago

Autodetection itself is working as intended: it picks the right format when the input is valid. However, it doesn't give very helpful feedback on failure, because it tries each format in turn and returns a generic "malformed input" error if none succeeds.

Here's an example of how autodetection occludes the useful error message:

19:03 ~/work/looky/zq(no-dupe-record-fields)
$ cat !$
cat ~/tmp/stuff.zson
#0:record[foo:record[bar:string,bar:string]]
0:[["1";"2";]]
19:03 ~/work/looky/zq(no-dupe-record-fields)
$ zq ~/tmp/stuff.zson 
malformed input
19:03 ~/work/looky/zq(no-dupe-record-fields)
$ zq -i zng ~/tmp/stuff.zson 
line 1: duplicate fields in record type

A possible solution here would be:

Each reader needs to distinguish between syntax errors (signaled via the SyntaxError err) and higher-level semantic errors such as duplicate fields (above), or a value not matching its record type, etc. We may already be doing this perfectly but should check as part of this issue.
The autodetector should keep each reader's error, and upon failure of all readers, if any error is a non-Syntax error, that one should be used.

Other options:

Collect all errors, and output a summary (“ndjson error: x, zeek error: y, …”)
Use the file extension as a hint of which candidate reader’s error to return. For example, if the filename is foo.zng, then use the zng reader’s error message. This works better for some formats than others, for example .json could be ndjson or zjson; .log could be ndjson or zeek.

These are non-exclusive: a good solution might combine elements of more than one of these. And there might of course be other ways to improve this.

philrz commented 4 years ago

Verified in zq commit 407b5f6. Using an invalid input file, we now see the error output for every input format attempted during auto-detect.

# cat stuff 
#0:record[foo:record[bar:string,bar:string]]
0:[["1";"2";]]

# zq stuff
stuff: format detection error
    tzng: line 1: duplicate fields in record type
    zeek: line 2: bad types/fields definition in zeek header
    ndjson: line 1: Unknown value type
    zjson: line 1: invalid character '#' looking for beginning of value
    zng: zng descriptor out of range

Thanks @henridf!

henridf commented 4 years ago

For a future reader coming back across this issue: I went with only this part of the solution proposed in the issue description:

The autodetector should keep each reader's error, and upon failure of all readers, if any error is a non-Syntax error, that one should be used.

I did not attempt to make the distinction between syntax and semantic errors, figuring that this was already a step forward, and that after some soaking time we can revisit and improve this if necessary.

brimdata / super

Improve auto-detection errors #494