Open philrz opened 1 year ago
Here's a crude example of implementing the proposal using existing building blocks.
$ zq -version
Version: v1.7.0-41-g58e7993d
$ cat messages.ndjson
{"message": "One"}
{"message": "Two"}
Message
{"message": "Three"}
$ zq -i json messages.ndjson
messages.ndjson: invalid character 'M' looking for beginning of value
$ zq -z -i line 'yield (parse_zson(this) == null) ? error({failed_to_parse: this, at: now()}): parse_zson(this)' messages.ndjson
{message:"One"}
{message:"Two"}
error({failed_to_parse:"Message",at:2023-04-25T19:05:49.367519Z})
{message:"Three"}
There was another recent request for this functionality in https://github.com/brimdata/zui/issues/2933. The user made a couple of suggestions that might be worth considering in our design here:
$ zq -version
Version: v1.12.0-3-gec5165f0
$ zq imdb.csv
imdb.csv: record on line 6: wrong number of fields
Note the changes over the years with "partial loads" as captured in https://github.com/brimdata/zui/issues/2660.
At the time this issue is being opened, Zed is at commit 58e7993.
We've had a few community issues that speak to a desire for fault-tolerant data input, e.g., when a parse error is encountered while reading one of Zed's supported formats, skip over the "bad data" and continue reading more "good" data when possible.
Indeed, we can see how this could be quite handy in some use cases, such as a user with a large amount of data that has a parse error buried deep inside it somewhere. Their goal might be to find a "needle in a haystack": if a quick search over the non-corrupt parts reveals what they're looking for, they're done, so having to pause and make the data 100% clean just to read it in and start searching is a hindrance. A quick survey of other tools shows that many CSV/JSON readers do offer options to skip over parsing errors, so Zed could offer a similar option for readers where it makes sense.
In a group discussion, a novel approach was proposed where Zed could turn the "bad data" into `error` values that include the bad data itself and a timestamp. As an alternative to just dropping the data as many other tools might do, this would give the user an easier way to see what got skipped and why, and perhaps still run crude searches against it or even clean it up and commit the data as new values.

As suggested in a comment in #4514, once we have this functionality it would probably be helpful to surface these errors to clients like Zui in a way that draws the user's attention to their presence & count when they happen. Follow-on issues to deal with that may be opened once this base functionality exists.