Fault tolerant data input

philrz commented 1 year ago

At the time this issue is being opened, Zed is at commit 58e7993.

We've had a few community issues that speak to a desire for fault tolerant data input, i.e., when a parse error is encountered while reading one of Zed's supported formats, skip over the "bad data" and continue reading more "good" data when possible.

JSON - https://github.com/brimdata/zed/issues/4106
CSV - https://github.com/brimdata/zed/issues/4514
Format unspecified - https://github.com/brimdata/zui/issues/2756

We can see how this could be quite handy in some use cases, such as a user with a large amount of data that has a parse error somewhere deep inside it. Their goal might be to find a "needle in a haystack". If a quick search of the non-corrupt parts reveals what they're looking for, they're done, so having to pause and make the data 100% clean just to read it in and start searching is a hindrance. A quick survey of other tools shows that many CSV/JSON readers have options to skip over parsing errors, so Zed could offer a similar option for the readers where it makes sense.

In a group discussion, a novel approach was proposed where Zed could turn the "bad data" into error values that include the bad data itself and a timestamp. As an alternative to just dropping the data as many other tools might do, this would give the user an easier way to see what got skipped and why, and perhaps still do crude searches against it or even clean it up and commit the data as new values.

As suggested in a comment in #4514, once we have this functionality it would probably be helpful to surface these errors to clients like Zui in a way that draws the user's attention to their presence and count when they occur. Follow-on issues to deal with that may be opened once this base functionality exists.

philrz commented 1 year ago

Here's a crude example of implementing the proposal using existing building blocks.

$ zq -version
Version: v1.7.0-41-g58e7993d

$ cat messages.ndjson 
{"message": "One"}
{"message": "Two"}
Message
{"message": "Three"}

$ zq -i json messages.ndjson 
messages.ndjson: invalid character 'M' looking for beginning of value

$ zq -z -i line 'yield (parse_zson(this) == null) ? error({failed_to_parse: this, at: now()}): parse_zson(this)' messages.ndjson 
{message:"One"}
{message:"Two"}
error({failed_to_parse:"Message",at:2023-04-25T19:05:49.367519Z})
{message:"Three"}
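
For readers who want to see the same per-line idea outside of Zed, here is a minimal Python sketch. It is only an illustration of the proposal, not how Zed implements or would implement it: the function name `read_tolerant` and the `error`/`failed_to_parse`/`reason`/`at` field names are assumptions chosen to mirror the output above.

```python
import json
from datetime import datetime, timezone


def read_tolerant(lines):
    """Yield one value per non-blank line; unparseable lines become error records."""
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            continue  # ignore blank lines
        try:
            yield json.loads(text)
        except json.JSONDecodeError as exc:
            # Hypothetical error shape: keep the bad data, the reason, and a timestamp.
            yield {
                "error": {
                    "failed_to_parse": text,
                    "reason": exc.msg,
                    "at": datetime.now(timezone.utc).isoformat(),
                }
            }


# Same four lines as messages.ndjson above.
sample = [
    '{"message": "One"}\n',
    '{"message": "Two"}\n',
    "Message\n",
    '{"message": "Three"}\n',
]
results = list(read_tolerant(sample))
for value in results:
    print(value)
```

As in the `zq` example, the reader keeps going after the bad line, so the value from the fourth line is still produced.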

philrz commented 11 months ago

There was another recent request for this functionality in https://github.com/brimdata/zui/issues/2933. The user made a couple of suggestions that might be worth considering in our design here:

Including the line number from an input file in the error value could be very helpful if the user wants to go back and fix the syntax error in the data source. We already have line numbers in some error messages in the existing reader, e.g., with the test data from #4514:

$ zq -version
Version: v1.12.0-3-gec5165f0

$ zq imdb.csv 
imdb.csv: record on line 6: wrong number of fields
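
To illustrate what carrying the line number into the error value might look like, here is a rough Python sketch using the standard `csv` module. The function name `read_csv_tolerant`, the error field names, and the sample data are all made up for illustration (this is not the `imdb.csv` test data, and not Zed's reader).

```python
import csv
import io


def read_csv_tolerant(stream):
    """Yield a dict per well-formed row and an error record per malformed row."""
    reader = csv.reader(stream)
    header = next(reader)
    for row in reader:
        if len(row) != len(header):
            # reader.line_num is the number of source lines read so far,
            # so it points at the line where the bad record ended.
            yield {
                "error": {
                    "line": reader.line_num,
                    "reason": "wrong number of fields",
                    "failed_to_parse": row,
                }
            }
        else:
            yield dict(zip(header, row))


# Hypothetical sample: line 3 has one field too many.
data = "title,year\nAlien,1979\nHeat,1995,extra\nFargo,1996\n"
rows = list(read_csv_tolerant(io.StringIO(data)))
for row in rows:
    print(row)
```

With the line number stored in the error value, a user could search for the errors first and then go straight to the offending lines in the source file.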

It's somewhat orthogonal, but when the data is being imported to a Zed lake, as an alternative to having the errors become values in the pool, they could potentially be redirected to a separate pool. I've heard other community users make similar suggestions in other contexts, e.g., where to send the debug output described in #4487. UNIX shells allow stderr and stdout to be redirected to different destinations or merged into one output stream, so perhaps Zed could echo this approach since it would be familiar to most users.
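
The stdout/stderr analogy can be sketched in a few lines of Python. This is only a thought experiment: the `route` function, the list-based "sinks" standing in for pools, and the convention that an error is a dict with an `error` key are all assumptions for illustration, not an existing Zed interface.

```python
def is_error(value):
    """Hypothetical convention: an error value is a dict with an "error" key."""
    return isinstance(value, dict) and "error" in value


def route(values, good_sink, error_sink=None):
    """Send good values and error values to separate sinks.

    If error_sink is omitted, errors are merged into good_sink,
    similar in spirit to `2>&1` in a shell.
    """
    if error_sink is None:
        error_sink = good_sink
    for value in values:
        (error_sink if is_error(value) else good_sink).append(value)


values = [
    {"message": "One"},
    {"error": {"failed_to_parse": "Message"}},
    {"message": "Three"},
]

# Separate destinations, like redirecting stdout and stderr to different files.
main_pool, error_pool = [], []
route(values, main_pool, error_pool)

# One merged destination, preserving the original order.
merged_pool = []
route(values, merged_pool)
```

Making the merged stream the default and the separate destination opt-in is just one possible design choice here.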

philrz commented 9 months ago

Note the changes over the years with "partial loads" as captured in https://github.com/brimdata/zui/issues/2660.

Fault tolerant data input #4546