brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks source link

Tolerate duplicate fields in NDJSON records #2523

Closed philrz closed 3 years ago

philrz commented 3 years ago

Repro is with zq commit 99477b6.

Consider the following NDJSON:

$ cat has_dup.ndjson 
{"foo": 0}
{"foo": 1, "bar": 2, "foo": 3}
{"foo": 2}

At the moment, the presence of the duplicate field in the middle record causes zq to stop processing the stream entirely.

$ zq -f ndjson -i ndjson has_dup.ndjson 
has_dup.ndjson: line 2: field foo is repeated

When digging for background on the topic, I found my way to the JSON RFC 7159 which covers it pretty clearly:

The names within an object SHOULD be unique.
...
When the names within an object are not unique, the behavior of software that receives such an object is unpredictable.  Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates.

By this definition, if we simply skipped over records that contain such duplicates we'd be in-line with the RFC behavior. However, in terms of what's best for the user, we should definitely not be bailing out on the whole stream. For instance, we bumped into this problem in brimdata/brimcap#13 where Suricata v6 was generating output with such duplicate fields. In that case it was causing the entire Zed processing of the log stream to fail and effectively broke the ability to import pcap data via Brimcap. Being tolerant of the input would have been an improvement.

To compare with another tool, jq happens to quietly tolerate the presence of the duplicated field, apparently taking the approach of reporting "the last name/value pair only".

$ cat has_dup.ndjson | jq -c .
{"foo":0}
{"foo":3,"bar":2}
{"foo":2}

I'd advocate that we adapt the same behavior of jq, since it's tolerant and would help us work around the problem described in brimdata/brimcap#13. As a team we also discussed printing a warning when this happens.

fixes https://github.com/brimdata/brimcap/issues/13

mccanne commented 3 years ago

@philrz it's a bit easier to take the first duplicate instead of the last, which is what I implemented in the PR, but we can change it to take the last if you think this is important.

philrz commented 3 years ago

@mccanne: Sounds good. Since the only place we've encountered this so far is when the field values were wholly duplicate, I'm not too worried. If users bump into this a lot and feel strongly about offering variations, we can always revisit.

philrz commented 3 years ago

Verified in zq commit b70cbfb.

Repeating the original repro steps:

$ zq -version
Version: v0.29.0-314-gb70cbfb2

$ cat has_dup.ndjson 
{"foo": 0}
{"foo": 1, "bar": 2, "foo": 3}
{"foo": 2}

$ zq -f ndjson -i ndjson has_dup.ndjson
{"foo":0}
{"bar":2,"foo":1}
{"foo":2}

As noted by @mccanne above, the current implementation opts to keep the first reference (rather than the last one like jq does). But per the opening remarks, it's not like the spec speaks of one true behavior, so we should be fine with this approach, at least at first. That's especially true for the one specific place we've bumped into this, which is with Suricata bug 4016 where the value is also duplicated, so picking "first" or "last" would both address the symptom.

Thanks @mccanne!

philrz commented 1 year ago

We recently had some discussion about if there's more that the Zed tools could do to help users tame incoming JSON data that have these duplicate keys (see #4609). That caused me to notice that we've actually changed a bit in our behavior since what was shown in the last comment above when this issue was verified.

The "first value wins" treatment shown in the verification notes remained with us up through Zed commit b9ad3b1.

$ zq -version
Version: v0.33.0-209-gb9ad3b14

$ zq has_dup.ndjson
{foo:0}
{foo:1,bar:2}
{foo:2}

However, starting with & since the next commit 77c28b5 (which is associated with the changes in #3631), the treatment is now "last value wins".

$ zq -version
Version: v0.33.0-210-g77c28b59

$ zq has_dup.ndjson
{foo:0}
{foo:3,bar:2}
{foo:2}

This is probably for the best since that matches how other tools like jq treat such data.

$ cat has_dup.ndjson | jq -c .
{"foo":0}
{"foo":3,"bar":2}
{"foo":2}