Closed by @philrz 3 years ago
@philrz it's a bit easier to take the first duplicate instead of the last, which is what I implemented in the PR, but we can change it to take the last if you think this is important.
@mccanne: Sounds good. Since the only place we've encountered this so far is when the field values were wholly duplicate, I'm not too worried. If users bump into this a lot and feel strongly about offering variations, we can always revisit.
Verified in zq commit b70cbfb. Repeating the original repro steps:
$ zq -version
Version: v0.29.0-314-gb70cbfb2
$ cat has_dup.ndjson
{"foo": 0}
{"foo": 1, "bar": 2, "foo": 3}
{"foo": 2}
$ zq -f ndjson -i ndjson has_dup.ndjson
{"foo":0}
{"bar":2,"foo":1}
{"foo":2}
As noted by @mccanne above, the current implementation opts to keep the first value (rather than the last one, as jq does). But per the opening remarks, the spec doesn't mandate a single correct behavior, so we should be fine with this approach, at least at first. That's especially true for the one place we've bumped into this so far, Suricata bug 4016, where the value is also duplicated, so picking "first" or "last" would both address the symptom.
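For anyone who wants to experiment with the two policies outside of zq, here's a small sketch (illustration only, not Zed code) using Python's standard `json` module, whose `object_pairs_hook` hands the parser's raw key/value pairs, duplicates included, to a callback:

```python
import json

text = '{"foo": 1, "bar": 2, "foo": 3}'

def first_wins(pairs):
    out = {}
    for key, value in pairs:
        out.setdefault(key, value)  # ignore later duplicates
    return out

def last_wins(pairs):
    return dict(pairs)  # later duplicates overwrite earlier ones

print(json.loads(text, object_pairs_hook=first_wins))  # {'foo': 1, 'bar': 2}
print(json.loads(text, object_pairs_hook=last_wins))   # {'foo': 3, 'bar': 2}
```

The "first wins" variant matches the values zq produced at the time of this verification; "last wins" matches jq.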
Thanks @mccanne!
We recently had some discussion about whether there's more the Zed tools could do to help users tame incoming JSON data that has these duplicate keys (see #4609). That caused me to notice that our behavior has actually changed since what was shown in the comment above when this issue was verified.
The "first value wins" treatment shown in the verification notes remained with us up through Zed commit b9ad3b1.
$ zq -version
Version: v0.33.0-209-gb9ad3b14
$ zq has_dup.ndjson
{foo:0}
{foo:1,bar:2}
{foo:2}
However, starting with the very next commit 77c28b5 (which is associated with the changes in #3631), the treatment is "last value wins".
$ zq -version
Version: v0.33.0-210-g77c28b59
$ zq has_dup.ndjson
{foo:0}
{foo:3,bar:2}
{foo:2}
This is probably for the best, since it matches how other tools like jq treat such data.
$ cat has_dup.ndjson | jq -c .
{"foo":0}
{"foo":3,"bar":2}
{"foo":2}
Repro is with zq commit 99477b6. Consider the following NDJSON:

{"foo": 0}
{"foo": 1, "bar": 2, "foo": 3}
{"foo": 2}

At the moment, the presence of the duplicate field in the middle record causes zq to stop processing the stream entirely.

When digging for background on the topic, I found my way to the JSON RFC 7159, which covers it pretty clearly: when the names within an object are not unique, the behavior of software that receives such an object is unpredictable. An implementation may report the last name/value pair only, report an error or fail to parse the object, or report all of the name/value pairs, including duplicates.
By this definition, if we simply skipped over records that contain such duplicates we'd be in line with the RFC. However, in terms of what's best for the user, we should definitely not be bailing out on the whole stream. For instance, we bumped into this problem in brimdata/brimcap#13, where Suricata v6 was generating output with such duplicate fields. That caused the entire Zed processing of the log stream to fail, effectively breaking the ability to import pcap data via Brimcap. Being tolerant of the input would have been an improvement.
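The "skip the bad record, keep the stream going" idea can be sketched outside of Zed. This is a hypothetical illustration in Python (not Zed code), using `json`'s `object_pairs_hook` to implement the RFC's "report an error" option per record while still processing the rest of the NDJSON stream:

```python
import json

def reject_dupes(pairs):
    # Raise on any duplicated key, per the RFC's "report an error" option.
    seen = set()
    for key, _ in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen.add(key)
    return dict(pairs)

lines = ['{"foo": 0}', '{"foo": 1, "bar": 2, "foo": 3}', '{"foo": 2}']
for line in lines:
    try:
        print(json.loads(line, object_pairs_hook=reject_dupes))
    except ValueError as err:
        # Skip just this record instead of failing the whole stream.
        print("skipping record:", err)
```

The point is that the error is contained per record: the middle record is dropped, but the first and last records still come through.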
To compare with another tool, jq happens to quietly tolerate the presence of the duplicated field, apparently taking the approach of reporting "the last name/value pair only".

I'd advocate that we adopt the same behavior as jq, since it's tolerant and would help us work around the problem described in brimdata/brimcap#13. As a team we also discussed printing a warning when this happens.

fixes https://github.com/brimdata/brimcap/issues/13
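The "last value wins, but warn" idea we discussed could look something like this sketch (hypothetical illustration in Python, not the Zed implementation):

```python
import json
import sys

def last_wins_with_warning(pairs):
    # Tolerate duplicates like jq (last value wins), but tell the
    # user that an earlier value was discarded.
    out = {}
    for key, value in pairs:
        if key in out:
            print(f"warning: duplicate key {key!r}; keeping last value",
                  file=sys.stderr)
        out[key] = value
    return out

rec = json.loads('{"foo": 1, "bar": 2, "foo": 3}',
                 object_pairs_hook=last_wins_with_warning)
print(rec)  # {'foo': 3, 'bar': 2}
```

The warning goes to stderr so that tools consuming the output on stdout are unaffected.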