benbernard / RecordStream

commandline tools for slicing and dicing JSON records.

Optionally preserve "natural" key ordering from sources with headers #43

Open tsibley opened 10 years ago

tsibley commented 10 years ago

When the source of a record stream supports some natural key ordering, it'd be nice to optionally support retaining that order. Both recs-fromcsv and recs-fromsplit support the --header option which could preserve the field order as found in the first line of the input. With the natural order retained, various stream output operations can preferentially use it when no specific fields are specified, i.e. a bare ... | recs-tocsv or ... | recs-totable could use it but ... | recs-tocsv -k foo,bar wouldn't. This feature would make general filtering of data sets easier by removing the need to track external to the pipeline what fields you got at input to ensure they're output again.

Ideally it'd be general enough to be applied to any input operation and also be possible to add to records ad-hoc via recs-xform or similar operations for use later in the pipeline.

Since there's no stream-level metadata, we're limited to stashing this ordering information on each record, perhaps under a key like __field_order or __fields. Output operations can examine the first record for the stashed order. It's not the prettiest solution technically, so I'd love if someone had a better idea.
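A rough sketch of the per-record stash, in Python for brevity rather than the Perl that recs is written in. The function names from_csv and to_csv are made up for illustration and are not the real recs-fromcsv / recs-tocsv code; __fields is one of the key names floated above. The input side stashes the header order on every record, and the output side looks at the first record only when no keys are given explicitly.

```python
import csv
import io
import json

FIELD_ORDER_KEY = "__fields"  # hypothetical reserved key, as floated in this comment


def from_csv(text):
    """Yield one record per CSV row, stashing the header order on each record."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    for row in reader:
        record = dict(zip(header, row))
        record[FIELD_ORDER_KEY] = header
        yield record


def to_csv(records, keys=None):
    """Render records as CSV. Explicit keys win; otherwise use the stashed order."""
    records = list(records)
    if not records:
        return ""
    if keys is None:
        first = records[0]
        # Fall back to sorted keys when no order was stashed.
        keys = first.get(FIELD_ORDER_KEY) or sorted(
            k for k in first if k != FIELD_ORDER_KEY
        )
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(keys)
    for record in records:
        writer.writerow([record.get(k, "") for k in keys])
    return out.getvalue()


source = "zebra,apple,mango\n1,2,3\n4,5,6\n"
# sort_keys=True stands in for a serializer that does not keep key order.
wire = [json.dumps(r, sort_keys=True) for r in from_csv(source)]
records = [json.loads(line) for line in wire]
print(to_csv(records))                           # header order from the stash
print(to_csv(records, keys=["apple", "zebra"]))  # explicit keys override it
```

The cost is visible in the wire format: every line carries the full field list, which is the stream-size concern raised later in the thread.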

Does this seem reasonable?

benbernard commented 10 years ago

It seems like a good idea... @amling and I have discussed a bunch of different approaches. I'm not entirely thrilled with the idea of a pass-through field, but I'm okay with it... The other idea we came up with, which we've thought about implementing but have never gotten around to, is an out-of-band communication bus, like a socket or something, that would let all the processes in a record stream pipe chain communicate with each other... But that is fairly complicated to get working in an easy fashion.

I'd support something like __recordstream_config as the key name in the first record, and we can put whatever data structure we want under that key (including field_order or other keys).

I do think that some good features (like this one) need this type of communication, we've been stymied several times without it...

tsibley commented 10 years ago

Nod, I tend to agree. It might be difficult to manage the lifetime of the communication bus relative to all the processes in a pipeline too, especially if there's a point when earlier procs which created the bus are done writing output and exit before a subsequent piece of the pipeline wants to read from the bus. Race conditions seem to abound unless done carefully, and it loses the metadata if output is directed to a temporary file to be processed later instead of immediately forwarded to another proc.

Using __recordstream_config seems reasonable to me, but it'll need to be on all records, not just the first. Otherwise there's the danger of filtering out the first record. Even so, there remain a few complications preserving __recordstream_config through, for example, a recs-collate too.

benbernard commented 10 years ago

yeah, I figured we could do something like just have it on the first record, if present, then modify it however this script wants to and put it on the first output record... you could also have a recs-runwithoutconfig or something that would run an arbitrary unix process assuming JSON in and out, and undecorate, then redecorate on the other side, so then you could do something like recs-runwithoutconfig grep recordstream and be sure to not get a result for the first line.
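To make the undecorate/redecorate idea concrete, here is a minimal Python sketch. It is an illustration under assumptions, not recs code: the function names are hypothetical, and a plain Python callable stands in for the arbitrary Unix process that recs-runwithoutconfig would spawn.

```python
import json

CONFIG_KEY = "__recordstream_config"  # key name suggested earlier in the thread


def undecorate(lines):
    """Return (config, plain_lines); the config is taken off the first record only."""
    config = None
    plain = []
    for i, line in enumerate(lines):
        record = json.loads(line)
        if i == 0:
            config = record.pop(CONFIG_KEY, None)
        plain.append(json.dumps(record))
    return config, plain


def redecorate(config, lines):
    """Attach the config to whichever record comes first in the output."""
    lines = list(lines)
    if config is not None and lines:
        first = json.loads(lines[0])
        first[CONFIG_KEY] = config
        lines[0] = json.dumps(first)
    return lines


def run_without_config(lines, transform):
    """Strip the config, run a JSON-lines-in/JSON-lines-out transform, restore it."""
    config, plain = undecorate(lines)
    return redecorate(config, transform(plain))


def grep_recordstream(lines):
    # Stand-in for an external `grep recordstream`.
    return [line for line in lines if "recordstream" in line]


stream = [
    json.dumps({"name": "a", CONFIG_KEY: {"field_order": ["name"]}}),
    json.dumps({"name": "recordstream"}),
]
for line in run_without_config(stream, grep_recordstream):
    print(line)
```

In this example the first input record no longer matches the grep just because of its config key, and the config still ends up on the first surviving output record.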

I'm not in love with this idea, but I really don't like adding the key to every record, because of stream size and re-serialization and all that, but I could be convinced...

Talked to @amling about this yesterday, and he seemed to be more in favor of doing some kind of versioning of the streams with an initial line that was just the record stream config info, but I'm not in love with this either (especially considering the very normal recs... | grep construct that it would break).

I'm getting more convinced that out of band communication is a good idea...

tsibley commented 10 years ago

Oh, I see what you mean by only adding it to the first record and each command playing hot potato to preserve it.

All of the options so far leave a somewhat unpleasant aftertaste. Adding a header to the stream (just another JSON record, presumably, but with some sort of special flag) has its benefits, but it also corrupts the simplicity of "one record per line, every line is an input record, period." Maybe that's not such a problem though... FWIW, I'm more likely to use recs-grep in a stream than normal grep, if solely so I can target the precise field I care about.

Out of band communication could work really well with a bit of effort, but the non-starter for me is that you lose it as soon as the output is to a file or other non-recs process. I use intermediate cache files pretty often.

Hrm.

benbernard commented 10 years ago

Yeah, I've convinced myself that implementing the out-of-band communication is possible, but I'm no longer convinced that it will be useful given your idea/desire to have it work through writing to a file.

I was thinking maybe enable it with a flag and put it on each record, maybe default it with an env variable that people can set, but that stuff gets cumbersome quickly, and it's a fairly ugly solution, but maybe okay given all the constraints.

tsibley commented 10 years ago

Nod. The more I consider it, the more an optional stream header — just another JSON record, but tagged either by a reserved key ("__recs_config": true) or prefixed "magic string" (recs:{ … }) — makes sense to me.

Benefits:

  • It doesn't pollute each record.
  • It stays with the stream.
  • It's logically per-stream instead of appearing per-record but only being respected for the first record.
  • It's pretty unambiguous.

Downsides:

  • Enabling it by default may break existing pipelines with non-recs components.
  • The stream "format" becomes a little more complicated.

I imagine it would be enabled by a flag, at least initially, in the various recs-from* operations. All of the other operations can be aware of it and pass it along unmolested if it exists in the stream. By default, adding it can be disabled.

A new operation (recs-withoutmetadata, recs-withoutconfig, or recs-withoutheader, or s/without/no/ for any of those) can be made for the times when you want to pipe through something external and be absolutely sure the header doesn't interfere.

That's my current 2¢.

benbernard commented 10 years ago

Yeah, I think this approach makes a lot of sense, though I'd like to get amling's buy-in.

(Sent by email, in reply to Thomas Sibley's comment of Apr 10, 2014, 7:26 PM.)

tsibley commented 10 years ago

Sounds good to me!
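For reference, the reserved-key variant of the stream header agreed on above could look roughly like this. It is a Python sketch under assumptions: the function names are hypothetical, the field_order entry inside the header is borrowed from the earlier __recordstream_config suggestion, and nothing here is existing recs behavior.

```python
import json

HEADER_FLAG = "__recs_config"  # reserved key from the proposal above


def is_header(obj):
    """A header is a JSON object carrying the reserved flag set to true."""
    return isinstance(obj, dict) and obj.get(HEADER_FLAG) is True


def write_stream(records, field_order=None):
    """Yield JSON lines; emit a tagged header line first only when asked to."""
    if field_order is not None:
        yield json.dumps({HEADER_FLAG: True, "field_order": list(field_order)})
    for record in records:
        yield json.dumps(record)


def read_stream(lines):
    """Return (header_or_None, records). Only the first line may be a header."""
    header = None
    records = []
    for i, line in enumerate(lines):
        obj = json.loads(line)
        if i == 0 and is_header(obj):
            header = obj
        else:
            records.append(obj)
    return header, records


def without_header(lines):
    """Drop the header line if present (the hypothetical recs-withoutheader)."""
    for i, line in enumerate(lines):
        if i == 0 and is_header(json.loads(line)):
            continue
        yield line


lines = list(write_stream([{"b": 1, "a": 2}], field_order=["b", "a"]))
header, records = read_stream(lines)
print(header)
print(records)
print(list(without_header(lines)))
```

Because the header is opt-in on the writing side and ignored when absent on the reading side, a stream without it stays "one record per line" exactly as before.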

tsibley commented 8 years ago

Thinking about this some more recently. Instead of taking the route of explicit stream metadata (and the wide scope of work that entails), I wonder if we should merely preserve key order for records. Hash::Ordered may be quite a reasonable Perl-side option, for example. I'm not sure what the landscape looks like for Perl-to-JSON and vice versa.
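As a point of comparison only, this is what order preservation across a parse/serialize round trip looks like in Python, where dicts keep insertion order (3.7+) and the json module follows it. It says nothing about how the Perl JSON modules behave; roundtrip is a made-up helper, and sort_keys stands in for a serializer that does not keep the original order.

```python
import json


def roundtrip(line, preserve_order=True):
    """Parse one JSON record and serialize it again.

    With preserve_order=False the keys are sorted on output, which is a
    stand-in for an unordered hash that loses the source order.
    """
    record = json.loads(line)  # dict keeps the key order found in the input
    return json.dumps(record, sort_keys=not preserve_order)


line = '{"zebra": 1, "apple": 2, "mango": 3}'
print(roundtrip(line))                        # keys come back in source order
print(roundtrip(line, preserve_order=False))  # keys come back sorted
```

If every operation in a pipeline round-trips records this way, the header order from the first operation survives to the last one without any stream metadata.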

It would be slower, although how much relative to the rest of recs I don't know. I imagine preserving ordering would be opt-in anyway, either by command line flags or an environment variable (more useful given global implications in a pipeline of recs commands).

Thoughts?

benbernard commented 8 years ago

I'm totally happy to have an order-preserving hash semantic for keys, if we can figure out the JSON side of it (combined with the Perl side).

Would probably be best to have it optional, but I could see enabling it generically too.