benbernard / RecordStream

commandline tools for slicing and dicing JSON records.

Multiplex output to files? #59

Open tsibley opened 9 years ago

tsibley commented 9 years ago

Occasionally I reach for `recs multiplex` when I want to split a record stream into multiple files. For example, `recs multiplex -k foo -- recs-tocsv` works great, except that all of the CSV output goes to stdout. When I'm using an output format without a distinct marker to split on, I usually work around this limitation using some combination of recs piped to `parallel` running the `recs to...` command. `recs chain` and `recs generate` seem like they would almost allow me to multiplex to separate files, but either `generate` needs to support outputting non-records or `chain` needs to support some sort of interpolation like `generate` (ick).

In terms of supporting this feature, I see two options:

  1. Build support into multiplex itself. Something like `--output-filename-key=<keyspec>` or `--output-filename=<snippet>` would name the file that each group's output is written to. The filename key or evaluated snippet would be added to the set of keys records are grouped upon.
  2. Add a new operation which enables use of the existing multiplex to do this, for example: `recs multiplex -k foo -- recs-tofiles -k filename -- recs-tocsv`

I think option one is cleaner than option two, both in terms of implementation and command line syntax. Option two however is implementable outside of core recs.
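
To make the intended behavior of option one concrete, here is a rough Python sketch of it (not RecordStream code). The helper name `multiplex_to_csv_files` and the `<key>-<value>.csv` naming scheme are assumptions made for this illustration only, and group values are assumed to be safe to use in filenames.

```python
import csv
import json
import os
import tempfile
from collections import defaultdict


def multiplex_to_csv_files(lines, key, out_dir):
    """Group JSON records by `key` and write one CSV file per group.

    Hypothetical helper: the naming scheme is an assumption for this
    sketch, not something recs does today.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the stream
        record = json.loads(line)
        groups[str(record.get(key, ""))].append(record)

    written = {}
    for value, records in groups.items():
        # Column order: field names in order of first appearance in the group.
        fields = []
        for record in records:
            for name in record:
                if name not in fields:
                    fields.append(name)
        path = os.path.join(out_dir, f"{key}-{value}.csv")
        with open(path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fields)
            writer.writeheader()
            writer.writerows(records)
        written[value] = path
    return written


if __name__ == "__main__":
    sample = [
        '{"foo": "a", "n": 1}',
        '{"foo": "b", "n": 2}',
        '{"foo": "a", "n": 3}',
    ]
    with tempfile.TemporaryDirectory() as tmp:
        result = multiplex_to_csv_files(sample, "foo", tmp)
        for value, path in sorted(result.items()):
            print(value, os.path.basename(path))
```

The point of the sketch is that the grouping key and the output filename come from the same place, which is why folding this into multiplex avoids a second grouping step.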

Is this feature worth having in core recs? General thoughts?

benbernard commented 9 years ago

Hmmm I'm cool with having it in core recs, just not certain what the interface should be... I think it seems reasonable to me to have multiplex be able to do it...

Another option would be to add a -o flag to all recs commands, like --filename-key, that lets you output to a named file (which seems reasonable) and then let multiplex be able to interpolate command names based on a clumping record...

Would be cool to have the latter, but the former is much more usable. I'll also ping @amling to see what he thinks

amling commented 9 years ago

I'm also not sure of the right combination of primitives to pull this off, but here are some ideas:

Something we've thought about previously was having line output commands take a `--records` option to output single-key records (with a "LINE" key or the like) instead. Inside multiplex you'd then get your bucket stamped back onto those records, so `recs-multiplex -k foo -- recs-tocsv --records` would output records with a "foo" field and a "LINE" field.

That's sort of the minimum of multiplex+tocsv not destroying the data. After that the best primitive to sort into files is not very clear, especially because you want to sort into files by the "foo" field but then also eval down to the "LINE" field. That alone doesn't seem like a great primitive, but maybe it would be OK? `recs-tofiles --file` would write records (not what you want here but we'd allow it), `--line` would write the evaluation of the snippet. End-to-end this makes it:

recs-multiplex -k foo -- recs-tocsv --records | recs-tofiles --file '{{foo}}' --line '{{LINE}}'

We could also split --file into --file-key (-f) and --file-eval (-F) and likewise --line into --line-key (-l) and --line-eval (-L):

recs-multiplex -k foo -- recs-tocsv --records | recs-tofiles -f foo -l LINE

Keith

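The tofiles idea above could be sketched roughly as follows in Python (an illustration of the proposed behavior only, not RecordStream code). The function name `write_lines_to_files` is hypothetical, and the sketch assumes each incoming record already carries a bucket field such as "foo" and a "LINE" field, as the proposed `--records` output would.

```python
import os
import tempfile


def write_lines_to_files(records, file_key, line_key, out_dir):
    """Write each record's `line_key` value to the file named by `file_key`.

    Hypothetical helper mirroring the proposed -f/-l options; one file
    handle is kept open per distinct `file_key` value.
    """
    handles = {}
    try:
        for record in records:
            name = str(record[file_key])
            if name not in handles:
                handles[name] = open(os.path.join(out_dir, name), "w")
            handles[name].write(str(record[line_key]) + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


if __name__ == "__main__":
    # Records shaped like the proposed `recs-tocsv --records` output.
    records = [
        {"foo": "a", "LINE": "a,1"},
        {"foo": "b", "LINE": "b,2"},
        {"foo": "a", "LINE": "a,3"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        for name in write_lines_to_files(records, "foo", "LINE", tmp):
            with open(os.path.join(tmp, name)) as handle:
                print(name, handle.read().splitlines())
```

Note that this step has to group by "foo" a second time, which is the duplication of multiplex's clumping that makes it a questionable primitive.
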
benbernard commented 9 years ago

Keith and I talked about this for a long while today... we think probably the best thing to do is to build it into multiplex...

recs-multiplex -k foo -o foo -- recs-tocsv

would output to a file named foo-FOO_VALUE for each clump, with the output of tocsv

Similarly you could use -O to provide evalable Perl to generate the filename:

recs-multiplex -k foo -O '"myawesomefile-{{foo}}.recs"'
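
The two filename rules proposed here can be sketched in Python as below. This is only an illustration: the function names are hypothetical, and the template version handles plain `{{key}}` substitution on top-level keys, not Perl evaluation or nested keyspecs as the real -O option presumably would.

```python
import re


def filename_from_key(record, key):
    """Mimic the proposed -o behavior: '<key>-<value>' for the clump."""
    return f"{key}-{record[key]}"


def filename_from_template(record, template):
    """Mimic the proposed -O behavior, limited to {{key}} substitution."""
    return re.sub(
        r"\{\{(.+?)\}\}",
        lambda match: str(record[match.group(1)]),
        template,
    )


if __name__ == "__main__":
    clump = {"foo": "bar"}
    print(filename_from_key(clump, "foo"))
    print(filename_from_template(clump, "myawesomefile-{{foo}}.recs"))
```

For a clump where foo is "bar", the first call yields foo-bar and the second yields myawesomefile-bar.recs.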

We thought about tofiles for a long time, but in the end it just seemed to be duplicating multiplex clumping without much value....

Thoughts?

tsibley commented 9 years ago

Sounds good! I agree about duplicating the multiplex clumping without much value, and that's why I also had settled on option one instead of option two when thinking this through.

Unless you or Keith have a burning desire to implement this, I'll probably take a swing at it in the next few weeks.