tsibley opened 9 years ago
Hmmm, I'm cool with having it in core recs, just not certain what the interface should be... I think it seems reasonable to have multiplex be able to do it...
Another option would be to add a `-o` flag to all recs commands (like `--filename-key`) that lets you output to a named file, which seems reasonable, and then let multiplex interpolate command names based on a clumping record...
Would be cool to have the latter, but the former is much more usable. I'll also ping @amling to see what he thinks
I'm also not sure of the right combination of primitives to pull this off, but here are some ideas.

Something we've thought about previously was having line-output commands take a `--records` flag to output single-key (`"LINE"` or the like) records instead. This means inside multiplex you'd get your bucket stamped back on, so

`recs-multiplex -k foo -- recs-tocsv --records`

would output records with a `"foo"` field and a `"LINE"` field.

That's sort of the minimum of multiplex + tocsv not destroying the data. After that, the best primitive to sort into files is not very clear, especially because you want to sort into files by the `"foo"` field but then also eval down to the `"LINE"` field. That alone doesn't seem like a great primitive, but maybe it would be OK? `recs-tofiles --file`
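To make the `--records` idea concrete, here is a minimal Python sketch of the data flow (recs itself is Perl; the function names `tocsv_records` and `multiplex` are hypothetical stand-ins, not the real implementation). The inner command emits single-key `{"LINE": ...}` records, and multiplex stamps the clump key back onto each one:

```python
import csv, io, json

def tocsv_records(records, fields):
    # Hypothetical `recs-tocsv --records` behavior: emit one
    # single-key {"LINE": ...} record per CSV line instead of raw text.
    for rec in records:
        buf = io.StringIO()
        csv.writer(buf).writerow([rec.get(f, "") for f in fields])
        yield {"LINE": buf.getvalue().rstrip("\r\n")}

def multiplex(records, key, inner):
    # Hypothetical multiplex: run `inner` once per clump of `key`,
    # then stamp the clump value back onto each output record.
    clumps = {}
    for rec in records:
        clumps.setdefault(rec[key], []).append(rec)
    for value, clump in clumps.items():
        for out in inner(clump):
            yield {key: value, **out}

recs = [{"foo": "a", "x": 1}, {"foo": "b", "x": 2}, {"foo": "a", "x": 3}]
for out in multiplex(recs, "foo", lambda c: tocsv_records(c, ["x"])):
    print(json.dumps(out))  # e.g. {"foo": "a", "LINE": "1"}
```

Each output record keeps both the clump identity (`"foo"`) and the formatted line (`"LINE"`), which is exactly the "not destroying the data" property described above.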
Keith and I talked about this for a long while today... we think probably the best thing to do is to build it into multiplex:

`recs-multiplex -k foo -o foo -- recs-tocsv`

would write the tocsv output for each clump to a file named `foo-FOO_VALUE`. Similarly, you could use `-O` to provide evalable Perl to generate the filename:

`recs-multiplex -k foo -O '"myawesomefile-{{foo}}.recs"'`

We thought about tofiles for a long time, but in the end it just seemed to be duplicating multiplex clumping without much value...
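A rough Python sketch of the proposed `-o` behavior, under stated assumptions: recs is Perl, the inner command here is reduced to a trivial CSV formatter, and `multiplex_to_files`/`name_for` are hypothetical names standing in for multiplex plus its `-o`/`-O` filename logic:

```python
import csv, os, tempfile

def multiplex_to_files(records, key, name_for):
    # Sketch of the proposal: run the inner command (here, a trivial
    # CSV formatter over the "x" field) once per clump of `key`,
    # sending each clump's output to its own file. `name_for` plays
    # the role of -o (fixed prefix) or -O (evalable snippet).
    clumps = {}
    for rec in records:
        clumps.setdefault(rec[key], []).append(rec)
    for value, clump in clumps.items():
        with open(name_for(value), "w", newline="") as fh:
            w = csv.writer(fh)
            for rec in clump:
                w.writerow([rec["x"]])

recs = [{"foo": "a", "x": 1}, {"foo": "b", "x": 2}, {"foo": "a", "x": 3}]
outdir = tempfile.mkdtemp()
# `-o foo` -> one file per clump, named foo-FOO_VALUE
multiplex_to_files(recs, "foo", lambda v: os.path.join(outdir, f"foo-{v}"))
print(sorted(os.listdir(outdir)))  # ['foo-a', 'foo-b']
```

The `-O` variant would differ only in how `name_for` is built: an arbitrary snippet evaluated against the clump record rather than a `key-value` prefix.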
Thoughts?
Sounds good! I agree that it would duplicate the multiplex clumping without much value, which is why I'd also settled on option one instead of option two when thinking this through.
Unless you or Keith have a burning desire to implement this, I'll probably take a swing at it in the next few weeks.
Occasionally I reach for `recs multiplex` when I want to split a record stream into multiple files. For example, `recs multiplex -k foo -- recs-tocsv` works great, except all of the CSV output goes to stdout. When I'm using an output format without a distinct marker to split on, I usually work around this limitation using some combination of `recs` piped to `parallel` running the `recs to...` command. `recs chain` and `recs generate` seem like they would almost allow me to multiplex to separate files, but either `generate` needs to support outputting non-records or `chain` needs to support some sort of interpolation like `generate` (ick).

In terms of supporting this feature, I see two options:
1. Build it into `multiplex` itself. Something like `--output-filename-key=<keyspec>` or `--output-filename=<snippet>` naming the file to which output is written for each group. The filename key or evaluated snippet would be added to the set of keys records are grouped upon.
2. A separate command used inside `multiplex` to do this, for example: `recs multiplex -k foo -- recs-tofiles -k filename -- recs-tocsv`
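Option two's primitive can be sketched in a few lines of Python (hedged: `tofiles` here is a hypothetical stand-in for the proposed `recs-tofiles -k filename`, and the record format is simplified to JSON lines). It is a sink that routes each record to the file named by one of its fields, consuming that field on the way out:

```python
import json, os, tempfile

def tofiles(records, key):
    # Sketch of a hypothetical `recs-tofiles -k <key>`: append each
    # record to the file named by its `key` field, dropping that
    # field from the written record.
    handles = {}
    try:
        for rec in records:
            name = rec.pop(key)
            if name not in handles:
                handles[name] = open(name, "a")
            handles[name].write(json.dumps(rec) + "\n")
    finally:
        for fh in handles.values():
            fh.close()

outdir = tempfile.mkdtemp()
recs = [
    {"filename": os.path.join(outdir, "a.recs"), "x": 1},
    {"filename": os.path.join(outdir, "b.recs"), "x": 2},
]
tofiles(recs, "filename")
print(sorted(os.listdir(outdir)))  # ['a.recs', 'b.recs']
```

Note how little this does beyond re-grouping records by a field, which is the "duplicating multiplex clumping" concern raised above.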
I think option one is cleaner than option two, both in terms of implementation and command line syntax. Option two however is implementable outside of core recs.
Is this feature worth having in core recs? General thoughts?