Open plainas opened 5 years ago
Is there a way to do this already?
I can't think of any simple way. But if `xsv sort` grew a flag to shuffle the rows (analogous to `sort`'s `-R`/`--random-sort` flag), then it would be a simple matter of a shuffle followed by `xsv slice`.
I am tempted to use other command line tools to achieve this by partitioning lines rather than csv rows. Is there a way to escape new lines inside values so I ensure that each line of output is exactly one CSV row?
No. Not without layering your own encoding on top of CSV. If you need to handle arbitrary CSV data, then using other command line tools won't work. If you can guarantee that all CSV records occupy a single line, then other line oriented tools would work okay.
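To make the multi-line point concrete, here is a small Python sketch (not part of xsv) that compares what a CSV parser sees with what a line-oriented tool sees. The helper name `records_are_single_line` and the sample data are made up for illustration; it uses only the standard `csv` and `io` modules.

```python
import csv
import io


def records_are_single_line(text: str) -> bool:
    """Return True if no field in the CSV text contains an embedded line break.

    Hypothetical helper: if this returns True, line-oriented tools should
    see exactly one record per line.
    """
    # newline="" keeps line breaks inside quoted fields untranslated.
    reader = csv.reader(io.StringIO(text, newline=""))
    return all(
        "\n" not in field and "\r" not in field
        for row in reader
        for field in row
    )


# One header plus two records, but the first record spans two physical lines.
sample = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

rows = list(csv.reader(io.StringIO(sample, newline="")))
physical_lines = sample.splitlines()

print(len(rows), "CSV rows vs", len(physical_lines), "physical lines")
print("safe for line-oriented tools:", records_are_single_line(sample))
```

Here the parser yields 3 rows while a line split yields 4 lines, so splitting on lines would cut the first record in half.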
@plainas This may or may not help but a while ago I wrote a separate tool for doing this: https://github.com/sd2k/ttv
You can compose it with `xsv` if desired, e.g. if you need to select columns.
@sd2k Neat tool, although it doesn't look like it correctly supports CSV data? I don't see any CSV parsing happening in that tool. (A single CSV record can span an arbitrary number of lines.)
Ah, I misread the initial description. You're right, that tool is completely naive when it comes to nested newlines. It could potentially be 'upgraded' if there's a need for it!
There definitely is :)
Y'all might consider my suggested implementation strategy. There's really no need for a separate tool for the stated use case. That is, all you need to do is add random sorting to `xsv sort`. Once you have that, you can dice it up any way you want. It should be fairly easy to implement using `rand`'s shuffle routine. PRs are welcome.
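For anyone who needs a train/test split in the meantime, the shuffle-then-slice idea can be sketched in a few lines of Python. This is not how xsv would implement it: it swaps in Python's `random.shuffle` for the Rust `rand` crate, loads everything into memory, and assumes the first row is a header. The names `shuffle_split`, `test_fraction`, and `seed` are hypothetical.

```python
import csv
import io
import random


def shuffle_split(text, test_fraction=0.2, seed=None):
    """Shuffle CSV records, then slice them into (header, train, test).

    Assumptions: the first row is a header, and the data fits in memory.
    Records are shuffled as parsed rows, so a record that spans several
    physical lines stays intact.
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, [])
    records = list(reader)

    # Shuffle whole records, not lines.
    random.Random(seed).shuffle(records)

    # Slice: the first n_test shuffled records become the test set.
    n_test = round(len(records) * test_fraction)
    return header, records[n_test:], records[:n_test]


# Ten records; record 0 has a line break inside a quoted field.
data = 'id,text\n0,"two\nlines"\n' + "".join(f"{i},row{i}\n" for i in range(1, 10))

header, train, test = shuffle_split(data, test_fraction=0.2, seed=42)
print(header, len(train), len(test))
```

Passing a fixed `seed` makes the split reproducible, which is usually what you want when comparing models on the same test set.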
For those of us working in machine learning, a feature to quickly divide a data set into training data and test data would be really nice to have. Is there a way to do this already?
I am tempted to use other command line tools to achieve this by partitioning lines rather than csv rows. Is there a way to escape new lines inside values so I ensure that each line of output is exactly one CSV row?