Closed by tastyminerals 2 years ago
My bad, xsv started 2 years before that one.
Yeah, exactly, not "sad" at all given that tsv-utils didn't exist back then.
Moreover, I wanted to work with csv, not tsv. tsv-utils does come with its own csv2tsv
utility, so I'd have to run that first:
[andrew@duff openpolicing]$ time xsv count MA-clean.csv
3418298
real 0.449
user 0.411
sys 0.037
maxmem 5 MB
faults 0
[andrew@duff openpolicing]$ time csv2tsv MA-clean.csv | wc -l
3418299
real 0.548
user 0.421
sys 0.114
maxmem 5 MB
faults 0
real 0.548
user 0.008
sys 0.064
maxmem 5 MB
faults 0
So yeah, it looks decently fast.
While I don't know how tsv-utils works, it's worth noting that TSV data is commonly assumed to contain no quoted fields and no delimiters inside the data at all:
By itself, using different field delimiters is not especially significant. Far more important is the approach to delimiters occurring in the data. CSV uses an escape syntax to represent commas and newlines in the data. TSV takes a different approach, disallowing TABs and newlines in the data.
And thus:
By contrast, parsing TSV data is simple.
And indeed, this is a categorical difference and opens up many more optimization techniques that simply aren't available for csv data.
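To make the difference concrete, here is a rough Python sketch. It is not how xsv or tsv-utils are implemented, and the helper names (parse_tsv_line, parse_csv_text) are made up for illustration. It only shows that a TSV line can be split on tabs, while CSV needs a real parser to handle quoted commas and newlines.

```python
import csv
import io

def parse_tsv_line(line):
    # Assumes fields never contain tabs or newlines, so a plain split is enough.
    return line.rstrip("\n").split("\t")

def parse_csv_text(text):
    # CSV needs a real parser: quotes can hide commas and newlines inside a field.
    return list(csv.reader(io.StringIO(text, newline="")))

# One TSV record: a single split recovers all three fields.
tsv_row = parse_tsv_line("1\tBoston, MA\tspeeding\n")

# One CSV record that spans two physical lines because of a quoted newline.
csv_rows = parse_csv_text('1,"Boston, MA","line one\nline two"\n')

# Naively splitting CSV on commas breaks the quoted field apart.
naive = '1,"Boston, MA",speeding'.split(",")

print(tsv_row)
print(csv_rows)
print(naive)
```

The naive split at the end yields four pieces instead of three, which is the kind of mistake a TSV-only fast path never has to guard against.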
So, you might say, "well, just convert all your CSV files to TSV." Nope. Non-starter. CSVs can encode any kind of arbitrary data. But as mentioned above, TSVs as implemented by tsv-utils cannot. They cannot handle cases where tabs and newlines occur inside a field. Even its csv2tsv tool admits to this, and will replace those characters with spaces by default, thereby making the conversion lossy.
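As a small illustration of why that replacement is lossy, here is a Python sketch that imitates the behavior described above (tabs and newlines inside fields become spaces). It is not csv2tsv's actual code; the function name csv_to_tsv_lossy and the sample data are assumptions made up for the example.

```python
import csv
import io

def csv_to_tsv_lossy(csv_text, replacement=" "):
    # Imitates the described default: tabs and newlines inside a field are
    # replaced so that every record fits on one tab-delimited line.
    out = []
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        cleaned = [
            field.replace("\t", replacement).replace("\n", replacement)
            for field in row
        ]
        out.append("\t".join(cleaned))
    return "\n".join(out) + "\n"

# Two different CSV inputs: one field contains a tab, the other a space.
with_tab = 'id,note\n1,"left\tright"\n'
with_space = 'id,note\n1,"left right"\n'

tsv_a = csv_to_tsv_lossy(with_tab)
tsv_b = csv_to_tsv_lossy(with_space)

# Distinct inputs collapse to the same output, so the original can't be recovered.
print(tsv_a == tsv_b)
```

Once both inputs map to the same TSV, there is no way to tell from the output which one you started with.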
So tsv-utils is great if you have TSV data. I did not. I had CSV and that CSV was not clean and nice. tsv-utils, even if it existed back then, wouldn't have worked for me.
I see. It makes sense now. Many things tend to look simple only at first glance.
Indeed. I've learned that same lesson many times.
It's a bit sad to hear that you couldn't find a tool to handle 40GB of CSV data. Did you actually try https://github.com/eBay/tsv-utils?
What's more, try and benchmark it against xsv ;)