BurntSushi / xsv

A fast CSV command line toolkit written in Rust.
The Unlicense
10.23k stars, 317 forks

On motivation #302

Closed: tastyminerals closed 2 years ago

tastyminerals commented 2 years ago

It's a bit sad to hear that you couldn't find a tool to handle 40GB of CSV data. Did you actually try https://github.com/eBay/tsv-utils?

What's more, try and benchmark it against xsv ;)

tastyminerals commented 2 years ago

My bad, xsv started 2 years before that one.

BurntSushi commented 2 years ago

Yeah, exactly, not "sad" at all given that tsv-utils didn't exist back then.

Moreover, I wanted to work with csv, not tsv. tsv-utils does come with its own csv2tsv utility, so I'd have to run that first:

[andrew@duff openpolicing]$ time xsv count MA-clean.csv
3418298

real    0.449
user    0.411
sys     0.037
maxmem  5 MB
faults  0
[andrew@duff openpolicing]$ time csv2tsv MA-clean.csv | wc -l
3418299

real    0.548
user    0.421
sys     0.114
maxmem  5 MB
faults  0

real    0.548
user    0.008
sys     0.064
maxmem  5 MB
faults  0

So yeah, it looks decently fast.

While I don't know how tsv-utils works, it's worth noting that tsv data is commonly assumed to have no quoting and to simply not contain delimiters inside fields at all:

"By itself, using different field delimiters is not especially significant. Far more important is the approach to delimiters occurring in the data. CSV uses an escape syntax to represent commas and newlines in the data. TSV takes a different approach, disallowing TABs and newlines in the data."

And thus:

"By contrast, parsing TSV data is simple."

And indeed, this is a categorical difference and opens up many more optimization techniques that simply aren't available for csv data.
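To make that difference concrete, here is a minimal Python sketch. It is not how xsv or tsv-utils are implemented. It assumes the restricted TSV convention quoted above, where fields never contain tabs or newlines, and the sample data is made up for illustration.

```python
import csv
import io

# Restricted TSV: fields never contain tabs or newlines, so records and
# fields can be located by plain splitting, with no parser state.
tsv_data = "id\tnote\n1\thello world\n2\tplain text\n"
tsv_rows = [line.split("\t") for line in tsv_data.splitlines()]

# CSV: a quoted field may contain commas and newlines, so a reader has to
# track whether it is currently inside quotes.
csv_data = 'id,note\n1,"hello, world"\n2,"line one\nline two"\n'
naive_rows = [line.split(",") for line in csv_data.splitlines()]
csv_rows = list(csv.reader(io.StringIO(csv_data)))

print(len(tsv_rows))    # splitting is enough for the TSV sample
print(len(naive_rows))  # naive splitting miscounts the CSV records
print(len(csv_rows))    # a quote-aware parser recovers the real records
print(csv_rows[2])      # the embedded newline survives inside the field
```

Splitting on a fixed byte is the kind of shortcut that the no-delimiters-in-data rule makes safe, and it is exactly the shortcut that quoted CSV fields rule out.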

So, you might say, "well just convert all your CSV files to TSV." Nope. Non-starter. CSVs can encode any kind of arbitrary data. But as mentioned above, TSVs as implemented by tsv-utils cannot. They cannot handle cases where tabs and newlines occur inside a field. Even its csv2tsv tool admits to this, and will replace those characters with spaces by default, thereby making the conversion lossy.
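Why the replacement is lossy can be sketched in a few lines of Python. The `to_tsv_field` helper below is hypothetical and only mimics the replace-with-spaces behavior described above; it is not csv2tsv's actual code, and the sample data is invented.

```python
import csv
import io

def to_tsv_field(value: str) -> str:
    # Hypothetical helper: swap the characters that restricted TSV cannot
    # represent (tab, carriage return, newline) for a single space each.
    return value.replace("\t", " ").replace("\r", " ").replace("\n", " ")

# Three CSV records whose second fields differ only in tab/space/newline.
csv_data = 'a,"x\ty"\nb,"x y"\nc,"x\ny"\n'
rows = list(csv.reader(io.StringIO(csv_data)))

originals = [row[1] for row in rows]
converted = [to_tsv_field(value) for value in originals]

print(len(set(originals)))  # number of distinct values before conversion
print(len(set(converted)))  # number of distinct values after conversion
```

Three distinct field values go in and a single value comes out, so no reverse conversion can tell them apart again.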

So tsv-utils is great if you have TSV data. I did not. I had CSV and that CSV was not clean and nice. tsv-utils, even if it existed back then, wouldn't have worked for me.

tastyminerals commented 2 years ago

I see. It makes sense now. Many things tend to look simple only at first glance.

BurntSushi commented 2 years ago

Indeed. I've learned that same lesson many times.