Closed — pmenon closed 6 years ago
@tcm-marcel I didn't use mmap mostly because it wouldn't help performance. This is a giant sequential read, so IO prefetching will kick in; if not, we can fadvise a hint. We're doing `read()`, which bypasses the page cache, avoiding duplicated data. There is a software-engineering simplicity argument for mmap, but the interaction would be similar: instead of copying from a temporary buffer managed by us into a line buffer, we'd be copying from a temporary kernel mmap buffer into our line buffer.
@pmenon Thank you for the explanation! I didn't know `read()` bypasses the page cache.
@tcm-marcel I misspoke. `read()` will use the page cache unless we use direct IO, which we don't. I was thinking of `fread()`, which does library-level buffering that we avoid.
Summary
This PR adds support for psql's `COPY` command for bulk loading CSV files into the database. Using `COPY`, one can quickly load millions of rows into a table in a few seconds. For example, I was able to load 20M rows into a table with four integer columns in under three seconds; I also loaded an SF-1 `lineitem` table from TPC-H in about 20 seconds (which isn't great, but faster than through oltpbench). I find it very convenient for quickly loading a crap-tonne of data into the database to do benchmarks.

Right now, we only support CSV files, but the quoting, escaping, and delimiter characters can be configured. I tried to make the parser fairly robust to erroneous files, but we're not as generous as Postgres (which goes to great lengths to try to understand your CSV).
Modifications

- Support for configurable `format`, `delimiter`, `escape`, and `quote` characters.
- Added `ExternalFileScan` and `ExportExternalFile` operators to the optimizer for copy-from and copy-to.
- Added `CSVScanPlan` and `ExportExternalFilePlan` to the planner.
- Added `CSVScanTranslator` to codegen. Also added a runtime helper, `CSVScanner`, that accepts a callback function to invoke per row in the CSV.
- `functions` namespace.

Reviewers