cmu-db / peloton

The Self-Driving Database Management System
http://pelotondb.io
Apache License 2.0
2.03k stars 623 forks source link

Support COPY for CSV files #1371

Closed pmenon closed 6 years ago

pmenon commented 6 years ago

Summary

This PR adds support for psql's COPY command for bulk loading CSV files into the database. Using COPY, one can quickly load millions of rows into a table in a few seconds. For example, I was able to load 20M rows into a table with four integer columns in under three seconds; I also loaded an SF-1 lineitem table from TPC-H in about 20 seconds (which isn't great, but faster than through oltpbench). I find it very convenient to use this to quickly load a crap-tonne of data into the database fast to do benchmarks.

Right now, we only support CSV files, but the quoting, escaping, and delimiter characters can be configured. I tried to make the parser fairly robust to erroneous files, but we're not as generous as Postgres (which goes to great lengths to try to understand your CSV).

Modifications

  1. Changes in the parser to look for format, delimiter, escape, and quote characters.
  2. Added logical and physical ExternalFileScan and ExportExternalFile operators to optimizer for copy-from and copy-to.
  3. Added CSVScanPlan and ExportExternalFilePlan to planner.
  4. Added CSVScanTranslator to codegen. Also added runtime helper CSVScanner that accepts a callback function to invoke per-row in the CSV.
  5. Added integer, decimal, string, and date parsing logic into the functions namespace.
  6. Lots of tests.

Reviewers

coveralls commented 6 years ago

Coverage Status

Coverage decreased (-0.3%) to 77.066% when pulling ec280f99cfa6f91fc7baddc68bcb1874e8d516e0 on pmenon:csv into 881a8e6d34296d372593ac9714d6a71a5500f82c on cmu-db:master.

pmenon commented 6 years ago

@tcm-marcel I didn't use mmap mostly because it wouldn't help performance. This is a giant sequential read so IO prefetching will kick in. If not, we can fadvise a hint. We're doing read() which bypasses the page cache, avoiding duplicated data.

There is an argument in software engineering simplicity using mmap, but the interaction would be similar - instead of copying from a temporary buffer managed by us to a line buffer, we'd be copying from a temporary kernel mmap buffer into our line buffer.

tcm-marcel commented 6 years ago

@pmenon Thank you for the explanation! I didn't know read() bypasses the page cache.

pmenon commented 6 years ago

@tcm-marcel I misspoke. read() will use the page cache unless we use direct IO - we don't do this. I was thinking of fread() that does library-level buffering that we avoid.