Read-only Apache Parquet foreign data wrapper for PostgreSQL.
parquet_fdw
requires libarrow
and libparquet
installed in your system (requires version 0.15+, for previous versions use branch arrow-0.14). Please refer to libarrow installation page or building guide.
To build parquet_fdw
run:
make install
or in case when PostgreSQL is installed in a custom location:
make install PG_CONFIG=/path/to/pg_config
It is possible to pass additional compilation flags through either custom
CCFLAGS
or standard PG_CFLAGS
, PG_CXXFLAGS
, PG_CPPFLAGS
variables.
After extension was successfully installed run in psql
:
create extension parquet_fdw;
To start using parquet_fdw
one should first create a server and user mapping. For example:
create server parquet_srv foreign data wrapper parquet_fdw;
create user mapping for postgres server parquet_srv options (user 'postgres');
Now you should be able to create foreign table for Parquet files.
create foreign table userdata (
id int,
first_name text,
last_name text
)
server parquet_srv
options (
filename '/mnt/userdata1.parquet'
);
Currently parquet_fdw
supports the following column types:
Arrow type | SQL type |
---|---|
INT8 | INT2 |
INT16 | INT2 |
INT32 | INT4 |
INT64 | INT8 |
FLOAT | FLOAT4 |
DOUBLE | FLOAT8 |
TIMESTAMP | TIMESTAMP |
DATE32 | DATE |
STRING | TEXT |
BINARY | BYTEA |
LIST | ARRAY |
MAP | JSONB |
Currently parquet_fdw
doesn't support structs and nested lists.
Foreign table may be created for a single Parquet file and for a set of files. It is also possible to specify a user defined function, which would return a list of file paths. Depending on the number of files and table options parquet_fdw
may use one of the following execution strategies:
Strategy | Description |
---|---|
Single File | Basic single file reader |
Multifile | Reader which process Parquet files one by one in sequential manner |
Multifile Merge | Reader which merges presorted Parquet files so that the produced result is also ordered; used when sorted option is specified and the query plan implies ordering (e.g. contains ORDER BY clause) |
Caching Multifile Merge | Same as Multifile Merge , but keeps the number of simultaneously open files limited; used when the number of specified Parquet files exceeds max_open_files |
Following table options are supported:
ORDER BY
clause or in other cases when having a presorted set is beneficial (Group Aggregate, Merge Join);filename
or returned by files_func
are ordered according to sorted
option and have no intersection rangewise; this allows to use Gather Merge
node on top of parallel Multifile scan (default false
);false
);false
);JSONB
argument and return text array of full paths to parquet files;GUC variables:
true
);true
).true
).parquet_fdw
also supports parallel query execution (not to confuse with multi-threaded decoding feature of Apache Arrow).
parquet_fdw
also supports IMPORT FOREIGN SCHEMA
command to discover parquet files in the specified directory on filesystem and create foreign tables according to those files. It can be used as follows:
import foreign schema "/path/to/directory"
from server parquet_srv
into public;
It is important that remote_schema
here is a path to a local filesystem directory and is double quoted.
Another way to import parquet files into foreign tables is to use import_parquet
or import_parquet_explicit
:
create function import_parquet(
tablename text,
schemaname text,
servername text,
userfunc regproc,
args jsonb,
options jsonb)
create function import_parquet_explicit(
tablename text,
schemaname text,
servername text,
attnames text[],
atttypes regtype[],
userfunc regproc,
args jsonb,
options jsonb)
The only difference between import_parquet
and import_parquet_explicit
is that the latter allows to specify a set of attributes (columns) to import. attnames
and atttypes
here are the attributes names and attributes types arrays respectively (see the example below).
userfunc
is a user-defined function. It must take a jsonb
argument and return a text array of filesystem paths to parquet files to be imported. args
is user-specified jsonb object that is passed to userfunc
as its argument. A simple implementation of such function and its usage may look like this:
create function list_parquet_files(args jsonb)
returns text[] as
$$
begin
return array_agg(args->>'dir' || '/' || filename)
from pg_ls_dir(args->>'dir') as files(filename)
where filename ~~ '%.parquet';
end
$$
language plpgsql;
select import_parquet_explicit(
'abc',
'public',
'parquet_srv',
array['one', 'three', 'six'],
array['int8', 'text', 'bool']::regtype[],
'list_parquet_files',
'{"dir": "/path/to/directory"}',
'{"sorted": "one"}'
);