alitrack / duckdb_fdw

DuckDB Foreign Data Wrapper for PostgreSQL
MIT License
272 stars 18 forks source link

[question] Correct way to integrate with external database in remote storage (GCS) #42

Closed Mmoncadaisla closed 3 months ago

Mmoncadaisla commented 3 months ago

Context

Hi! we're attempting to run DuckDB via duckdb_fdw extension inside a PostgreSQL database that is running in our Kubernetes cluster via CNPG operator.

Our ultimate goal is to be able to leverage duckdb to manage data that is currently stored in (geo)parquet files in our Google Cloud Storage buckets.

The database pods already have the GOOGLE_APPLICATION_CREDENTIALS set and we've verified that accessing data in a GCS bucket is possible (we did this via raster2pgpsql, using out-db COG rasters that are stored in GCS)

We strongly prefer to have all the duckdb storage remotely (either having the database itself stored in GCS or working in-memory only and persisting data to .parquet files)

Problem description

UPDATE: I was able to load the extension using the ':memory:' or directory approach locally directly using the image from the shared Dockerfile without issues. One difference between both is that the base path in the Docker container is /var/lib/postgresql/data/ whereas in the CNPG pod it is /var/lib/postgresql/data/pgdata

We're able to load the extension and create a foreign server without issues, see below:

A) Creating the server using the in-memory option:

SELECT current_user;
--   current_user
--  --------------
--   postgres
--  (1 row)

-- DROP SERVER IF EXISTS duckdb_svr CASCADE;

CREATE SERVER duckdb_svr FOREIGN DATA WRAPPER duckdb_fdw OPTIONS (database ':memory:');
-- CREATE SERVER

SELECT duckdb_execute('duckdb_svr','CREATE TABLE test (a INTEGER, b INTEGER, c VARCHAR(10));');
--  duckdb_execute
--  ----------------
--  
--  (1 row)

CREATE FOREIGN TABLE public.terraclimate_duckdb_ft (
    h3             bigint,
    value          numeric
)
SERVER duckdb_svr
OPTIONS (
    table 'read_parquet("gs://cambium-earth-general/parquet/terraclimate/*.parquet")'
);
-- CREATE FOREIGN TABLE

SELECT duckdb_execute('duckdb_svr','CREATE PERSISTENT SECRET (TYPE GCS, KEY_ID ''GOOG1EXXXXXXXXXX'', SECRET ''XXXXXXXXXX'');');
-- ERROR:  SQL error during prepare: Extension Autoloading Error: An error occurred while trying to automatically install the required extension 'httpfs':
-- Failed to create directory "/var/lib/postgresql/.duckdb/"! CREATE PERSISTENT SECRET (TYPE GCS, KEY_ID 'GOOG1EXXXXXXXXXX', SECRET 'XXXXXXXXXX');

B) Creating the server using local storage

SELECT current_user;
-- current_user
--  --------------
--   postgres
--  (1 row)

-- DROP SERVER IF EXISTS duckdb_svr CASCADE;

CREATE SERVER duckdb_svr FOREIGN DATA WRAPPER duckdb_fdw OPTIONS (database 'db');
-- CREATE SERVER

SELECT duckdb_execute('duckdb_svr','CREATE TABLE test (a INTEGER, b INTEGER, c VARCHAR(10));');
--  duckdb_execute
--  ----------------
--  
--  (1 row)

CREATE FOREIGN TABLE public.terraclimate_duckdb_ft (
    h3             bigint,
    value          numeric
)
SERVER duckdb_svr
OPTIONS (
    table 'read_parquet("gs://cambium-earth-general/parquet/terraclimate/*.parquet")'
);
-- CREATE FOREIGN TABLE

SELECT duckdb_execute('duckdb_svr','CREATE PERSISTENT SECRET (TYPE GCS, KEY_ID ''GOOG1EXXXXXXXXXX'', SECRET ''XXXXXXXXXX'');');
-- ERROR:  SQL error during prepare: Extension Autoloading Error: An error occurred while trying to automatically install the required extension 'httpfs':
-- Failed to create directory "/var/lib/postgresql/.duckdb/"! CREATE PERSISTENT SECRET (TYPE GCS, KEY_ID 'GOOG1EXXXXXXXXXX', SECRET 'XXXXXXXXXX');

I believe that the 'Failed to create directory' error could be related to permissions to create the directory, perhaps related to CNPG internals.

C) Creating the server using remote storage

SELECT current_user;
-- current_user
--  --------------
--   postgres
--  (1 row)

-- DROP SERVER IF EXISTS duckdb_svr CASCADE;

CREATE SERVER duckdb_svr FOREIGN DATA WRAPPER duckdb_fdw OPTIONS (database 'gs://cambium-earth-general/test.db');
-- CREATE SERVER

SELECT duckdb_execute('duckdb_svr','CREATE TABLE test (a INTEGER, b INTEGER, c VARCHAR(10));');
-- ERROR:  failed to open SQLite DB. rc=1 path=gs://cambium-earth-general/test.db

Questions

  1. What is the recommended approach for leveraging a remote storage such as Google Cloud Storage as in this case, to store the database information instead of using a local postgres route?

  2. Is it possible to load remote extensions using the ':memory:' option?

Additional information

Versions

Dockerfile

Expand Dockerfile ```dockerfile # Specify the exact base image version for consistency FROM postgis/postgis:14-3.4 # Set environment variables ENV POSTGRES_VERSION 14 ENV POSTGIS_VERSION 3 # Do not split the description, otherwise we will see a blank space in the labels LABEL name="PostgreSQL + PostGIS Container Images" \ vendor="The CloudNativePG Contributors" \ version="${PG_VERSION}" \ release="42" \ summary="PostgreSQL + PostGIS Container images." \ description="This Docker image contains PostgreSQL, PostGIS and Barman Cloud based on Postgres 14-3.4." # Copy requirements files COPY requirements.txt / COPY requirements-udf.txt / # Install packages excluding cmake RUN apt-get update && \ apt-get install -y --no-install-recommends \ python3-pip python3-psycopg2 python3-setuptools wget make gcc \ libtool unzip pgxnclient libpq-dev postgresql-server-dev-${POSTGRES_VERSION} \ lsb-release \ ca-certificates build-essential autoconf automake zlib1g-dev \ libxml2-dev postgresql-server-dev-all git \ libboost-atomic1.74.0 libboost-chrono1.74.0 libboost-graph1.74.0 \ libboost-date-time1.74.0 libboost-program-options1.74.0 libboost-system1.74.0 \ libboost-thread1.74.0 libboost-dev && \ apt-get install -y python3 postgresql-plpython3-${POSTGRES_VERSION} && \ pip3 install --upgrade pip && \ pip3 install -r requirements-udf.txt && \ pip3 install --no-deps -r requirements.txt # Install additional extensions RUN set -xe; \ apt-get update; \ apt-get install -y --no-install-recommends \ "postgresql-${POSTGRES_VERSION}-pgaudit" \ "postgresql-${POSTGRES_VERSION}-pg-failover-slots" \ "postgresql-${POSTGRES_VERSION}-pgrouting" \ ; \ rm -fr /tmp/* ; \ rm -rf /var/lib/apt/lists/*; # Install barman-cloud RUN set -xe; \ apt-get update; \ apt-get install -y --no-install-recommends \ python3-pip \ python3-psycopg2 \ python3-setuptools \ ; \ pip3 install --upgrade pip; \ # TODO: Remove --no-deps once https://github.com/pypa/pip/issues/9644 is solved pip3 install --no-deps -r requirements.txt; \ rm -rf /var/lib/apt/lists/*; # Install TimescaleDB RUN echo "deb https://packagecloud.io/timescale/timescaledb/debian/ $(lsb_release -c -s) main" | tee /etc/apt/sources.list.d/timescaledb.list && \ wget --quiet -O - https://packagecloud.io/timescale/timescaledb/gpgkey | apt-key add - && \ apt-get update && \ apt-get install -y "timescaledb-2-postgresql-${POSTGRES_VERSION}" # Manually install a compatible version of cmake RUN apt-get remove -y cmake && \ wget https://github.com/Kitware/CMake/releases/download/v3.24.1/cmake-3.24.1-Linux-x86_64.sh -O /tmp/cmake-install.sh && \ chmod +x /tmp/cmake-install.sh && \ /tmp/cmake-install.sh --skip-license -- prefix=/usr/local && \ rm /tmp/cmake-install.sh && \ ln -sf /usr/local/bin/cmake /usr/bin/cmake # Clone h3-pg repository RUN git clone https://github.com/zachasme/h3-pg.git # Replace the FindPostgreSQL.cmake with the custom version COPY FindPostgreSQL.cmake /h3-pg/cmake/FindPostgreSQL.cmake # Build h3-pg RUN cd h3-pg && \ cmake -B build -DCMAKE_BUILD_TYPE=Release -DPostgreSQL_FIND_VERSION=14 && \ cmake --build build && \ cmake --install build --component h3-pg # Clone duckdb_fdw repository RUN git clone https://github.com/alitrack/duckdb_fdw # Adjust the path to pg_config as necessary ENV PG_CONFIG=/usr/lib/postgresql/14/bin/pg_config # duckdb foreign data wrapper installation RUN cd duckdb_fdw && \ wget -c https://github.com/duckdb/duckdb/releases/download/v0.10.1/libduckdb-linux-amd64.zip && \ unzip -o libduckdb-linux-amd64.zip && \ cp libduckdb.so $($PG_CONFIG --libdir) && \ make USE_PGXS=1 PG_CONFIG=$PG_CONFIG && \ make install USE_PGXS=1 PG_CONFIG=$PG_CONFIG # Install pointcloud extension RUN apt-get update \ && apt-get install -y --no-install-recommends \ git \ ca-certificates \ build-essential \ autoconf \ automake \ cmake \ zlib1g-dev \ postgresql-server-dev-all \ libxml2-dev \ && rm -rf /var/lib/apt/lists/* \ && git clone https://github.com/verma/laz-perf.git \ && cd laz-perf \ && cmake . \ && make \ && make install \ && cd .. \ && rm -r laz-perf \ && git clone https://github.com/pgpointcloud/pointcloud \ && cd pointcloud \ && ./autogen.sh \ && ./configure --with-lazperf=/usr/local --with-pgconfig=/usr/lib/postgresql/${POSTGRES_VERSION}/bin/pg_config CFLAGS="-Wall -Werror -O2 -g" \ && make \ && make install \ && apt-get purge -y --auto-remove \ git \ ca-certificates \ build-essential \ autoconf \ automake \ cmake \ zlib1g-dev \ postgresql-server-dev-all \ libxml2-dev # Final Cleanup RUN apt-get purge -y --auto-remove wget lsb-release && \ apt-get clean && \ rm -rf /var/lib/apt/lists/* # Change the uid of postgres to 26 RUN usermod -u 26 postgres # Set user to postgres USER 26 ```
Mmoncadaisla commented 3 months ago

Ok, the issue seems to be with writing to the /var/lib/postgresql/ directory, DuckDB attempts to create a hidden directory named .duckdb within it, where it attempts to create the managed_secrets and extensions subfolders, without success when it runs within CNPG

Mmoncadaisla commented 3 months ago

Resolved by setting the following options (note that these are not directly allowed at the CREATE SERVER <> FOREIGN DATA WRAPPER <> duckdb_fdw OPTIONS syntax) thanks to the hint from the CNPG community.

SELECT duckdb_execute('duckdb_svr', 'SET extension_directory = ''/var/lib/postgresql/data/pgdata/.duckdb/extensions'';');
SELECT duckdb_execute('duckdb_svr', 'SET secret_directory = ''/var/lib/postgresql/data/pgdata/.duckdb/stored_secrets'';');