adjust / parquet_fdw

Parquet foreign data wrapper for PostgreSQL
PostgreSQL License
333 stars 37 forks source link

auto load table definition #8

Closed benjdv closed 4 years ago

benjdv commented 5 years ago

Not an issue but a idea of improvement. Don't know if it's really possible but it will be useful to have the ability to auto detect/extract the parquet fields and automatically build the foreign table Example: CREATE FOREIGN TABLE domain_pqt () server parquet_srv OPTIONS ( filename 'myfile.parquet', AUTOFIELD );

zilder commented 5 years ago

Hi @benjdv,

this is actually a good idea. Postgres already has the IMPORT FOREIGN SCHEMA syntax for this kind of tasks and corresponding routine for custom FDW's. I think I'll add this feature when i have more time.

zilder commented 4 years ago

Hi @benjdv,

Here is an experimental branch implementing IMPORT FOREIGN SCHEMA command: https://github.com/adjust/parquet_fdw/tree/import_schema

It would be great if you could test it on your data.

zilder commented 4 years ago

Merged into the master branch

benjdv commented 4 years ago

Appreciate this new possibility and have some feedback and thinking : 1) Passing (by error) a file instead of a directory produce a lose of connection (an error message will be better)

2) The naming is not really practical : the .parquet file from different PARQUET directory may have same name (example : 0_0_0.parquet => tablename 0_0_0 which is very common). So it's always possible to import in an "temporary" schema then rename and move table in the real destination, but it's not really satisfactory. Postgres propose an OPTIONS parameters (NOT conform to SQL standard) that could allow to specify a prefix/suffix for the tablename, example:

IMPORT FOREIGN SCHEMA "/tmp/File_20180601_pqt" LIMIT TO ("0_0_0", "0_0_1")
FROM SERVER parquet_srv
INTO public
OPTIONS (prefix 'File_20180601_');
=>
File_20180601_0_0_0 (instead of 0_0_0)
File_20180601_0_0_1 (instead of 0_0_1)

3) In most of the case imported PARQUET have several subfile (.parquet), and the idea is to create an empty table with the correct column and create foreign table that inherit from this "mother" table. Maybe this will be facilitated with OPTIONS like:

benjdv commented 4 years ago

Pending possible improvement, I use my own plpgsql function to do the job. I make it available here, hoping that it can help someone:

CREATE OR REPLACE FUNCTION public.import_parquet(
    parquet_path text,
    inherits_from text,
    id_parquet text DEFAULT ''::text,
    drop_table boolean DEFAULT false,
    create_table boolean DEFAULT true,
    stats_table boolean DEFAULT false)
  RETURNS void AS
$BODY$
-- Import all *.parquet from an Apache Parquet file with unheritance of a (Mother) specified table
-- If Mother table does not exists, it is created based on the first .parquet analysed
--
-- parquet_path  : path of the Apache Parquet file
-- inherits_from : inherited table
-- id_parquet    : prefix the tablename generated from the .parquet (with id_parquet="exp", /example/test_pqt/t1.parquet => exp_t1 table)
-- drop_table    : (if true) drop if exists, each table referent from Parquet file
-- create_table  : (if true) create each table from Parquet file
-- stats_table   : (if true) make analyze on each table referent from Parquet file
--
-- using example : SELECT * FROM public.import_parquet('/tmp/Identity/2018/201806/File_pqt','Identity', 'File201806', false, true, true);
DECLARE
 path TEXT;
 fn TEXT;
 foreign_name TEXT;
 id text;
 b boolean DEFAULT true;
BEGIN
 CREATE TEMP TABLE t_files(filename text) ON COMMIT DROP;
 EXECUTE format($$COPY t_files FROM PROGRAM 'find %s -maxdepth 1 -type f -name "*.parquet"'$$, parquet_path);

 FOR path IN (SELECT filename FROM t_files ) LOOP
  fn:=(regexp_match(path,'(?:.*/|^)([^/]*).parquet'))[1];
  foreign_name:=inherits_from || '_' || id_parquet || '_' || fn;

  IF drop_table THEN
   EXECUTE 'DROP FOREIGN TABLE IF EXISTS ' || quote_ident(foreign_name);
  END IF;

  IF create_table THEN
   /* Create Mother table if needed */
   if b THEN
    b:=false;
    IF to_regclass(inherits_from) IS NULL THEN /* Mother table doesn't exists */
     RAISE NOTICE 'TABLE % does not exists - Table will be create with schema of %', inherits_from, path;
     id:='import_tmp_' || pg_backend_pid()::text || '_' || txid_current()::text;
     EXECUTE 'CREATE SCHEMA ' || id || ';'; /* Create a "temporary" schema to isolate following operations */

     EXECUTE 'IMPORT FOREIGN SCHEMA ' || quote_ident(parquet_path)
     || ' LIMIT TO (' || quote_ident(fn) || ')'
     || ' FROM SERVER parquet_srv INTO ' || id || ';';
     EXECUTE 'CREATE TABLE public.' || quote_ident(inherits_from) || '(LIKE ' || id || '.' || quote_ident(fn) || ')' ;
     EXECUTE 'DROP SCHEMA ' || id || ' CASCADE;'; /* Drop temporary schema */ 
    END IF;   
   END IF;

   EXECUTE 'CREATE FOREIGN TABLE ' || quote_ident(foreign_name) || ' ()'
   || ' INHERITS (' || quote_ident(inherits_from) || ')'
   || ' SERVER parquet_srv'
   || ' OPTIONS ('
   || '  filename ''' || path || ''''
   || ' )';
   EXECUTE 'COMMENT ON FOREIGN TABLE public.' || quote_ident(foreign_name) || ' IS ''inherits from ' || quote_ident(inherits_from) || '''';
  END IF;

  IF stats_table THEN
   EXECUTE 'ANALYSE ' || quote_ident(foreign_name);
  END IF;
  --RAISE INFO '% done', path;
 END LOOP;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE STRICT
  COST 1000;
ALTER FUNCTION public.import_parquet(text, text, text, boolean, boolean, boolean)
  OWNER TO postgres;
zilder commented 4 years ago
  1. Passing (by error) a file instead of a directory produce a lose of connection (an error message will be better)

It actually was a bug, fixed it in the master.

As for 2 and 3 i'm not sure it should be a part of fdw. There could be different use-cases and they can actually be handled by a plpgsql function (like the one you posted earlier). For example, you want new foreign tables to inherit some parent table, someone else might want them instead to be attached to partitioned table or add extra constraints to each table. Or what if imported tables have different set of columns? And so on. I would rather keep interface simple and let user to implement extra logic on their own.

benjdv commented 4 years ago

It makes senses. I agree with you that there is some various needs and usages. All of them can of course be solved by (plpgsql) functions. And there probably bigger priority today than these points (like support of struct).

I just want to give few ideas for optional possibility that could help to save time and make it easy to adopt for (new) users.