dimitri / pgloader

Migrate to PostgreSQL in a single command!
http://pgloader.io

Redshift syntax #860

Closed tendres closed 6 years ago

tendres commented 6 years ago

I am looking to migrate some tables from Redshift to Postgres and have been unable to get the syntax correct - is there any documentation I am missing? When I use a redshift:// connection string, pgloader appears to believe it is an SQLite connection. I'm looking for a robust example to follow - any help appreciated!

LOAD DATABASE
FROM  redshift://USER:PASSWORD@ENDPOINT:PORT/DB_NAME
INTO postgresql://USER:PASSWORD@ENDPOINT:PORT/DB_NAME

WITH on error stop,
concurrency = 1,
workers = 8,
prefetch rows = 20000

SET PostgreSQL PARAMETERS
search_path to 'some_schema'

INCLUDING ONLY TABLE NAMES LIKE 'some_table';
 LOG Migrating from #<SQLITE-CONNECTION sqlite://.....

Code CANTOPEN: no message.
An unhandled error condition has been signalled:
   Failed to open sqlite file #P".....
dimitri commented 6 years ago

Well, Redshift uses the PostgreSQL protocol, so just use postgresql:// for the connection string. pgloader auto-detects when it's talking to a Redshift server rather than a PostgreSQL server. Also, you need to use the current git master version of pgloader (built from fresh sources) to have working Redshift support.
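
In other words, keep the rest of your load command as it is and just change the source line to use the same scheme as the target, e.g.:

FROM  postgresql://USER:PASSWORD@ENDPOINT:PORT/DB_NAME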

tendres commented 6 years ago

Ah good - thanks. Built from fresh sources: pgloader version "3.5.f8460c1" compiled with SBCL 1.3.1.debian.

Now getting the following error from Redshift:

ERROR Database error 42883: function string_agg(name, "unknown") does not exist
HINT: No function matches the given name and argument types. You may need to add explicit type casts.
QUERY: -- params: including
--         filter-list-to-where-clause for including
--         excluding
--         filter-list-to-where-clause for excluding
  select n.nspname,
         i.relname,
         i.oid,
         rn.nspname,
         r.relname,
         indisprimary,
         indisunique,
         (select string_agg(attname, ',')
            from pg_attribute
           where attrelid = r.oid
             and array[attnum::integer] <@ indkey::integer[]
         ) as cols,
         pg_get_indexdef(indexrelid),
         c.conname,
         pg_get_constraintdef(c.oid)
    from pg_index x
         join pg_class i ON i.oid = x.indexrelid
         join pg_class r ON r.oid = x.indrelid
         join pg_namespace n ON n.oid = i.relnamespace
         join pg_namespace rn ON rn.oid = r.relnamespace
         left join pg_depend d on d.classid = 'pg_class'::regclass
                              and d.objid = i.oid
                              and d.refclassid = 'pg_constraint'::regclass
                              and d.deptype = 'i'
         left join pg_constraint c ON c.oid = d.refobjid
   where n.nspname !~ '^pg_' and n.nspname <> 'information_schema'

order by n.nspname, r.relname

When using:

LOAD DATABASE
FROM  postgresql://USER:PASSWORD@ENDPOINT:PORT/DB_NAME
INTO postgresql://USER:PASSWORD@ENDPOINT:PORT/DB_NAME

WITH on error stop,
concurrency = 1,
workers = 8,
prefetch rows = 20000

SET PostgreSQL PARAMETERS
search_path to 'some_schema'

INCLUDING ONLY TABLE NAMES MATCHING 'some_table' IN SCHEMA 'some_schema';

Maybe LISTAGG instead of string_agg for Redshift? (See the LISTAGG function in the Redshift documentation.)
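
Something along the lines of the following, perhaps (only a sketch; the ::text cast is a guess to sidestep the name type, and I'm not sure Redshift even allows LISTAGG against its catalog tables):

         (select listagg(attname::text, ',') within group (order by attnum)
            from pg_attribute
           where attrelid = r.oid
             and array[attnum::integer] <@ indkey::integer[]
         ) as cols,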

tendres commented 6 years ago

Forgot to mention the Redshift instance is still up with your same credentials if you need a place to test.

dimitri commented 6 years ago

Thanks for the notice about the testing environment; it made me realise I could probably fix the problem at hand quite easily... and I think I did. Please try again from a fresh build.

tendres commented 6 years ago

@dimitri So close! Now an error about the function identity on the postgres side:

2018-11-08T16:02:19.468000Z ERROR Database error 42883: function identity(integer, integer, text) does not exist
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
QUERY: CREATE TABLE "staging"."work" 
(
  "id"             integer default "identity"(137710, 0, '1,1'::text),
  "source_id"      integer default NULL,
  "data_source_id" integer default NULL,
  "name"           text default NULL,
  "created_at"     timestamp without time zone default CURRENT_TIMESTAMP
);
2018-11-08T16:02:19.471000Z FATAL Failed to create the schema, see above.
2018-11-08T16:02:19.475000Z LOG report summary reset
       table name     errors       rows      bytes      total time
-----------------  ---------  ---------  ---------  --------------
  fetch meta data          0          3                     0.156s
   Create Schemas          0          0                     0.002s
 Create SQL Types          0          0                     0.009s
    Create tables          0          0                     0.000s
-----------------  ---------  ---------  ---------  --------------
-----------------  ---------  ---------  ---------  --------------
tendres commented 6 years ago

The DDL from Redshift and RDS looks identical apart from the auto-increment parts. Redshift:

CREATE TABLE "staging"."work" (
  "id" int4 NOT NULL DEFAULT "identity"(137710, 0, '1,1'::text),...

RDS:

CREATE TABLE "staging"."work" (
  "id" int4 NOT NULL DEFAULT nextval('"staging".work_id_seq'::regclass),...
dimitri commented 6 years ago

This table staging.work isn't part of the test case you gave me access to. Can you add it, so that I can reproduce and fix the error? Thanks.

tendres commented 6 years ago

I created a table staging.work containing 1000 records and a view to look at the DDL of any existing tables. Running:

SELECT
    ddl 
FROM
    admin.v_generate_tbl_ddl 
WHERE
    schemaname = 'staging' 
    AND tablename = 'work';

returns

CREATE TABLE IF NOT EXISTS staging."work"
(
    id INTEGER NOT NULL DEFAULT "identity"(347358, 0, ('1,1'::character varying)::text) ENCODE lzo
    ,source_id INTEGER NOT NULL  ENCODE lzo
    ,data_source_id INTEGER NOT NULL  ENCODE lzo
    ,name VARCHAR(4000) NOT NULL  ENCODE lzo
    ,created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT getdate() ENCODE lzo
    ,PRIMARY KEY (id)
    ,UNIQUE (source_id, data_source_id)
)
DISTSTYLE EVEN
;

After some experimentation I found that the '1,1' in the identity line refers to the start value (seed) and the step value respectively. I think 347358 is the table's oid, but I am not sure about the ,0,. I do not see a reference to a next value for a sequence.
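
In plain PostgreSQL terms that seed and step would presumably map onto an ordinary sequence, something like (using the sequence name from the RDS DDL above):

CREATE SEQUENCE staging.work_id_seq START WITH 1 INCREMENT BY 1;
ALTER TABLE staging.work ALTER COLUMN id SET DEFAULT nextval('staging.work_id_seq');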

Hope this helps and thank you again Dimitri.

dimitri commented 6 years ago

Now that I have a table using an identity default value from Redshift, I could fix the casting process to switch to using serial/bigserial instead. Note that I didn't implement parsing the seed and step values at this point; it's all treated the same way as a MySQL auto_increment “extra” hint. Hope that works for you; otherwise, as usual, please comment here.
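
With that change the table from the earlier error should now be created roughly like this (a sketch based on the CREATE TABLE statement pgloader emitted above, with the identity default replaced by serial):

CREATE TABLE "staging"."work"
(
  "id"             serial,
  "source_id"      integer default NULL,
  "data_source_id" integer default NULL,
  "name"           text default NULL,
  "created_at"     timestamp without time zone default CURRENT_TIMESTAMP
);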

tendres commented 6 years ago

This. Is. Amazing. Thank you Dimitri!

vmatt commented 3 years ago

I'm creating a copy script that copies tables between clusters via S3. However, I was afraid that the table_id would cause issues, so I decided to create a regex to match it. Matching group 4 contains the starting value and the increment; you can use Python's re and group(4) to get the values.

In my use case, if a match is found it is removed/replaced, otherwise the original value is left alone: _id BIGINT NOT NULL DEFAULT defaultidentity(1004356, 0, '100,1'::text) ENCODE az64 -> id BIGINT NOT NULL identity(100, 1)

Here's my regex: (default)?(\"|\_)identity[\"]?\([0-9]{3,}\,\s[0-9]+\,\s(\()?(\'[0-9]+\,[0-9]+\')?\:\:.+\)
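
A minimal Python sketch of how that pattern can be applied, using the "identity" default from the staging.work DDL earlier in this thread (the exact replacement text is of course up to the copy script):

import re

# vmatt's pattern, copied verbatim; capture group 4 holds the quoted 'seed,step' literal
IDENTITY_RE = re.compile(
    r"(default)?(\"|\_)identity[\"]?\([0-9]{3,}\,\s[0-9]+\,\s(\()?(\'[0-9]+\,[0-9]+\')?\:\:.+\)"
)

# column line taken from the staging.work DDL shown earlier in this thread
line = """id INTEGER NOT NULL DEFAULT "identity"(347358, 0, ('1,1'::character varying)::text) ENCODE lzo"""

m = IDENTITY_RE.search(line)
if m:
    seed, step = m.group(4).strip("'").split(",")
    # swap the Redshift-specific default for a plain identity(seed, step) clause
    line = IDENTITY_RE.sub(f"identity({seed}, {step})", line)

print(line)  # id INTEGER NOT NULL DEFAULT identity(1, 1) ENCODE lzo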