coursera / dataduct

DataPipeline for humans.
Other
252 stars 83 forks source link

UPSERT does not respect CHAR size for temporary tables #257

Open florian-niefind opened 7 years ago

florian-niefind commented 7 years ago

Used: dataduct 0.4.0

The auto-generated sql-query for the upsert step does not carry over the size of a CHAR column. It seems to only work with VARCHAR columns. Example: My Table definitions for the upsert step looks like:

CREATE TABLE IF NOT EXISTS staging.example (col1 VARCHAR(20), col2_buggy CHAR(2));

CREATE TABLE IF NOT EXISTS public.example (col1 VARCHAR(20), col2_buggy CHAR(2));

And the auto-generated sql-query (which is fortunately printed to stdout) for the upsert will look like :

CREATE TEMPORARY TABLE example_temp (col1 VARCHAR(20), col2_buggy CHAR);
INSERT INTO user_activity_temp (SELECT * FROM staging.user_activity LIMIT 10);
and so on...

col2 in the temporary table is thus generated with the default size (1) and carryover fails. Pipeline will fail giving me:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/dataduct/steps/executors/runner.py", line 100, in sql_runner
    cursor.execute(sql_query)
psycopg2.InternalError: Value too long for character type
DETAIL:  
  -----------------------------------------------
  error:  Value too long for character type
  code:      8001
  context:   Value too long for type character(1)
  query:     8143
  location:  funcs_string.hpp:392
  process:   query0_25 [pid=31361]

Everything works fine if i change col2 ro VARCHAR as well.

EDIT: traced the problem here https://github.com/coursera/dataduct/blob/develop/dataduct/database/parsers/utils.py#L26-L28

CHAR pattern definition does not include Word(alphanums)

seguschin commented 7 years ago

You not need to use upsert step as RedshiftCopyActivity supports upsert internally, use option insert_mode: OVERWRITE_EXISTING

sergei