dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0

CSV file with "|" delimiter gets data offset to wrong column. #77

Closed LagerVegard closed 3 years ago

LagerVegard commented 3 years ago

Describe the bug

Azure Blob Storage & Snowflake

I'm able to import the data using standard file formatting in Snowflake: COPY INTO CASHIER FROM @my_azure_stage. When I try to replicate this with the package, I get the error "String '2015-07-29 15:08:23.0000000' is too long and would be truncated". After altering the columns I can see that the string has shifted into the next column and that the last column is missing.

Steps to reproduce

Failing row of data:

1|2015-07-29 15:08:17.0000000|99|99|1|Extenda Support|0505|0.0000|0.0000|0|5|8388607||0|0x00000000D5EE2845|0.0000|0|0|1||2015-07-29 15:08:23.0000000|False

CSV file format:

create or replace file format lcsv
  TYPE = CSV
  FIELD_DELIMITER = '|'
  SKIP_HEADER = 0

schema.yml:

version: 2

sources:

Expected results

PASSWORDCHANGEDDATE = 2015-07-29 15:08:23.0000000
PASSWORDRESETFLAG = False

Actual results

PASSWORDCHANGEDDATE =
PASSWORDRESETFLAG = 2015-07-29 15:08:23.0000000

System information

packages:

Which database are you using dbt with? Snowflake

The output of dbt --version:

installed version: 0.19.0
   latest version: 0.19.0

Up to date!

Plugins:

The operating system you're using: Ubuntu subsystem (ubuntu:latest Docker image)
The output of python --version: Python 3.8.5

jtcohen6 commented 3 years ago

Hey @LagerVegard, I'm not sure exactly what's gone wrong here! I copy-pasted your line of errant data, split it on the | delimiter, and got back the right number of columns (22).

Have you tried copying the create external table DDL from logs/dbt.log and running it yourself in the Snowflake console, to see if the same thing goes wrong?
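Another quick check: query the stage directly with the same file format, so you can see how Snowflake splits each row independently of the external table definition. Something like this (using the stage and file format names from your example) should do it:

```sql
-- Inspect how Snowflake parses the staged file with the 'lcsv' file format.
-- $1 .. $22 are the positional fields of each pipe-delimited row.
select
    $1  as hqid,
    $21 as passwordchangeddate,   -- expect '2015-07-29 15:08:23.0000000'
    $22 as passwordresetflag      -- expect 'False'
from @my_azure_stage (file_format => 'lcsv')
limit 10;
```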

The dbt-external-tables package exists to compile and execute boilerplate DDL/DML to create and update external tables, based on the table and column information included in your .yml files. It cannot actually guarantee that those DDL/DML statements will properly reflect the data in external storage; that's between you, Azure blob, and Snowflake to figure out. If you find that the package should be templating different DDL/DML, that's something we can make happen with a code change in this repo.
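For context on how a column shift can happen: the Snowflake implementation of this package maps each entry under columns: to value:c1, value:c2, ... purely by position, so a single missing or extra entry shifts every later column by one. A minimal sketch of the yml shape it expects (the source name here is just a placeholder):

```yaml
version: 2

sources:
  - name: azure_external           # placeholder source name
    tables:
      - name: cashier
        external:
          location: '@my_azure_stage'
          file_format: lcsv
        columns:                   # one entry per field, in file order
          - name: hqid
            data_type: number(38)
          - name: lastupdated
            data_type: timestamp
          # ... remaining fields, in the same order as the file
```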

LagerVegard commented 3 years ago

Looking at logs/dbt.log, I see that I'm losing a column in the definition.

create or replace external table ADF.dbt_user.table (
    HQID NUMBER(38) as (nullif(value:c1,'')::NUMBER(38)),
    LASTUPDATED TIMESTAMP as (nullif(value:c2,'')::TIMESTAMP),
    NUMBER VARCHAR(9) as (nullif(value:c3,'')::VARCHAR(9)),
    STOREID NUMBER(38) as (nullif(value:c4,'')::NUMBER(38)),
    ID NUMBER(38) as (nullif(value:c5,'')::NUMBER(38)),
    NAME VARCHAR(50) as (nullif(value:c6,'')::VARCHAR(50)),
    PASSWORD VARCHAR(512) as (nullif(value:c7,'')::VARCHAR(512)),
    FLOORLIMIT NUMBER(38) as (nullif(value:c8,'')::NUMBER(38)),
    RETURNLIMIT NUMBER(38) as (nullif(value:c9,'')::NUMBER(38)),
    CASHDRAWERNUMBER NUMBER(38) as (nullif(value:c10,'')::NUMBER(38)),
    SECURITYLEVEL NUMBER(38) as (nullif(value:c11,'')::NUMBER(38)),
    PRIVILIGES NUMBER(38) as (nullif(value:c12,'')::NUMBER(38)),
    EMAILADDRESS VARCHAR(255) as (nullif(value:c13,'')::VARCHAR(255)),
    FAILEDLOGINATTEMPTS NUMBER(38) as (nullif(value:c14,'')::NUMBER(38)),
    DBTIMESTAMP VARCHAR(50) as (nullif(value:c15,'')::VARCHAR(50)),
    MAXOVERSHORTAMOUNT NUMBER(38) as (nullif(value:c16,'')::NUMBER(38)),
    --missing
    OVERSHORTLIMITTYPE NUMBER(38) as (nullif(value:c17,'')::NUMBER(38)),
    AUTOID NUMBER(38) as (nullif(value:c18,'')::NUMBER(38)),
    TELEPHONE VARCHAR(30) as (nullif(value:c19,'')::VARCHAR(30)),
    PASSWORDCHANGEDDATE VARCHAR(50) as (nullif(value:c20,'')::VARCHAR(50)),
    PASSWORDRESETFLAG VARCHAR(10) as (nullif(value:c21,'')::VARCHAR(10))
)
location = @my_azure_stage

LagerVegard commented 3 years ago

Found the issue.

  - name: OVERSHORTLIMITTYPE
    data_type: NUMBER(38)
    description: 'Over Short Limit Type'

was missing from the columns list in schema.yml.
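For the record, the entry has to go back at its original position in the columns: list, not just appended at the end, since the generated definition numbers the columns in list order. Roughly (surrounding entries abbreviated):

```yaml
        columns:
          # ... entries for the columns before it, in file order ...
          - name: OVERSHORTLIMITTYPE      # the missing entry, added back
            data_type: NUMBER(38)
            description: 'Over Short Limit Type'
          # ... entries for the columns after it, in file order ...
```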