airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other
15.4k stars 3.97k forks source link

Destination S3: CSV Configuration #10145

Open danjsiegel opened 2 years ago

danjsiegel commented 2 years ago

Configuration of the CSV for target S3 data.

https://airbytehq-team.slack.com/archives/C019WJFR1JT/p1643920535301539

harshithmullapudi commented 2 years ago

Hey @danjsiegel it would be great if you can share the usecase you have for this requirement

danjsiegel1 commented 2 years ago

Use case would be I have to deliver data and there can be multiple formatting issues with how the data can be ingested. The people downstream are particularly picky about the format, but the ability to add a header and escape special characters should be pretty self evident.

ameyabapat-bsft commented 2 years ago

Any update on this task? In our usecase snowflake -> AWS S3 (CSV-root normalization), due to absence of string quoting and escape characters, nested json values in snowflake columns appear on separate rows instead of single row in output CSV file. It disturbs our csv parser's working. Can we add such configurations in target CSV S3 data?

ameyabapat-bsft commented 2 years ago

Snowflake Table:

create table demonstration1 (
    id integer,
    array1 array,
    variant1 variant,
    object1 object
    );

insert into demonstration1 (id, array1, variant1, object1) 
  select 
    1, 
    array_construct(1, 2, 3), 
    parse_json(' { "key1": "value1", "key2": "value2" } '),
    parse_json(' { "outer_key1": { "inner_key1A": "1a", "inner_key1B": "1b" }, '
              ||
               '   "outer_key2": { "inner_key2": 2 } } ')
    ;

Content of CSV in S3 after the airbyte sync is following. It has single row in snowflake but multiple rows in CSV file in S3.

"_airbyte_ab_id","_airbyte_emitted_at","ARRAY1","ID","OBJECT1","VARIANT1"
"102cf1b2-d470-4aef-a7bc-61d8aeeaddc6","1645773035141","[
  1,
  2,
  3
]","1","{
  ""outer_key1"": {
    ""inner_key1A"": ""1a"",
    ""inner_key1B"": ""1b""
  },
  ""outer_key2"": {
    ""inner_key2"": 2
  }
}","{
  ""key1"": ""value1"",
  ""key2"": ""value2""
}"

Can we solve it with string quoting or escape characters?

ameyabapat-bsft commented 2 years ago

Any updates on this ticket?