duckdb / duckdb_aws

MIT License
34 stars 12 forks source link

Unnest DynamoDB JSON format #19

Open borgoat opened 7 months ago

borgoat commented 7 months ago

Hello DuckDB team!

I'm not sure if this repo is the best place to discuss this, since it's not something that strictly depends on the AWS SDK. AFAIK Dynamo unmarshaling is not in the C++ SDK at least^1. However, it is quite related to AWS usage in general...

I'm trying to run some ETL job on a DynamoDB table previously exported to S3^2 as NDJSON files. DuckDB will happily read it, but then I'm stuck with this fairly impractical schema generated by DynamoDB.

D select PK, SK, __typename, isMarketOpen from main limit 5;
┌────────────────────────────────────────────────────┬──────────────────────────────────────────────┬──────────────────────────────┬──────────────────────┐
│                         PK                         │                      SK                      │          __typename          │     isMarketOpen     │
│                 struct(s varchar)                  │              struct(s varchar)               │      struct(s varchar)       │ struct(bool boolean) │
├────────────────────────────────────────────────────┼──────────────────────────────────────────────┼──────────────────────────────┼──────────────────────┤
│ {'S': OBJECT#e036c748-61bb-4431-96da-eeb340c96ec7} │ {'S': QUOTE#NA9#XETR}                        │ {'S': Quote}                 │ {'BOOL': true}       │
│ {'S': USER#Google_101534313370849411321}           │ {'S': #PROFILE#Google_101534313370849411321} │ {'S': User}                  │                      │
│ {'S': OBJECT#ao_01H9QN66R50VDVAMZ79CXRHNE9}        │ {'S': OBJECT}                                │ {'S': ActivityStreamsObject} │                      │
│ {'S': OBJECT#457acd68-8900-4083-92c7-e3891c39648d} │ {'S': OBJECT}                                │ {'S': Stock}                 │                      │
│ {'S': OBJECT#457acd68-8900-4083-92c7-e3891c39648d} │ {'S': QUOTE#DEFAULT}                         │ {'S': Quote}                 │ {'BOOL': true}       │
└────────────────────────────────────────────────────┴──────────────────────────────────────────────┴──────────────────────────────┴──────────────────────┘

This will include some maps, lists, and sets that should be recursively unnested into DuckDB structs and lists to process them effectively.

AWS Glue supposedly offers a DynamicFrame transformation to do the same^3 (it doesn't actually work when it comes to nested objects, which is where it's needed the most... but that's a different story)

I think this could be an extension function in DuckDB. What do you think? I'd be happy to work on it if you agree on where it should be implemented.

Thanks! Giorgio

samansmink commented 7 months ago

Hey @borgoat!

This seems like a useful utility function to me, that should live in the AWS extension! If you would like to add it, i can definitely review and merge.