Open gjgarryuan opened 1 year ago
In addition, another finding is that ddb-json numeric fields get converted directly into strings rather than numbers.
Before:
root
|-- Item: struct
| |-- version: struct
| | |-- N: string
After:
root
|-- version: string
Similar issue for me. For instance, the original structure:
root
|-- Item: struct
| |-- Id: struct
| | |-- S: string
| |-- EntityName: struct
| | |-- S: string
| |-- Message: struct
| | |-- M: struct
| | | |-- MessageType: struct
| | | | |-- S: string
| | | |-- MessageData: struct
| | | | |-- M: struct
| | | | | |-- UserName: struct
| | | | | | |-- S: string
| | | | | |-- At: struct
| | | | | | |-- S: string
gets converted into
root
|-- Id: string
|-- EntityName: string
|-- Message: struct
| |-- MessageType: struct
| | |-- S: string
| |-- MessageData: struct
| | |-- M: struct
| | | |-- UserName: struct
| | | | |-- S: string
| | | |-- At: struct
| | | | |-- S: string
It is DynamoDB-dedicated, but it doesn't remove the DynamoDB type indicators (e.g. "S":, "M":) throughout the nested data, only at the first level. It would be great if it removed the type indicators while keeping the structure (or conditionally flattened the data completely).
I'm migrating data from one DynamoDB table to another, and I need to add some additional data during the migration. It seems like I can't use unnest_ddb_json to prepare the data, so I have to write my own function.
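For reference, such a hand-rolled function might look like the minimal sketch below. It is a pure-Python stand-in for boto3's TypeDeserializer that strips the type indicators at every nesting level; the helper name and the sample item are illustrative, not from the thread:

```python
from decimal import Decimal

def deserialize_ddb(value):
    # Each ddb-json attribute value is a dict with exactly one type tag,
    # e.g. {"S": "hello"} or {"M": {...}}; unwrap it recursively.
    (tag, inner), = value.items()
    if tag == "S":
        return inner
    if tag == "N":
        return Decimal(inner)  # DynamoDB sends numbers as strings
    if tag in ("BOOL", "B"):
        return inner
    if tag == "NULL":
        return None
    if tag == "M":
        return {k: deserialize_ddb(v) for k, v in inner.items()}
    if tag == "L":
        return [deserialize_ddb(v) for v in inner]
    raise ValueError(f"unsupported type indicator: {tag}")

# Sample item in ddb-json form, nested two levels deep:
item = {"Id": {"S": "42"}, "Message": {"M": {"At": {"S": "now"}}}}
plain = {k: deserialize_ddb(v) for k, v in item.items()}
# plain == {"Id": "42", "Message": {"At": "now"}}
```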
I'm facing the same problem. Did anyone find a workaround for this issue?
The best way around this is the following sample code:

from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()

def mapping_function(record):
    record = {k: deserializer.deserialize(value=v) for k, v in record['Item'].items()}

dyf_ddb_unnested = dyf_ddb.map(mapping_function, transformation_ctx="unnest_ddb_frame")
Note that you may have to explicitly convert number-type columns to int or float. The boto3 TypeDeserializer will turn numbers into Decimal objects to preserve the accuracy of DynamoDB's number type.
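That Decimal-to-int/float conversion could be done with a small recursive helper like the following sketch (the function name and sample record are illustrative):

```python
from decimal import Decimal

def decimal_to_number(value):
    # Cast Decimal to int when the value is integral, otherwise to float;
    # recurse into dicts and lists so nested attributes are covered too.
    if isinstance(value, Decimal):
        return int(value) if value == value.to_integral_value() else float(value)
    if isinstance(value, dict):
        return {k: decimal_to_number(v) for k, v in value.items()}
    if isinstance(value, list):
        return [decimal_to_number(v) for v in value]
    return value

record = {"version": Decimal("3"), "stats": {"score": Decimal("1.5")}}
converted = decimal_to_number(record)
# converted == {"version": 3, "stats": {"score": 1.5}}
```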
The solution of @Kasra-G was a great hint but didn't work out of the box for me. I needed to call .items() on record['Item'] before iterating, and also to return the new record:
- record = {k: deserializer.deserialize(value=v) for k,v in record['Item']}
+ return {k: deserializer.deserialize(value=v) for (k,v) in record['Item'].items()}
After a lot of googling and a lot of pain trying to convert DynamoDB JSON myself (for some reason, the solution above was giving all NULLs in my case), I found the method simplify_ddb_json() on DynamicFrame; thought it might help others with the conversion.
Here is the link: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-simplify
This can also help @pwrstar, but just pay attention to this point: "If there are several types or types of Map in a type of List, the List elements will not be simplified".
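That caveat can be illustrated with plain data (this is not Glue code, just a sketch of the condition the docs describe; the helper name is made up): a DynamoDB List whose elements carry different type indicators has no single element type to simplify to, so such lists stay in ddb-json form.

```python
# A List whose elements mix type indicators (string and number here)
# cannot be flattened to one element type:
mixed_list = {"L": [{"S": "text"}, {"N": "7"}]}

# A List whose elements share one type indicator can be simplified:
uniform_list = {"L": [{"S": "a"}, {"S": "b"}]}

def is_simplifiable(ddb_list):
    # Collect the type tag of each element; one distinct tag means the
    # list has a uniform element type.
    tags = {next(iter(element)) for element in ddb_list["L"]}
    return len(tags) <= 1

# is_simplifiable(mixed_list) is False; is_simplifiable(uniform_list) is True
```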
Hey, I am working on a Glue job to export DDB data into S3 using the new DDB export connector.
Glue version: 4.0
Language: Python 3
Script:
The schema before unnest_ddb_json is:

and after the unnest:

As you can see above:
- data and account are "partially" ddb-unnested because their items are still in the ddb-json format
- account.activated does not get hoisted

Is this behaviour expected?