aws-samples / aws-glue-samples

AWS Glue code samples
MIT No Attribution

How to keep the partition structure of the folder after ETL? #11

Closed: veryveryfat closed this issue 4 years ago

veryveryfat commented 6 years ago

I have original files in S3 with a folder structure like:

/data/year=2017/month=1/day=1/origin_files
/data/year=2017/month=1/day=2/origin_files

I use a Glue crawler to create a partitioned table over data as the source for my Glue jobs. Currently, after I use a Glue job to convert the files to ORC, I get: /data_orc/transformed_data_files.orc

Is it possible to keep the same partition structure after the transformation jobs? Like:

/data_orc/year=2017/month=1/day=1/transformed_data_files.orc
/data_orc/year=2017/month=1/day=2/transformed_data_files.orc

It doesn't have to be file-to-file matching, but I hope the transformed data can keep the same partitioned folder structure.

dogenius01 commented 6 years ago

I have the same situation, but I want Parquet. My Glue job does not recognize the partitioned table; it shows all columns except the partitions. Did you find any solution?

wilywily commented 6 years ago

Would you try modifying two parts of the default ETL script?

1. ApplyMapping.apply: add ("partition_0", "string", "partition_0", "string") to the column mappings.
2. glueContext.write_dynamic_frame.from_options: add partitionKeys with your partitions, e.g. (frame = dropnullfields3, connection_type = "s3", connection_options = {"path": $path, "partitionKeys": ["partition_0", "partition_1"]})

The first one makes the DynamicFrame map the partition columns. The second one is a newly added option described in the AWS documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
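To make the effect of partitionKeys concrete, here is a minimal plain-Python sketch (not the Glue API, and the names partition_path and record are illustrative) of how a Hive-style output prefix is derived from the listed keys for each record; the sink writes one folder per distinct key combination:

```python
def partition_path(base, record, partition_keys):
    """Build the Hive-style output prefix (key=value segments) for one record."""
    parts = [f"{k}={record[k]}" for k in partition_keys]
    return "/".join([base.rstrip("/")] + parts)

# One record carrying the partition columns kept by ApplyMapping:
record = {"partition_0": "2017", "partition_1": "1", "value": 42}
print(partition_path("s3://bucket/data_orc/", record, ["partition_0", "partition_1"]))
# s3://bucket/data_orc/partition_0=2017/partition_1=1
```

This is why step 1 matters: if ApplyMapping drops the partition columns, the writer has no values left to build these key=value folders from.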

mohitsax commented 6 years ago

For a more detailed example of how to work with Hive-style S3 partitions in AWS Glue, please refer to:

https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/

ujjwalit commented 6 years ago

@wilywily I am also having the same problem. I tried the steps mentioned in part 2 of the ETL: I added the partition column names in the mapping and added them to partitionKeys, but when I run the job it gets stuck running for a long time. Without partitions the job completes in 2-3 minutes, but with partitions it runs for around 30 minutes without giving any error or log.

nicolasdij commented 6 years ago

Guys, have you tried this one: https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/ ? The section "Writing out partitioned data" seems to have what you're looking for. I haven't tried it myself yet, though.

aahmed-vzw commented 6 years ago

I tried the above link, but it's not generating the partitions correctly. This is what I got:

year=HIVE_DEFAULT_PARTITION/month=HIVE_DEFAULT_PARTITION/day=HIVE_DEFAULT_PARTITION/part-00069-5f49a46d-cc10-424c-a95a-a4a40ad3252a.c000.snappy.orc

Before pushing the data to sink I used the printSchema and show methods just to make sure data is coming properly from data store.

I tried two approaches to write the data into partitions; the second was converting the DynamicFrame into a DataFrame:

val df = applymapping1.toDF()
df.write.mode(SaveMode.Append).format("orc").partitionBy("year", "month", "day").save("")

hakenmt commented 6 years ago

I have this problem as well. My source bucket has path style partitions, and I want to continue to use those properties as partitions after an ApplyMapping, but I end up with the HIVE_DEFAULT_PARTITION as the value in the S3 path.

rkarato commented 5 years ago

Guys, have you tried this one: https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/ ? The section "Writing out partitioned data" seems to have what you're looking for. I haven't tried it myself yet, though.

I got the partitions to be created successfully using:

glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://path/", "partitionKeys": ["year","month","day"]}, format = "parquet")

adarmanto commented 5 years ago

I had the same issue and fixed it by changing the type of the partition field to string.
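For context on the HIVE_DEFAULT_PARTITION symptom reported above: Hive-style writers substitute a default partition name whenever a partition value comes out null, which is what happens when the partition column is dropped or mistyped during the mapping step. A minimal plain-Python sketch (illustrative names, not the Glue implementation):

```python
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def partition_dir(key, value):
    """Format one 'key=value' path segment the way Hive-style writers do:
    a null/missing value falls back to the default partition name."""
    if value is None:
        return f"{key}={HIVE_DEFAULT_PARTITION}"
    return f"{key}={value}"

# A row whose 'year' column was lost (or cast to an incompatible type and
# nulled out) ends up in the default-partition folder:
print(partition_dir("year", None))    # year=__HIVE_DEFAULT_PARTITION__
print(partition_dir("year", "2017"))  # year=2017
```

This matches the fix above: keeping the partition fields as strings avoids a failed cast nulling them out before the write.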

moomindani commented 4 years ago

As already explained, write_dynamic_frame.from_options has a partitionKeys option to keep the partition structure at the output location. See details in this blog post: https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/

Resolving.

ffgarciam commented 1 year ago

Would you try modifying two parts of the default ETL script?

1. ApplyMapping.apply: add ("partition_0", "string", "partition_0", "string") to the column mappings.
2. glueContext.write_dynamic_frame.from_options: add partitionKeys with your partitions, e.g. (frame = dropnullfields3, connection_type = "s3", connection_options = {"path": $path, "partitionKeys": ["partition_0", "partition_1"]})

The first one makes the DynamicFrame map the partition columns. The second one is a newly added option described in the AWS documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html

Your post saved my day. Thanks!

ApplyMapping_node2 = ApplyMapping.apply(
    frame=DataCatalogtable_node,
    mappings=[
        ("partition_0", "string", "partition_0", "string"),
        ("partition_1", "string", "partition_1", "string"),
....