I have the same situation, but I want Parquet. My Glue job does not recognize the partitioned table; it shows all columns except the partitions. Did you find any solution?
Would you try modifying two parts of the default ETL script?

1. ApplyMapping.apply: add ("partition_0", "string", "partition_0", "string") to the column mappings.
2. glueContext.write_dynamic_frame.from_options: add partitionKeys with your partitions to the connection options, e.g. (frame = dropnullfields3, connection_type = "s3", connection_options = {"path": $path, "partitionKeys": ["partition_0", "partition_1"]}).

The first change makes the DynamicFrame carry the partition columns through the mapping; the second uses the partitionKeys option newly added in the AWS documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
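Putting the two changes together, a minimal sketch of a modified default script might look like the following (the database, table, and output path are hypothetical placeholders):

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping, DropNullFields

glueContext = GlueContext(SparkContext.getOrCreate())

# Source: the partitioned table created by the crawler (placeholder names).
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="mydb", table_name="mytable")

# 1. Carry the partition columns through the mapping.
applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[
        ("col0", "string", "col0", "string"),
        ("partition_0", "string", "partition_0", "string"),
        ("partition_1", "string", "partition_1", "string"),
    ])

dropnullfields3 = DropNullFields.apply(frame=applymapping1)

# 2. Write with partitionKeys so the output keeps the partition folders.
glueContext.write_dynamic_frame.from_options(
    frame=dropnullfields3,
    connection_type="s3",
    connection_options={
        "path": "s3://output-bucket/data/",
        "partitionKeys": ["partition_0", "partition_1"],
    },
    format="parquet")
```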
For a more detailed example of how to work with Hive-style S3 partitions in AWS Glue, please refer to:
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
@wilywily I am also having the same problem. I tried the steps mentioned in part 2 of the ETL: I added the partition column names in the mapping and added them to partitionKeys, but when I run the job it gets stuck in the running state for a long time. Without partitions the job completes in 2-3 minutes, but with partitions it runs for around 30 minutes without giving any error or log.
Guys, have you tried this one: https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/ ? The section "Writing out partitioned data" seems to have what you're looking for. I haven't tried it myself yet, though.
I tried the above link but it's not generating the partitions correctly. This is what I got:
year=HIVE_DEFAULT_PARTITION/month=HIVE_DEFAULT_PARTITION/day=HIVE_DEFAULT_PARTITION/part-00069-5f49a46d-cc10-424c-a95a-a4a40ad3252a.c000.snappy.orc
Before pushing the data to the sink I used the printSchema and show methods just to make sure the data was coming through properly from the data store.
I tried two approaches to write the data into partitions: first write_dynamic_frame with partitionKeys as suggested above, and second converting the DynamicFrame into a DataFrame:

import org.apache.spark.sql.SaveMode

val df = applymapping1.toDF()
df.write.mode(SaveMode.Append).format("orc").partitionBy("year", "month", "day").save("s3://...")
I have this problem as well. My source bucket has path-style partitions, and I want to keep using those properties as partitions after an ApplyMapping, but I end up with HIVE_DEFAULT_PARTITION as the value in the S3 path.
> Guys, have you tried this one: https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/ ? The section "Writing out partitioned data" seems to have what you're looking for. I haven't tried it myself yet, though.
I got the partitions to be created successfully using:
glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://path/", "partitionKeys": ["year","month","day"]}, format = "parquet")
I had the same issue and fixed it by changing the type of the partition field to string.
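In case it helps others, the cast can be done in the ApplyMapping step itself. A minimal sketch, assuming the crawler typed the partition columns year/month/day as int (the frame and column names are hypothetical):

```python
from awsglue.transforms import ApplyMapping

# Cast the partition columns to string so the writer resolves real values
# instead of falling back to HIVE_DEFAULT_PARTITION.
applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[
        ("payload", "string", "payload", "string"),
        ("year", "int", "year", "string"),
        ("month", "int", "month", "string"),
        ("day", "int", "day", "string"),
    ])
```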
As already explained, write_dynamic_frame.from_options has a partitionKeys option to keep the partition structure at the output location. See the details in this blog post: https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/

Resolving.
> Would you try modifying two parts of the default ETL script? 1. ApplyMapping.apply: add ("partition_0", "string", "partition_0", "string") to the column mappings. 2. glueContext.write_dynamic_frame.from_options: add partitionKeys with your partitions.
> The first change makes the DynamicFrame carry the partition columns through the mapping; the second uses the partitionKeys option newly added in the AWS documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
Your post saved my day, thanks! This is the mapping that worked for me:
ApplyMapping_node2 = ApplyMapping.apply(
    frame=DataCatalogtable_node,
    mappings=[
        ("partition_0", "string", "partition_0", "string"),
        ("partition_1", "string", "partition_1", "string"),
        # ... remaining column mappings
    ],
)
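For completeness, a matching sink node in the same style might look like this (the node name and output path are hypothetical placeholders):

```python
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    connection_options={
        "path": "s3://output-bucket/data/",
        "partitionKeys": ["partition_0", "partition_1"],
    },
    format="parquet",
)
```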
I have the original files in S3 with a folder structure like:

/data/year=2017/month=1/day=1/origin_files
/data/year=2017/month=1/day=2/origin_files

I used a Glue crawler to create a partitioned table data as the source of my Glue jobs. Currently, after I use a Glue job to convert the files to ORC, I get:

/data_orc/transformed_data_files.orc

Is it possible to keep the same partition structure after the transform job? Like:

/data_orc/year=2017/month=1/day=1/transformed_data_files.orc
/data_orc/year=2017/month=1/day=2/transformed_data_files.orc

It doesn't have to be file-to-file matching, but I would like the data to keep the same partition folder structure.