Closed Xiangyu-C closed 4 years ago
Hi @Xiangyu-C, thanks for reaching out.
I've tried to replicate this issue here, but everything seems to be fine:
import pandas as pd
import awswrangler as wr

df = pd.DataFrame({"c0": [1, 2], "c1": [1, 2], "c2": [1, 2]})
wr.s3.to_parquet(
    df=df,
    path="s3://BUCKET/PREFIX/",
    dataset=True,
    database="default",
    table="test",
    partition_cols=["c1", "c2"],
)
Could you try the snippet above in your environment? Also, could you send me a replicable snippet that exposes your issue?
@igorborgest Thank you for the response. I'd like to correct my statement. When I use the partition_cols param, the table gets created. When queried from Athena, the table looks normal and all the values are correct. But in HUE, all the values appear as null except for the partition columns. Hope this clears up any confusion.
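To make the symptom above concrete (data columns null, partition columns intact), one option is to check which non-partition columns come back entirely empty from each engine. This is only a rough sketch: `all_null_columns` is a hypothetical helper, not part of awswrangler, and the rows below are made-up illustrations of the described behaviour, not output captured from the reporter's cluster. It assumes query results are available as plain row dictionaries with `None` standing for SQL NULL.

```python
from typing import Any, Dict, List, Sequence


def all_null_columns(
    rows: Sequence[Dict[str, Any]], partition_cols: Sequence[str]
) -> List[str]:
    """Return the non-partition columns whose value is None in every row."""
    if not rows:
        return []
    data_cols = [c for c in rows[0] if c not in partition_cols]
    return [c for c in data_cols if all(row.get(c) is None for row in rows)]


# Illustrative rows only: what the Hive/HUE result is described as looking like,
# next to what the Athena result is described as looking like.
hive_rows = [{"c0": None, "c1": 1, "c2": 1}, {"c0": None, "c1": 2, "c2": 2}]
athena_rows = [{"c0": 1, "c1": 1, "c2": 1}, {"c0": 2, "c1": 2, "c2": 2}]

print(all_null_columns(hive_rows, ["c1", "c2"]))
print(all_null_columns(athena_rows, ["c1", "c2"]))
```

Running the same check against both engines' results would show whether the difference is limited to the non-partition columns, which is what the report suggests.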
@Xiangyu-C What are you using to query on HUE? Hive? Presto?
using Hive
Do you have more details about your setup? Is it on EMR? Which EMR release version? Is the Glue Catalog integration enabled on this cluster?
@igorborgest Sorry, here are the details. We are using AWS EMR with HUE enabled. Below are the versions of all the applications on the EMR cluster (r5.4xlarge, 1 master and 2 workers). The Glue Catalog is enabled on this cluster. The table looks normal in HUE if I don't use partition_cols.
Release label: emr-5.24.1
Hadoop distribution: Amazon 2.8.5
Applications: Hue 4.4.0, Tez 0.9.1, Spark 2.4.2, Hive 2.3.4, Presto 0.219, Ganglia 3.7.2, Sqoop 1.4.7, Oozie 5.1.0, Livy 0.6.0, JupyterHub 0.9.6
Thanks a lot @Xiangyu-C, we will troubleshoot it.
Fix released in version 1.9.4.
Describe the bug
When I call wr.s3.to_parquet() with dataset=True and partition_cols specified (two columns in the dataframe are used as partitions), the table gets created in Glue and the parquet file is saved. However, all the values in the table are null except for the two partition columns. Once I remove the partition_cols param, the data looks good in Glue (no null values).

To Reproduce
Use the same dataframe, call wr.s3.to_parquet with and without the partition_cols param, and check the table in Glue to see whether the values are good or not.
awswrangler version: 1.9.3, installed via pip.
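The reproduction steps above could be sketched roughly as follows. This is an assumption-laden illustration rather than the reporter's actual script: the table names `test_flat` and `test_partitioned`, the `bucket_prefix` argument, and both helper functions are hypothetical, and the dataframe is borrowed from the maintainer's snippet. Only `wr.s3.to_parquet` and the parameters already shown in this thread are used.

```python
from typing import Any, Dict, List, Optional


def to_parquet_kwargs(
    path: str, table: str, partition_cols: Optional[List[str]]
) -> Dict[str, Any]:
    """Build the keyword arguments for wr.s3.to_parquet (everything except df)."""
    kwargs: Dict[str, Any] = {
        "path": path,
        "dataset": True,
        "database": "default",
        "table": table,
    }
    if partition_cols:
        # Only pass partition_cols for the partitioned variant.
        kwargs["partition_cols"] = partition_cols
    return kwargs


def write_both_variants(bucket_prefix: str) -> None:
    """Write the same frame with and without partitions (needs AWS credentials)."""
    # Third-party imports are kept inside the function so the helper above
    # can be used without awswrangler installed.
    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({"c0": [1, 2], "c1": [1, 2], "c2": [1, 2]})
    variants = (("test_flat", None), ("test_partitioned", ["c1", "c2"]))
    for table, cols in variants:
        path = f"{bucket_prefix}/{table}/"
        wr.s3.to_parquet(df=df, **to_parquet_kwargs(path, table, cols))
```

Calling `write_both_variants("s3://BUCKET/PREFIX")` against a real bucket would create the two tables, which can then be compared in Athena and in Hive through HUE.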