wr.s3.to_parquet creates table with all null values if partition_cols param is specified

aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

https://aws-sdk-pandas.readthedocs.io

Apache License 2.0

wr.s3.to_parquet creates table with all null values if partition_cols param is specified #397

Closed: Xiangyu-C closed this issue 4 years ago

Xiangyu-C commented 4 years ago

Describe the bug
When I call wr.s3.to_parquet() with dataset=True and partition_cols specified (two columns of the DataFrame are used as partitions), the table is created in Glue and the Parquet file is saved. However, every value in the table is null except for the two partition columns. Once I remove the partition_cols param, the data looks good in Glue (no null values).

To Reproduce
Use the same DataFrame, call wr.s3.to_parquet() with and without the partition_cols param, and check the table in Glue to see whether the values are correct. awswrangler version: 1.9.3, installed via pip.

igorborgest commented 4 years ago

Hi @Xiangyu-C, thanks for reaching out.

I've tried to replicate this issue here, but everything seems to be fine:

import pandas as pd
import awswrangler as wr
df = pd.DataFrame({"c0": [1, 2], "c1": [1, 2], "c2": [1, 2]})
wr.s3.to_parquet(
    df=df,
    path="s3://BUCKET/PREFIX/",
    dataset=True,
    database="default",
    table="test",
    partition_cols=["c1", "c2"]
)

Could you try the snippet above in your environment? Also, could you send me a replicable snippet that exposes your issue?
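For context when inspecting what the snippet writes: a dataset write with partition_cols normally produces Hive-style key=value prefixes, with the partition values carried in the path rather than in the Parquet file body. The sketch below only illustrates that layout using the standard library. `partition_prefix` and `split_rows` are hypothetical helpers, not awswrangler functions, and the file names awswrangler actually generates are not reproduced.

```python
from collections import defaultdict


def partition_prefix(row, partition_cols):
    """Build a Hive-style prefix such as 'c1=1/c2=1/' for one row."""
    return "".join(f"{col}={row[col]}/" for col in partition_cols)


def split_rows(rows, partition_cols):
    """Group rows by partition prefix, dropping partition columns from the body."""
    groups = defaultdict(list)
    for row in rows:
        body = {k: v for k, v in row.items() if k not in partition_cols}
        groups[partition_prefix(row, partition_cols)].append(body)
    return dict(groups)


# Same shape as the DataFrame in the snippet above (hypothetical local stand-in).
rows = [{"c0": 1, "c1": 1, "c2": 1}, {"c0": 2, "c1": 2, "c2": 2}]
layout = split_rows(rows, ["c1", "c2"])
for prefix, body in layout.items():
    print(prefix, body)
```

In this layout a reader takes c1 and c2 from the partition metadata and c0 from the files themselves, so it is worth checking both places when only some columns come back empty. The sketch does not by itself identify a cause.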

Xiangyu-C commented 4 years ago

@igorborgest Thank you for the response. I'd like to correct my statement. When I use the partition_cols param, the table gets created. When queried through Athena, the table looks normal and all the values are correct. But in HUE, all the values appear null except for the partition columns. I hope this clears up any confusion.
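One way to pin down a symptom like this is to fetch the same rows from both engines and list the columns that are entirely null in one but not in the other. This is a minimal sketch under stated assumptions: `all_null_columns` and `engine_diff` are hypothetical helpers, and the rows are made-up stand-ins for query results, not data from this report.

```python
def all_null_columns(rows):
    """Return the set of columns whose value is None in every row."""
    if not rows:
        return set()
    columns = set().union(*(row.keys() for row in rows))
    return {c for c in columns if all(row.get(c) is None for row in rows)}


def engine_diff(reference_rows, suspect_rows):
    """Columns holding data in the reference engine but all null in the other."""
    return sorted(all_null_columns(suspect_rows) - all_null_columns(reference_rows))


# Made-up rows shaped like the symptom described above (illustration only).
athena_rows = [{"c0": 1, "c1": 1, "c2": 1}, {"c0": 2, "c1": 2, "c2": 2}]
hive_rows = [{"c0": None, "c1": 1, "c2": 1}, {"c0": None, "c1": 2, "c2": 2}]
print(engine_diff(athena_rows, hive_rows))
```

With real results, the rows would come from whatever client is used to query Athena and Hive; only the comparison logic is shown here.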

igorborgest commented 4 years ago

@Xiangyu-C What are you using to query on HUE? Hive? Presto?

Xiangyu-C commented 4 years ago

Using Hive.

igorborgest commented 4 years ago

Do you have more details about your setup? Is it on EMR? Which EMR release version? Is the Glue Catalog integration enabled on this cluster?

Xiangyu-C commented 4 years ago

@igorborgest Sorry, here are the details. We are using AWS EMR with HUE enabled. Below are the versions of all the applications on the EMR cluster (r5.4xlarge, 1 master and 2 workers). The Glue Catalog is enabled on this cluster. The table looks normal in HUE if I don't use partition_cols.

Release label: emr-5.24.1
Hadoop distribution: Amazon 2.8.5
Applications: Hue 4.4.0, Tez 0.9.1, Spark 2.4.2, Hive 2.3.4, Presto 0.219, Ganglia 3.7.2, Sqoop 1.4.7, Oozie 5.1.0, Livy 0.6.0, JupyterHub 0.9.6

igorborgest commented 4 years ago

Thanks a lot @Xiangyu-C, we will troubleshoot it.

igorborgest commented 4 years ago

Fix released in version 1.9.4.
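To confirm that an environment already has the fix, the installed version can be compared against 1.9.4. A minimal sketch, assuming plain dotted integer versions such as "1.9.3"; `parse_version` and `has_fix` are hypothetical helpers written for illustration.

```python
def parse_version(text):
    """Turn '1.9.4' into (1, 9, 4); assumes plain dotted integers."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed, fixed_in="1.9.4"):
    """True if the installed version is at or beyond the fixed release."""
    return parse_version(installed) >= parse_version(fixed_in)


print(has_fix("1.9.3"))  # the version from the original report
print(has_fix("1.9.4"))
```

In practice the argument would be `wr.__version__`, and `pip install --upgrade awswrangler` brings an older install up to date.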