aws / aws-sdk

Landing page for the AWS SDKs on GitHub
https://aws.amazon.com/tools/

AWS Glue - Null values in partition column #670

Open jmklix opened 6 months ago

jmklix commented 6 months ago

Original discussion: https://github.com/aws/aws-sdk-cpp/discussions/2803

It seems like Glue isn't handling the GetPartitions API correctly when a partition column has a null value. Example below; I am using the AWS CLI for simplicity, which gives the same output as the SDK.

My table data is structured as below in S3

s3://example-bucket/example_table/
├── int_partition_col=null/
│   ├── string_partition_col=null/
│   │   └── data-part-00001.csv
├── int_partition_col=1/
│   ├── string_partition_col=A/
│   │   └── data-part-00002.csv
└── int_partition_col=2/
    ├── string_partition_col=B/
    │   └── data-part-00003.csv
Queries against this table and their results:

> aws glue get-partitions --database-name example_db --table-name example_table --expression "(int_partition_col >= 0)" ->
An error occurred (InvalidStateException) when calling the GetPartitions operation: For input string: "null" is not an integer.

> aws glue get-partitions --database-name example_db --table-name example_table --expression "(string_partition_col is null)" -> Returns empty

> aws glue get-partitions --database-name example_db --table-name example_table --expression "(string_partition_col = 'null')" -> works correctly

So it seems like the null value is being treated as a string literal? But according to the documentation here, IS NULL and similar operators are supported?
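Until this is resolved, one possible workaround is to skip the server-side expression and filter on the client. The following is only a minimal sketch, not a fix for the API behavior. It assumes the literal string "null" marks a missing value (as in the layout above) and that int_partition_col is the first partition key, so its value sits at index 0 of each partition's Values list. The helper names are hypothetical; `glue_client` is expected to be a boto3 Glue client.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional

# Assumption: the literal string "null" marks a missing partition value,
# matching the S3 layout shown in this issue.
NULL_MARKER = "null"


def iter_partitions(glue_client: Any, database: str, table: str) -> Iterator[Dict[str, Any]]:
    """Yield every partition of a table without a server-side expression."""
    paginator = glue_client.get_paginator("get_partitions")
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        yield from page["Partitions"]


def to_int_or_none(raw: str) -> Optional[int]:
    """Convert a raw partition value to int, mapping the null marker to None."""
    return None if raw == NULL_MARKER else int(raw)


def filter_int_at_least(
    partitions: Iterable[Dict[str, Any]], key_index: int, minimum: int
) -> List[Dict[str, Any]]:
    """Keep partitions whose integer value at key_index is >= minimum.

    Partitions holding the null marker are skipped instead of raising.
    """
    kept = []
    for partition in partitions:
        value = to_int_or_none(partition["Values"][key_index])
        if value is not None and value >= minimum:
            kept.append(partition)
    return kept
```

With boto3 this could be called as `filter_int_at_least(iter_partitions(boto3.client("glue"), "example_db", "example_table"), 0, 0)`. Listing every partition can be slow on tables with many partitions, so this trades efficiency for tolerance of null values.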

jmklix commented 6 months ago

P111656246

kambhamvivekshankar commented 3 months ago

I am facing the same issue too. Can this ticket be prioritized? Also, Delta seems to be writing null as __HIVE_DEFAULT_PARTITION__. Can we include native support for this as well?

kambhamvivekshankar commented 3 months ago

__HIVE_DEFAULT_PARTITION__ is a common convention that many query engines support. https://cwiki.apache.org/confluence/display/hive/configuration+properties#ConfigurationProperties-hive.exec.default.partition.name
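To illustrate what handling this marker on the client side might look like, here is a rough sketch that parses a Hive-style partition path and maps null markers to None. The function name is hypothetical. Treating the literal "null" as a marker is an assumption taken from the layout in this issue, while __HIVE_DEFAULT_PARTITION__ is the default value of hive.exec.default.partition.name. URL-encoded characters in partition values are not decoded here.

```python
from typing import Dict, Optional

# Default value of hive.exec.default.partition.name.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

# Assumption: the literal "null" is also treated as a null marker,
# as in the S3 layout from this issue.
NULL_MARKERS = {HIVE_DEFAULT_PARTITION, "null"}


def parse_partition_path(path: str) -> Dict[str, Optional[str]]:
    """Parse 'col=value/col2=value2/...' segments into a dict.

    Segments without '=' (bucket names, file names) are ignored, and
    values equal to a null marker become None.
    """
    result: Dict[str, Optional[str]] = {}
    for segment in path.strip("/").split("/"):
        if "=" not in segment:
            continue
        key, _, value = segment.partition("=")
        result[key] = None if value in NULL_MARKERS else value
    return result
```

A caller could then compare against None instead of matching marker strings in every query.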