Naissant / dendri

Common Healthcare feature engineering algorithms implemented in PySpark.
MIT License

readParquetTable raises ParseException when DataFrame is saved without bucketing/sorting #4

Closed rileyschack closed 3 years ago

rileyschack commented 3 years ago

If you save a DataFrame with saveParquetTable without specifying any bucketing or sort-by columns, readParquetTable raises a ParseException.

df = spark.createDataFrame(
    data=[(1, 2, 3), (1, 2, 4), (2, 3, 1), (3, 3, 1)],
    schema=["col1", "col2", "col3"],
)

df.saveParquetTable(
    table_name="tmp",
    file_path="tmp.parquet",
    partition_cols="col1",
)

spark.readParquetTable(
    table_name="tmp", file_path="tmp.parquet"
)
ParseException: 
mismatched input 'None' expecting INTEGER_VALUE(line 1, pos 149)

== SQL ==
CREATE TABLE IF NOT EXISTS tmp (col1 bigint, col2 bigint, col3 bigint) USING PARQUET PARTITIONED BY (col1) CLUSTERED BY (None) SORTED BY (None) INTO None BUCKETS LOCATION 'tmp.parquet'

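It looks like readParquetTable isn't converting None to "None", while _sql_builder expects a str value for the partition, bucket, and sort-by columns. As a result, the literal text None ends up in the CLUSTERED BY, SORTED BY, and INTO ... BUCKETS clauses of the generated SQL, which Spark cannot parse.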