delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Getting error when converting a partitioned parquet table to delta table #2626

Open AngeloFrigeri opened 2 days ago

AngeloFrigeri commented 2 days ago

Environment

Delta-rs version: 0.18.1

Binding: python 3.8.19

Environment:


Bug

What happened: When converting a partitioned Parquet table to a Delta table, I got the following error:

File ~/Envs/general-p38/lib/python3.8/site-packages/deltalake/writer.py:610, in convert_to_deltalake(uri, mode, partition_by, partition_strategy, name, description, configuration, storage_options, custom_metadata)
    607 if mode == "ignore" and try_get_deltatable(uri, storage_options) is not None:
    608     return
--> 610 _convert_to_deltalake(
    611     str(uri),
    612     partition_by,
    613     partition_strategy,
    614     name,
    615     description,
    616     configuration,
    617     storage_options,
    618     custom_metadata,
    619 )
    620 return

DeltaError: Generic error: The schema of partition columns must be provided to convert a Parquet table to a Delta table

What you expected to happen: To have a Delta log folder created on our S3 path

How to reproduce it:

import pyarrow

from deltalake import convert_to_deltalake

# BUCKET and PREFIX are placeholders for the actual S3 bucket and key prefix
s3_storage_options = {
    "AWS_ACCESS_KEY_ID": "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY": "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}
convert_to_deltalake(
    uri=f"s3://{BUCKET}/{PREFIX}_0.18.1/",
    storage_options=s3_storage_options,
    partition_by=pyarrow.schema(
        [
            pyarrow.field("product", pyarrow.string()),
        ]
    ),
    partition_strategy="hive",
)
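
If the conversion succeeds, the table should be openable with DeltaTable and report "product" as a partition column. A minimal verification sketch, assuming the same placeholder BUCKET, PREFIX, and s3_storage_options as above:

from deltalake import DeltaTable

# Hypothetical check: open the converted table and inspect its partition columns.
dt = DeltaTable(f"s3://{BUCKET}/{PREFIX}_0.18.1/", storage_options=s3_storage_options)
print(dt.metadata().partition_columns)  # expected to contain ["product"]
print(dt.files()[:5])                   # a few of the registered data files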

More details:

sherlockbeard commented 2 days ago

Tried with the code below and was not able to reproduce locally. Maybe it is specific to S3. @AngeloFrigeri, can you check that the partition columns in your code are correct and that the issue isn't there?

import pandas as pd
import pyarrow as pa
from deltalake import convert_to_deltalake

# Write a Parquet table partitioned by the string column "blaaPara"
df = pd.DataFrame(data={'blaaPara': ['a', 'a', 'b'],
                        'year':  [2020, 2020, 2021],
                        'month': [1, 12, 2],
                        'day':   [1, 31, 28],
                        'value': [1000, 2000, 3000]})
df.to_parquet('./mydf', partition_cols=['blaaPara'])

# Convert the partitioned Parquet table in place, passing the partition schema
convert_to_deltalake(
    './mydf',
    partition_by=pa.schema(
        [
            pa.field("blaaPara", pa.string()),
        ]
    ),
    partition_strategy="hive",
)
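
One way to check the partition layout on S3 is to let pyarrow infer the hive partitioning and inspect the resulting schema. A minimal sketch, assuming the same placeholder BUCKET, PREFIX, and credentials as in the original report:

import pyarrow.dataset as ds
import pyarrow.fs as fs

# Hypothetical check: infer hive partitioning from the S3 layout and confirm
# that the partition column ("product") appears in the schema with the expected type.
s3 = fs.S3FileSystem(access_key="AWS_ACCESS_KEY_ID", secret_key="AWS_SECRET_ACCESS_KEY")
dataset = ds.dataset(f"{BUCKET}/{PREFIX}_0.18.1/", filesystem=s3,
                     format="parquet", partitioning="hive")
print(dataset.schema)  # the partition column should be listed here

If the partition column does not show up, or shows up under a different name or type than the one passed to partition_by, that mismatch would be consistent with the error above.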