delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Add documentation on how to configure delta.logRetentionDuration #2072

Closed. djouallah closed this issue 8 months ago

djouallah commented 10 months ago

Description

I am trying to create a Delta table like this, with log retention limited to 1 day:

from deltalake import DeltaTable
import pyarrow as pa

# storage_options is assumed to be defined earlier (a dict of S3 settings).
dt = DeltaTable.create(
    table_uri='s3://aemo/scada2',
    schema=pa.schema([
        pa.field('SETTLEMENTDATE', pa.timestamp('us')),
        pa.field('DUID', pa.string()),
        pa.field('SCADAVALUE', pa.float64()),
        pa.field('Date', pa.date32()),
        pa.field('week', pa.string()),
        pa.field('file', pa.string()),
    ]),
    mode='error',
    partition_by="week",
    # Requested log retention for this table (the value format is discussed below).
    configuration={"delta.logRetentionDuration": "1 days"},
    storage_options=storage_options,
)

When I run dt.cleanup_metadata(), it still seems to be using 30 days?

ion-elgreco commented 10 months ago

@djouallah does the table contain checkpoints? If not, it doesn't remove any logs, since that could corrupt the table.

djouallah commented 10 months ago

@ion-elgreco it does indeed. Is this specific to Delta Rust?

ion-elgreco commented 10 months ago

@djouallah if you're using 0.15.1, it behaves correctly: it only removes logs up to a checkpoint, based on the logRetentionDuration. Before 0.15.1 it would remove logs based on the logRetentionDuration alone, which could invalidate the table state.
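
To make the rule above concrete, here is a simplified model of checkpoint-bounded cleanup. It is only a sketch of the behavior described in this comment, not the delta-rs implementation: the function name `expired_log_versions` and its inputs are hypothetical, and the real code works on files in `_delta_log`, not on an in-memory dict.

```python
from datetime import datetime, timedelta


def expired_log_versions(commits, checkpoint_versions, retention, now):
    """Return the commit versions a cleanup could safely delete.

    commits: dict mapping commit version -> commit timestamp (hypothetical input).
    checkpoint_versions: versions for which a checkpoint exists.
    retention: timedelta taken from delta.logRetentionDuration.
    """
    # Without a checkpoint the table state can only be rebuilt from the
    # commits themselves, so nothing may be removed.
    if not checkpoint_versions:
        return []
    latest_checkpoint = max(checkpoint_versions)
    cutoff = now - retention
    # A commit is removable only if it is older than the retention cutoff
    # AND strictly precedes the latest checkpoint.
    return sorted(
        version
        for version, timestamp in commits.items()
        if version < latest_checkpoint and timestamp < cutoff
    )
```

Under this model, a table with a 1-day retention but no checkpoint keeps every log entry, which matches the "not removing anything" symptom when no checkpoint has been written yet.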

djouallah commented 10 months ago

@ion-elgreco I am using 0.15.1 and it is not removing anything. Is the format I used correct?

ion-elgreco commented 10 months ago

Ah, the format should be interval <amount> <unit>, so try interval 1 day.

At the same time, can you try interval 1 days? I think at the moment we don't parse the plural version, so this might not work.
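
As an illustration of the "interval <amount> <unit>" shape mentioned above, here is a minimal parsing sketch. `parse_interval`, its `allow_plural` flag, and the unit table are hypothetical and are not the parser delta-rs uses; they only show why "1 days" (no "interval" prefix) and, under a singular-only parser, "interval 1 days" would both be rejected.

```python
import re
from datetime import timedelta

# Hypothetical unit table for this sketch; the units delta-rs accepts may differ.
_UNIT_SECONDS = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 604800,
}

_PATTERN = re.compile(r"^interval\s+(\d+)\s+([a-z]+)$")


def parse_interval(text: str, allow_plural: bool = False) -> timedelta:
    """Parse 'interval <amount> <unit>' into a timedelta."""
    match = _PATTERN.match(text.strip().lower())
    if match is None:
        raise ValueError(f"expected 'interval <amount> <unit>', got {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    # Optionally accept the Spark-style plural by dropping a trailing "s".
    if allow_plural and unit.endswith("s"):
        unit = unit[:-1]
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unknown unit {unit!r} in {text!r}")
    return timedelta(seconds=amount * _UNIT_SECONDS[unit])
```

With the singular form, the table property from the original snippet would read configuration={"delta.logRetentionDuration": "interval 1 day"}.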

djouallah commented 10 months ago

Sorry for being pedantic, but "1 days" is what Spark uses. For compatibility reasons, shouldn't Delta Rust follow the same approach? Is the format part of the Delta protocol?

ion-elgreco commented 10 months ago

@djouallah these things are not part of the protocol. I am aware that the plural version is the only one Spark supports. We can add that soon; it's trivial to add.

djouallah commented 10 months ago

@ion-elgreco I appreciate that you are doing free work, and I am happy with whatever you pick. I was just curious :)