For anyone who finds themselves here:
Here's what worked (my issue was that I was stuck on a typo for hours):
# df is the polars DataFrame being merged; table_path and storage_options
# point at the target Delta table and are defined elsewhere.

# In our use case, if these 3 columns together are unique, we have a match for a merge
identity_columns = ["a", "b", "c"]

# This can be computed elsewhere, depending on your use case.
# The values x and y are both strings.
lookup_partitions = ["x", "y"]

# These follow DataFusion SQL syntax
merge_predicate = " AND ".join([f"s.{i} = t.{i}" for i in identity_columns])
lookup_predicate = " OR ".join([f"t.block_range='{v}'" for v in lookup_partitions])
predicate = f"({lookup_predicate}) AND ({merge_predicate})"

df.write_delta(
    table_path,
    mode="merge",
    storage_options=storage_options,
    delta_merge_options={
        "predicate": predicate,
        "source_alias": "s",
        "target_alias": "t",
    },
).when_matched_update_all().when_not_matched_insert_all().execute()
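To make it easier to see what string actually ends up in the predicate, here is a small standalone sketch of the same construction in plain Python (no polars or deltalake needed). The helper names `sql_literal` and `build_merge_predicate`, the sample rows, and the idea of deriving the partition values from the source rows are my own assumptions for illustration; `block_range` is the partition column from the snippet above.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_merge_predicate(
    identity_columns,
    partition_column,
    partition_values,
    source_alias="s",
    target_alias="t",
):
    """Build '(partition filter) AND (identity match)' as one predicate string.

    Hypothetical helper: it only assembles text and does not talk to delta-rs.
    """
    if not identity_columns or not partition_values:
        raise ValueError("need at least one identity column and one partition value")

    # Rows match when every identity column is equal in source and target.
    merge_predicate = " AND ".join(
        f"{source_alias}.{col} = {target_alias}.{col}" for col in identity_columns
    )
    # Restrict the target side to the listed partition values.
    lookup_predicate = " OR ".join(
        f"{target_alias}.{partition_column} = {sql_literal(v)}"
        for v in partition_values
    )
    return f"({lookup_predicate}) AND ({merge_predicate})"


# Made-up source rows; in practice these values would come from the data being merged.
source_rows = [
    {"a": 1, "b": 2, "c": 3, "block_range": "x"},
    {"a": 4, "b": 5, "c": 6, "block_range": "y"},
    {"a": 7, "b": 8, "c": 9, "block_range": "x"},
]

# Distinct partition values touched by the source, sorted for a stable predicate.
partition_values = sorted({row["block_range"] for row in source_rows})

predicate = build_merge_predicate(["a", "b", "c"], "block_range", partition_values)
print(predicate)
```

Printing the predicate before running the merge is a cheap way to catch the kind of typo mentioned above, since a misspelled alias or column name is obvious in the final string.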
I was trying to merge into a large table. Every time I try to merge, it loads the entire table into memory, and given how Python + polars and the Delta table merge work, we are already using a fair amount of memory.
I think the caching logic mentioned here is still not applied in delta-rs.
Apart from these optimizations which are about memory release and allocation, I am facing an issue related to data loading into polars from delta lake created using delta-rs. Following is the issue description:
I want to merge into a delta table, but I can't seem to find a way to specify the partition keys it should use to look up the table. Should it be part of the predicate (using datafusion syntax) in delta_merge_options when using write_delta? When reading with polars we can pass pyarrow_options for both read_delta and scan_delta.
I am posting more of a polars-related question here, but it translates directly to the delta-rs Python API as well, so I think the question is still relevant: I can't find a way to specify partitions in the table merger docs either: https://delta-io.github.io/delta-rs/api/delta_table/delta_table_merger/
From this discussion, @ion-elgreco mentions that merge is done in rust so we can't pass it to pyarrow.
I'll update this issue with more info related to this.
Please let me know if any more info is needed here.