Azure / Azure-DataFactory


Improper Deletes in Delta Format #560

Open wsugarman opened 1 year ago

wsugarman commented 1 year ago

I have a simple Data Flow with an inline Delta sink that writes to an ADLS Gen 2 storage account. Before the sink, I use an Alter Row transformation to ensure each row is properly inserted into, updated in, or removed from the table. Unfortunately, when I run the Data Flow as the sole Pipeline Activity, it writes a single parquet file containing all of the rows and records it as an add action in the transaction log. I expected instead to see the rows distributed across multiple parquet files with the appropriate add and remove actions.

In my small example, I was writing into an empty directory and the rows consist of the same key being added and removed multiple times. Am I missing something in my pipeline?

[image]

clintgrove commented 1 year ago

I think a Table Action of "Overwrite" will only write "remove" actions in the Delta log. You say you are getting "add" actions, which is not what I would expect. Perhaps try setting the table action to "None" and see what happens? In my experience, Overwrite is not the right action if you want to keep a history of previous loads and be able to time travel in your Delta table.
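
To compare what each run actually records, it may help to count the actions in the commit files directly. The following is a minimal sketch, not part of the original pipeline: it assumes the table's `_delta_log` directory has been copied somewhere readable as a local path, and `count_delta_actions` is a hypothetical helper name. It relies only on the Delta log layout, where each commit is a JSON file with one action per line, keyed by its type (`add`, `remove`, `commitInfo`, and so on).

```python
import json
from collections import Counter
from pathlib import Path


def count_delta_actions(log_dir):
    """Return {commit file name: Counter of action types} for a Delta log directory.

    Hypothetical helper for illustration; log_dir is assumed to be a local
    copy of the table's _delta_log directory.
    """
    summary = {}
    # Commit files are named by zero-padded version number, so sorting by
    # name walks them in commit order. Checkpoint (.parquet) files are skipped.
    for commit in sorted(Path(log_dir).glob("*.json")):
        actions = Counter()
        with commit.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                # Each line is one JSON object whose top-level key is the action type.
                actions.update(json.loads(line).keys())
        summary[commit.name] = actions
    return summary
```

Calling `count_delta_actions("_delta_log")` after a run with each Table Action setting would show whether a commit contains only `add` entries or a mix of `add` and `remove`.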