Open whitleykeith opened 6 months ago
I believe this is an issue with tables that have a column named `path`: the error consistently happens on them, and the DV code appears to define `path` columns of its own:
https://github.com/delta-io/delta/blob/7f199febb84d2c62218fdffbc3a7fe1e48086638/spark/src/main/scala/org/apache/spark/sql/delta/commands/DMLWithDeletionVectorsHelper.scala#L422-L424
Bug
Which Delta project/connector is this regarding?
Describe the problem
TL;DR: Certain MERGE operations with deletion vectors enabled can consistently fail, though more investigation is needed on why these specific MERGEs fail
For context, we have a system that incrementally takes snapshots from upstream JDBC sources and writes them into Delta. This system ultimately creates a DF that looks something like this:
The `_is_delete` column is a temporary column in this DF that marks whether a row is being deleted from the Delta table. This DF is then MERGEd into our existing snapshot table (we would have taken a normal snapshot if the table didn't exist yet), updating or deleting rows as needed. We do this in one MERGE so that each snapshot is a single transaction, and this works well for our tables across the board.
We recently enabled Deletion Vectors for the performance benefits and have since seen sparse but unavoidable errors. The core error is
Reference `filePath` is ambiguous, could be: [`filePath`, `filePath`, `filePath`]
and the stacktrace (pasted below) indicates this happens while building the DV.
We've noticed the following:
- ... `filePath`, but does have one called `path`.
Steps to reproduce
This is the merge command we use:
Observed results
Full failure of the MERGE operation
Expected results
Successful MERGE operation
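For reference, the intended effect of the single MERGE described above can be modelled in plain Python. This is only a rough sketch of the semantics, not Delta's API and not the command from this report: the function name `apply_snapshot`, the `id` key, and the sample rows are hypothetical, and inserting unmatched rows is an assumption about how the snapshots work.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def apply_snapshot(target: List[Row], changes: List[Row], key: str = "id") -> List[Row]:
    """Model one MERGE: update, delete, or insert rows by key (hypothetical helper)."""
    # Index the existing table by its key column.
    merged = {row[key]: row for row in target}
    for change in changes:
        # Strip the temporary flag so it never reaches the table.
        data = {k: v for k, v in change.items() if k != "_is_delete"}
        if change.get("_is_delete", False):
            # Flagged rows are removed; deleting a missing key is a no-op.
            merged.pop(data[key], None)
        else:
            # Matched keys are overwritten; unmatched keys are inserted (assumed).
            merged[data[key]] = data
    return list(merged.values())


# Hypothetical sample data; note the user column named `path`.
table = [{"id": 1, "path": "/a"}, {"id": 2, "path": "/b"}]
batch = [
    {"id": 1, "path": "/a2", "_is_delete": False},  # update
    {"id": 2, "path": "/b", "_is_delete": True},    # delete
    {"id": 3, "path": "/c", "_is_delete": False},   # insert
]
result = apply_snapshot(table, batch)
```

A successful MERGE should leave the table in the state this model produces, whether or not deletion vectors are enabled and whether or not the table has a `path` column.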
Further details
Stacktrace:
Environment information
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?