MrPowers / mack

Delta Lake helper methods in PySpark
https://mrpowers.github.io/mack/
MIT License
286 stars 42 forks

Possible deduplication solution that doesn't require a primary key #95

Open MrPowers opened 1 year ago

MrPowers commented 1 year ago

We'll have to translate this to Python:

val duplicates = df
  // Materialize the hidden _metadata columns before selecting,
  // otherwise they are dropped along with the rest of the row
  .withColumn("__file_path", col("_metadata.file_path"))
  .withColumn("__row_index", col("_metadata.row_index"))
  .select(<pk cols>, col("__file_path"), col("__row_index"))
  .withColumn(
    "rank",
    row_number().over(
      Window
        .partitionBy(<pk cols>)
        .orderBy(<pk cols>)))
  .filter("rank > 1")
  .drop("rank")

And then:

// merge lives on DeltaTable, not DataFrame
deltaTable.alias("old")
  .merge(
    duplicates.alias("new"),
    "old.<pk1> = new.<pk1> AND ... AND old.<pkn> = new.<pkn>" +
      " AND old._metadata.file_path = new.__file_path" +
      " AND old._metadata.row_index = new.__row_index")
  .whenMatchedDelete()
  .execute()
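A possible Python translation of the sketch above. This is only a rough sketch, not mack's final API: the names `delete_duplicates_without_pk` and `build_merge_condition` are hypothetical, `pk_cols` stands in for the caller-supplied duplicate-detection columns, and it assumes a delta-spark `DeltaTable` handle for the target table.

```python
def build_merge_condition(pk_cols):
    """Equality predicate over pk_cols plus the file-path / row-index match."""
    clauses = [f"old.{c} = new.{c}" for c in pk_cols]
    clauses.append("old._metadata.file_path = new.__file_path")
    clauses.append("old._metadata.row_index = new.__row_index")
    return " AND ".join(clauses)


def delete_duplicates_without_pk(delta_table, pk_cols):
    """Delete all but one copy of each duplicate row in delta_table.

    delta_table: a delta.tables.DeltaTable (hypothetical parameter name).
    pk_cols: list of column names that define a duplicate.
    """
    # Imported here so build_merge_condition stays importable without Spark.
    from pyspark.sql.functions import col, row_number
    from pyspark.sql.window import Window

    df = delta_table.toDF()
    duplicates = (
        df
        # Materialize the hidden _metadata columns before selecting,
        # otherwise they are dropped along with the rest of the row.
        .withColumn("__file_path", col("_metadata.file_path"))
        .withColumn("__row_index", col("_metadata.row_index"))
        .select([*pk_cols, "__file_path", "__row_index"])
        .withColumn(
            "rank",
            row_number().over(
                Window.partitionBy(pk_cols).orderBy("__file_path", "__row_index")
            ),
        )
        # Keep rank 1 (the survivor); everything else gets deleted.
        .filter("rank > 1")
        .drop("rank")
    )
    (
        delta_table.alias("old")
        .merge(duplicates.alias("new"), build_merge_condition(pk_cols))
        .whenMatchedDelete()
        .execute()
    )
```

Ordering the window by `__file_path` and `__row_index` makes the choice of surviving row deterministic, which ordering by the partition columns alone would not.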
robertkossendey commented 1 year ago

Where is the row_index property documented?

robertkossendey commented 1 year ago

Ahh, found it! ;) https://issues.apache.org/jira/browse/SPARK-37980