delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Feature Request] Set userId, userName, notebook, clusterId in delta history #1259

Closed keen85 closed 2 years ago

keen85 commented 2 years ago

Feature request

Overview

The Delta history schema includes some attributes (userId, userName, notebook, clusterId) that are always NULL for me (Delta Lake 1.1, Spark 3.1).

I'd like to set these attributes manually for write operations.

Motivation

This information would help keep track of changes at a technical level and would improve data lineage.

Further details

I could imagine two ways to implement my feature request:

  1. Implement additional options ("userId", "userName", "notebook", "clusterId") for the Spark writer just like the "userMetadata" option
    df.write.format("delta") \
    .mode("overwrite") \
    .option("userMetadata", "custom metadata") \
    .option("userId", "1337") \
    .option("userName", "Adam") \
    .option("notebook", "adams_notebook.ipynb") \
    .option("clusterId", "application_1234") \
    .save("/tmp/delta/people10m")
  2. Introduce some new spark configurations that are used for setting the attributes in the history schema. So users could set those configs to desired values just before invoking some write action.
    • spark.delta.userId
    • spark.delta.userName
    • spark.delta.notebook
    • spark.delta.clusterId

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

Unfortunately I know nothing about Scala :(

allisonport-db commented 2 years ago

These attributes are remnants from before Delta was open-sourced. We now generally don't want Databricks concepts like these as first-class citizens within Delta.

Instead, you can save these sorts of attributes in the "userMetadata" (likely formatted as json). Let me know what you think.
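As a minimal sketch of that suggestion, the four attributes could be serialized to JSON and passed through the existing "userMetadata" writer option. The helper names `build_user_metadata` and `write_with_lineage` are hypothetical, and the sample values are the ones from the request above.

```python
import json


def build_user_metadata(user_id, user_name, notebook, cluster_id):
    """Serialize the lineage attributes into a single JSON string.

    Hypothetical helper: the key names mirror the history columns,
    but Delta does not interpret them in any way.
    """
    return json.dumps(
        {
            "userId": user_id,
            "userName": user_name,
            "notebook": notebook,
            "clusterId": cluster_id,
        },
        sort_keys=True,  # stable key order makes the string easier to compare
    )


def write_with_lineage(df, path, metadata_json):
    """Write a Spark DataFrame as Delta and attach the JSON to the commit.

    `df` is assumed to be a pyspark.sql.DataFrame; "userMetadata" is the
    writer option already mentioned in this thread.
    """
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("userMetadata", metadata_json)
        .save(path)
    )


payload = build_user_metadata(
    "1337", "Adam", "adams_notebook.ipynb", "application_1234"
)
```

The resulting string then shows up in the userMetadata column of the table history for that commit.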

keen85 commented 2 years ago

Hi @allisonport-db, That should work. However, the handling will not be as easy as it would be with dedicated attributes.

Imagine I'd like to search the history for all changes that were made by one specific notebook. Since userMetadata contains a plain string, you need to parse the JSON string before filtering. It is doable, though.
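A rough sketch of what that filtering could look like, assuming userMetadata holds a JSON object with a "notebook" key (that layout is an assumption, not something Delta defines). The first function works on history rows already collected as dictionaries; the second pushes the JSON parsing into Spark. Both function names are hypothetical.

```python
import json


def commits_by_notebook(history_rows, notebook):
    """Return the history rows whose userMetadata JSON names `notebook`.

    `history_rows` is assumed to be a list of dicts, e.g. obtained with
    [r.asDict() for r in delta_table.history().collect()].
    """
    matches = []
    for row in history_rows:
        raw = row.get("userMetadata")
        if not raw:
            continue  # commit written without userMetadata
        try:
            meta = json.loads(raw)
        except ValueError:
            continue  # userMetadata is free text, not JSON
        if isinstance(meta, dict) and meta.get("notebook") == notebook:
            matches.append(row)
    return matches


def commits_by_notebook_spark(spark, table_path, notebook):
    """Same filter, evaluated by Spark on the history DataFrame."""
    # Imported lazily so the pure-Python helper works without Spark installed.
    from delta.tables import DeltaTable
    from pyspark.sql.functions import col, get_json_object

    history = DeltaTable.forPath(spark, table_path).history()
    return history.where(
        get_json_object(col("userMetadata"), "$.notebook") == notebook
    )
```

Commits whose userMetadata is missing or is not valid JSON are simply skipped by both variants.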

Out of curiosity: are there any plans to actually remove the "deprecated" attributes at some time from the history schema?

zsxwing commented 2 years ago

are there any plans to actually remove the "deprecated" attributes at some time from the history schema?

We don't plan to remove them as that would break compatibility. But we also don't plan to support more features on top of these deprecated attributes.

Closing this as we don't plan to support this.

Falydoor commented 1 year ago

I know this issue is closed, but I'm only interested in the userName column, as I think it would be useful to have for audit purposes.

userMetadata can be used, but it requires an extra config, and it can also be set to the "wrong" user for malicious purposes.
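For reference, a sketch of that workaround, assuming the extra config meant here is the session-level `spark.databricks.delta.commitInfo.userMetadata` setting. The helper `tag_session_with_user` is hypothetical, and `getpass.getuser()` only reports the OS user of the driver process, so nothing stops a caller from passing a different name.

```python
import getpass
import json

# Session config that Delta reads to fill userMetadata on each commit
# (assumed to be the "extra config" referred to above).
USER_METADATA_CONF = "spark.databricks.delta.commitInfo.userMetadata"


def tag_session_with_user(spark, user_name=None):
    """Store the user name as JSON in the session's userMetadata config.

    `spark` is assumed to be a SparkSession. If `user_name` is omitted,
    the OS user running the driver is used.
    """
    user_name = user_name or getpass.getuser()
    payload = json.dumps({"userName": user_name})
    spark.conf.set(USER_METADATA_CONF, payload)
    return payload
```

Every Delta write in that session would then carry the same userMetadata unless a write overrides it with its own "userMetadata" option.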

I believe the change to log the current user is pretty simple: https://github.com/delta-io/delta/blob/d7483ad5a5ad50cafbe74cbe9019be8f9389d8b4/core/src/main/scala/org/apache/spark/sql/delta/actions/actions.scala#L1032-L1035

Instead of returning None, returning Option(System.getProperty("user.name")) should do the trick.

Let me know what you think, I can provide a PR.

Falydoor commented 1 year ago

@zsxwing @allisonport-db Not sure if you saw my previous message.