apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Support row filter & column masking in REST spec #10909

Open shohamyamin opened 3 months ago

shohamyamin commented 3 months ago

Feature Request / Improvement

Summary:

We would like to request the addition of a new feature in the Iceberg REST catalog that would allow the catalog to return a row filter expression for a table and a column mask expression for each column.

Rationale:

This feature would enable query engines, such as Trino and Spark, to obtain crucial information from the catalog regarding how to handle requested resources. Specifically, it would inform the engines if any filtering or masking is required when accessing the data, ensuring that sensitive information is appropriately protected and that data access policies are consistently enforced.

Proposed Implementation:

Row Filter Expression: For each table, the REST catalog should be able to return an expression that defines the rows that should be visible to the querying entity.

Column Mask Expression: For each column, the REST catalog should return an expression that defines how the column's data should be masked before it is made available to the query engine.

Benefits:

Consistency Across Engines: By centralizing the row filtering and column masking logic in the catalog, all supported query engines (Trino, Spark, etc.) will handle data access uniformly, reducing the risk of inconsistencies.

Security: This feature enhances data security by ensuring that sensitive data is filtered or masked before being accessed by different query engines.

Simplified Data Governance: It simplifies the enforcement of data governance policies by allowing them to be defined once in the catalog and applied consistently across all query engines.
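To make the proposed implementation concrete, here is a minimal sketch of how an engine-side client might model a policy returned by the catalog. Everything named here is hypothetical: the `row-filter` and `column-masks` keys, the `TableAccessPolicy` class, and the `parse_policy` helper are illustrative assumptions, not part of the current Iceberg REST spec.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TableAccessPolicy:
    """Hypothetical per-table policy as an engine might receive it."""
    # SQL-like predicate limiting visible rows; None means no row filter.
    row_filter: Optional[str] = None
    # Maps a column name to the expression that replaces it when read.
    column_masks: Dict[str, str] = field(default_factory=dict)


def parse_policy(payload: dict) -> TableAccessPolicy:
    """Build a policy from a decoded JSON payload (assumed key names)."""
    return TableAccessPolicy(
        row_filter=payload.get("row-filter"),
        column_masks=dict(payload.get("column-masks", {})),
    )


# Example payload; the shape is an assumption made for illustration only.
example_payload = {
    "row-filter": "region = 'EU'",
    "column-masks": {"ssn": "encrypt(ssn)"},
}
policy = parse_policy(example_payload)
```

A payload with neither key yields a policy with no filter and no masks, which an engine could treat as unrestricted access to the table.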

Conclusion:

Implementing this feature would greatly improve the integration of Iceberg with various query engines by providing a standardized way to enforce data access policies. We believe this would be a valuable addition to the Iceberg ecosystem and would help drive broader adoption of Iceberg as a unified data platform.

Query engine

None

Willingness to contribute

nqvuong1998 commented 3 months ago

cc @nastra

nastra commented 3 months ago

@shohamyamin you might want to take a look at https://iceberg.apache.org/contribute/#what-is-an-improvement-proposal and write up a proposal and then open a DISCUSS thread with this topic on the mailing list

amitgilad3 commented 3 months ago

Hi @shohamyamin, if you want, I am willing to work with you on writing the proposal.

shohamyamin commented 3 months ago

@amitgilad3 That would be great

hereisharish commented 3 months ago

Hi @shohamyamin, @amitgilad3, we have been looking for a similar feature to enforce consistent data access policies for row filters and column masks across query engines. I'd like to collaborate with you on this feature.

amitgilad3 commented 3 months ago

No problem, always happy to have more people help. This is the initial proposal; please take a look and comment. Please review, @hereisharish @shohamyamin.

hereisharish commented 3 months ago

@amitgilad3, it would be nice to have dynamic column masking, similar to what's offered in Trino, Hive, and Spark. This would allow any UDF or function to be applied based on the user or role. For example, a query like SELECT ssn, name FROM tab1 could be dynamically rewritten to SELECT encrypt(ssn), name FROM tab1 based on the user's permissions, enabling more flexible and secure data access. While column masking is primarily aimed at unified data governance, the same mechanism could also be used to rewrite a column with any UDF or function.
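As a rough illustration of the rewrite described above, the sketch below substitutes mask expressions into a SELECT list. The `rewrite_select` function and its parameters are hypothetical; a real engine would rewrite the query plan rather than concatenate strings, so this only shows the intended effect.

```python
from typing import Dict, List, Optional


def rewrite_select(
    table: str,
    columns: List[str],
    column_masks: Dict[str, str],
    row_filter: Optional[str] = None,
) -> str:
    """Return a SELECT where masked columns are replaced by their mask expression."""
    # Use the mask expression when one is defined, otherwise the plain column.
    projected = [column_masks.get(col, col) for col in columns]
    sql = f"SELECT {', '.join(projected)} FROM {table}"
    # Append the row filter, if any, as a WHERE clause.
    if row_filter:
        sql += f" WHERE {row_filter}"
    return sql


# Masks that would be resolved per user or role (assumed to come from the catalog).
rewritten = rewrite_select("tab1", ["ssn", "name"], {"ssn": "encrypt(ssn)"})
print(rewritten)  # SELECT encrypt(ssn), name FROM tab1
```

A user with no masks configured would get the original SELECT ssn, name FROM tab1 back unchanged.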