apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0
2.09k stars 913 forks source link

[Bug] When using kyuubi to do data masking, an error is reported #4341

Closed Jackhjf closed 1 year ago

Jackhjf commented 1 year ago

Code of Conduct

Search before asking

Describe the bug

Configure the mask policy on the ranger, and when executing sql through Spark, an error is reported

image

Affects Version(s)

master

Kyuubi Server Log Output

spark-sql> select name from iceberg_spark.test_copy;
Error in query: Resolved attribute(s) name#6 missing from id#5,name#7 in operator !Project [name#6]. Attribute(s) with the same name appear in the operation: name. Please check if the right attribute(s) are used.;
!Project [name#6]
+- SubqueryAlias spark_catalog.iceberg_spark.test_copy
   +- Project [id#5, md5(cast(name#6 as binary)) AS name#7]
      +- RowFilterAndDataMaskingMarker
         +- RelationV2[id#5, name#6] spark_catalog.iceberg_spark.test_copy

### Kyuubi Engine Log Output

_No response_

### Kyuubi Server Configurations

_No response_

### Kyuubi Engine Configurations

```yaml
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension

Additional context

No response

Are you willing to submit PR?

github-actions[bot] commented 1 year ago

Hello @Jackhjf, Thanks for finding the time to report the issue! We really appreciate the community's efforts to improve Apache Kyuubi.

bowenliang123 commented 1 year ago

Duplicated known issue as in https://github.com/apache/kyuubi/issues/4202 when applying column masking on DataSourceV2Relation. Suggested ut and fix in https://github.com/apache/kyuubi/pull/4304, but it's in pending status. More work is required to do.

bowenliang123 commented 1 year ago

Fixed in https://github.com/apache/kyuubi/pull/4358. Please have a check. @Jackhjf

Jackhjf commented 1 year ago

Fixed in #4358. Please have a check. @Jackhjf

Thanks for your work, I can get the correct result with the following use case; @bowenliang123 @yaooqinn

create table yhbi.test_copy(id string,name string) using iceberg ; create table yhbi.test_copy1(id1 string,name1 string) using iceberg ; create table yhbi.test1(id string,name string) using iceberg(name is the data masking field; create table iceberg_spark.test_copy(id string,name string) using iceberg (name is the data masking field);

001

insert into yhbi.test_copy select *from iceberg_spark.test_copy

002

insert into yhbi.test_copy1 select *from iceberg_spark.test_copy

003

insert overwrite yhbi.test_copy1 select id,name as name1 from iceberg_spark.test_copy

004

insert overwrite yhbi.test_copy1 select id ,name as name1 from iceberg_spark.test_copy a where a.id in ("111111","11111");

005

insert overwrite yhbi.test_copy1 select id ,name as name1 from iceberg_spark.test_copy order by id, name;

006

select max(name) from iceberg_spark.test_copy; select min(name) from iceberg_spark.test_copy;

007

insert overwrite yhbi.test_copy1 select id ,count(name) from iceberg_spark.test_copy group by id; insert overwrite yhbi.test_copy1 select count(id) ,name from iceberg_spark.test_copy group by name;

008

insert overwrite yhbi.test_copy1 select b.id,b.name from iceberg_spark.test_copy a join iceberg_spark.test1 b on a.id=b.id;

009 insert overwrite yhbi.test_copy1 select id,name as name1 from iceberg_spark.test_copy union select *from (select a.id ,a.name from iceberg_spark.test1 a join iceberg_spark.test_copy b on a.id=b.id) c;

yaooqinn commented 1 year ago

thanks @Jackhjf for the validation

Jackhjf commented 1 year ago

thanks @Jackhjf for the validation now i found a problem,if I excute this sql: select id ,name from iceberg_spark.test_copy a where a.name like "中%"; I will get an empty result

image image

@yaooqinn

bowenliang123 commented 1 year ago

thanks @Jackhjf for the validation now i found a problem,if I excute this sql: select id ,name from iceberg_spark.test_copy a where a.name like "中%"; I will get an empty result

image image

@yaooqinn

I think this is the exact result that you would expect with the column masking. The masking is properly pushed down to leaf node of scanning the source. And in this case, it prevents reading raw data without masking rules.

yaooqinn commented 1 year ago

You have a mask rule on the column - name -, which will be masked during scan operation, so the operations that follow will only get the masked data.

I know it's a bit counterintuitive. But it is correct. Supposing we don't mask ahead and change your test case to select id ,name from iceberg_spark.test_copy a where a.name ='Jackhjf'. Users get the result md5('Jackhjf'), but everyone knows the datasource contains records for Jackhjf. In such a case, the data is not secured, right?

Jackhjf commented 1 year ago

You have a mask rule on the column - name -, which will be masked during scan operation, so the operations that follow will only get the masked data.

I know it's a bit counterintuitive. But it is correct. Supposing we don't mask ahead and change your test case to select id ,name from iceberg_spark.test_copy a where a.name ='Jackhjf'. Users get the result md5('Jackhjf'), but everyone knows the datasource contains records for Jackhjf. In such a case, the data is not secured, right?

From this point of view, the data is really not safe.thanks