apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Deep verification of encrypted files #2583

Open asfimport opened 3 years ago

asfimport commented 3 years ago

A tool that verifies the encryption of Parquet files in a given folder. It analyzes the footer and then every module (page headers, pages, column indexes, bloom filters), making sure each one is encrypted in the relevant columns. It could potentially also check the encryption keys.

We'll start with a design doc, open for discussion.
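As a starting point for that discussion, the footer step can be sketched with nothing more than the file's magic bytes. This is a minimal sketch, not the proposed tool: it relies on the Parquet modular encryption convention that files written in encrypted-footer mode start and end with `PARE`, while files with a plaintext footer use `PAR1`. The function name `footer_mode` and its return labels are hypothetical.

```python
import os

MAGIC_PLAINTEXT_FOOTER = b"PAR1"  # footer is stored in plaintext
MAGIC_ENCRYPTED_FOOTER = b"PARE"  # footer itself is encrypted

# Smallest possible file: leading magic + 4-byte footer length + trailing magic.
MIN_FILE_SIZE = 12


def footer_mode(path):
    """Classify a file by its 4-byte leading and trailing magic (hypothetical helper)."""
    if os.path.getsize(path) < MIN_FILE_SIZE:
        return "not-parquet"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    if head != tail:
        return "not-parquet"
    if head == MAGIC_ENCRYPTED_FOOTER:
        return "encrypted-footer"
    if head == MAGIC_PLAINTEXT_FOOTER:
        return "plaintext-footer"
    return "not-parquet"
```

A `plaintext-footer` result does not mean the file is unencrypted: in plaintext-footer mode individual columns can still be encrypted, so a deep check would have to parse the footer metadata and then inspect each module of the relevant columns.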

Reporter: Gidon Gershinsky / @ggershinsky
Assignee: Maya Anderson / @andersonm-ibm

PRs and other links:

Note: This issue was originally created as PARQUET-1989. Please see the migration documentation for further details.

asfimport commented 1 year ago

Steve Loughran / @steveloughran: You might want a design that can run the scan on a Spark RDD, where the RDD is simply the deep listFiles(path) scan of the directory tree. For a massive dataset, this would scale better than even a parallelised scan in a single process.

I do have an RDD which can do line-by-line work, with locality of work determined on each file, which lets you schedule the work on the relevant HDFS nodes that hold the data. Unfortunately, it needs to be in the o.a.spark package to build: https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala

...that could maybe be added to spark itself.
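For comparison, the single-process baseline mentioned above might look like the following. It is only a sketch under stated assumptions: `deep_list_files` and `scan_tree` are hypothetical names, `check` stands for whatever per-file verification the tool ends up doing, and a local `os.walk` stands in for a deep `listFiles(path)` on a Hadoop filesystem. In a Spark design the same list of paths would be distributed across executors instead of a thread pool.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def deep_list_files(root, suffix=".parquet"):
    """Recursively list files under root whose names end with suffix."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                found.append(os.path.join(dirpath, name))
    return sorted(found)  # sorted so the output order is reproducible


def scan_tree(root, check, workers=8):
    """Apply check(path) to every listed file and return {path: result}."""
    paths = deep_list_files(root)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check, paths))  # map preserves input order
    return dict(zip(paths, results))
```

Because each file is checked independently, the per-file function is the natural unit of work to hand to a cluster; the thread pool here is bounded by one machine's I/O and memory, which is the limitation the RDD approach is meant to remove.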

asfimport commented 7 months ago

Gang Wu / @wgtmac: It looks like this issue has not been active for a long time, so I will move it off the 1.14.0 release target.