Open asfimport opened 3 years ago
Steve Loughran / @steveloughran: You might want a design that can do the scan on a Spark RDD, where the RDD is simply the deep listFiles(path) scan of the directory tree. This would scale better for a massive dataset than even a parallelised scan in a single process.
I do have an RDD which can do line-by-line work, with locality determined per file, which lets you schedule the work on the HDFS nodes holding the data; unfortunately it needs to be in the o.a.spark package to build: https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala
...that could maybe be added to spark itself.
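The single-process alternative mentioned above can be sketched quite simply: do a deep listing of the directory tree, then fan the per-file checks out over a thread pool. This is only an illustration (the `check` callable is a hypothetical per-file verifier, not part of the proposed tool), and it stands in for the Spark RDD approach when a cluster isn't available:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def list_parquet_files(root):
    """Deep listing of a directory tree -- the single-process analogue
    of feeding listFiles(path) into a Spark RDD."""
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name.endswith(".parquet"):
                yield os.path.join(dirpath, name)

def scan_tree(root, check, max_workers=8):
    """Apply `check` (a hypothetical per-file verifier) to every
    Parquet file under `root` in parallel; return {path: result}."""
    paths = list(list_parquet_files(root))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(check, paths)))
```

On a cluster, `list_parquet_files` would become the driver-side listing and `pool.map(check, paths)` would become a map over the parallelized RDD of paths.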
A tool that verifies encryption of Parquet files in a given folder. It analyzes the footer and then every module (page headers, pages, column indexes, bloom filters), making sure they are encrypted (in the relevant columns), and potentially checking the encryption keys.
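The cheapest first check such a tool can make is the file's trailing magic bytes: per the Parquet format spec, files written in encrypted-footer mode end in `PARE`, while plaintext-footer files (including plaintext-footer encryption mode, where individual columns may still be encrypted) end in `PAR1`. A minimal sketch; verifying page headers, pages, column indexes and bloom filters would additionally require parsing the Thrift footer metadata:

```python
def footer_encryption_mode(path):
    """Classify a Parquet file by its trailing 4-byte magic.

    b"PARE" -> encrypted footer; b"PAR1" -> plaintext footer
    (columns may still be encrypted); anything else -> not Parquet.
    """
    with open(path, "rb") as f:
        f.seek(-4, 2)  # 2 == os.SEEK_END: last four bytes
        magic = f.read(4)
    if magic == b"PARE":
        return "encrypted-footer"
    if magic == b"PAR1":
        return "plaintext-footer"
    return "not-parquet"
```

A `plaintext-footer` result is inconclusive on its own, which is exactly why the tool needs the deeper per-module checks described above.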
We'll start with a design doc, open for discussion.
Reporter: Gidon Gershinsky / @ggershinsky
Assignee: Maya Anderson / @andersonm-ibm
PRs and other links:
Note: This issue was originally created as PARQUET-1989. Please see the migration documentation for further details.