asfimport opened this issue 1 year ago
Gidon Gershinsky / @ggershinsky: Hmm, looks like this method runs over all columns, projected and not projected: org.apache.parquet.hadoop.ParquetRecordReader.checkDeltaByteArrayProblem(ParquetRecordReader.java:191)
Please check if setting "parquet.split.files" to "false" solves this problem.
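A minimal sketch of applying this workaround in a spark-shell session. This assumes the setting is passed through the Hadoop configuration that Spark hands to the Parquet reader; the data path is a placeholder:

```scala
// Workaround sketch (assumption: "parquet.split.files" is set via the
// Hadoop configuration that Spark passes down to the Parquet input format).
// With splitting disabled, each Parquet file becomes a single split,
// i.e. it is read by a single task.
spark.sparkContext.hadoopConfiguration.set("parquet.split.files", "false")

// Placeholder path; reads now avoid the multi-split code path.
val df = spark.read.parquet("/path/to/encrypted/data")
```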
Vignesh Nageswaran: @ggershinsky thanks sir, it worked. Could you also please help me understand any adverse effects of setting this parameter?
Gidon Gershinsky / @ggershinsky: Welcome.
From the sound of it, this might require each file to be processed by a single thread (instead of a single file being read by multiple threads), which should be OK in typical use cases where one thread/executor reads multiple files anyway. But I'll have a deeper look at this.
Vignesh Nageswaran:
@ggershinsky Sir, could you please let us know whether there will be a permanent fix that does not require setting the parameter parquet.split.files to false?
Gidon Gershinsky / @ggershinsky:
Yep, sorry about the delay. This turned out to be more challenging than I hoped; a fix at the encryption code level would require changes in the format specification. That is a rather big deal, and likely unjustified in this case. The immediate trigger is the checkDeltaByteArrayProblem verification, added 8 years ago to detect encoding irregularities in older files. For some reason this check is done only on files with nested columns, and not on files with regular columns (at least in Spark). Maybe the right thing today is to remove that verification. I'll check with the community.
Gidon Gershinsky / @ggershinsky:
[~Nageswaran]
A couple of updates on this.
We should be able to skip this verification for encrypted files, a pull request is sent to parquet-mr.
Also, I've tried the new Spark 3.4.0 (as is, no modifications) with the Scala test above - no exception was thrown. Probably the updated Spark code bypasses the problematic Parquet read path. Can you check whether Spark 3.4.0 works for your use case?
Vignesh Nageswaran:
@ggershinsky sorry for the late reply. Yes sir, Spark 3.4.0 works without setting the parameter parquet.split.files to false. Thanks for raising a PR to skip the verification for encrypted files.
Hi Team,
While exploring Parquet encryption, I found that if a field in a nested column is encrypted, and I want to read this Parquet directory from another application that does not have the encryption keys, I cannot read even the remaining (non-encrypted) fields of the nested column without the keys.
Example:
In the case class `SquareItem`, the `nestedCol` field is a nested field, and I want to encrypt the `ic` field within it. I also want the footer to be non-encrypted, so that legacy applications can still use the encrypted Parquet file. Encryption is successful; however, when I query the Parquet file using Spark 3.3.0 without any Parquet encryption configuration set up, I cannot query the non-encrypted fields of `nestedCol`, such as `sic`. I was expecting that only the `nestedCol.ic` field would not be queryable.

Reproducer (Spark 3.3.0, using spark-shell):
- Downloaded the file parquet-hadoop-1.12.0-tests.jar and added it to the spark-jars folder.
- Code to create encrypted data.
- Code to read the data, trying to access a non-encrypted nested field by opening a new spark-shell.

As you can see, `nestedCol.sic` is not encrypted, so I was expecting results, but I get the below error.
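For context, a write-side configuration along these lines reproduces the setup described above. This is a sketch, assuming the properties-driven crypto factory and the mock in-memory KMS shipped in the parquet-hadoop tests jar; the key IDs, key material, dataset name, and path are illustrative, not from the original reproducer:

```scala
// Sketch only: assumes parquet-hadoop-1.12.0-tests.jar is on the classpath
// (it provides the mock InMemoryKMS). Class names and property keys follow
// the parquet-mr PropertiesDrivenCryptoFactory conventions.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hc.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
// Illustrative base64-encoded 128-bit master keys.
hc.set("parquet.encryption.key.list",
  "keyA:AAECAwQFBgcICQoLDA0ODw==, keyF:AAECAAECAAECAAECAAECAA==")
// Encrypt only the nested field nestedCol.ic with keyA.
hc.set("parquet.encryption.column.keys", "keyA:nestedCol.ic")
hc.set("parquet.encryption.footer.key", "keyF")
// Keep the footer in plaintext so legacy readers can still open the file.
hc.set("parquet.encryption.plaintext.footer", "true")

// squareDS stands in for the Dataset[SquareItem] from the original reproducer.
// squareDS.write.parquet("/path/to/encrypted/data")
```

The expectation in the report is that a reader without keys could still project `nestedCol.sic`, since only `nestedCol.ic` is covered by `parquet.encryption.column.keys`.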
Reporter: Vignesh Nageswaran
Related issues:
Note: This issue was originally created as PARQUET-2193. Please see the migration documentation for further details.