asfimport opened 3 years ago
Gabor Szadovszky / @gszadovszky: There are no statistics/metadata in the parquet specification related to unique values. The value displayed by parquet-tools (after VC) is the value count (the number of values in the related page). So, depending on the actual data, both a value count of 547 and a distinct value count of 421 can be valid at the same time.
I don't know which parquet implementation BigQuery uses. If Spark can read the data from the same file properly, I would suggest creating an issue for BigQuery to investigate why it cannot read that parquet file.
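To illustrate the distinction with Spark, here is a minimal sketch (the file path and column name are placeholders, not taken from this report): the total number of values and the number of distinct values are independent quantities whenever the column contains duplicates.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CountVsDistinct {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("count-vs-distinct").getOrCreate();

    // Placeholder path and column name.
    Dataset<Row> df = spark.read().parquet("gs://bucket/part-00000.parquet");

    // Total value count: this is what VC reports per page (summed over pages), e.g. 547.
    long valueCount = df.select("some_column").count();
    // Distinct value count: this is what COUNT(DISTINCT ...) returns, e.g. 421.
    long distinctCount = df.select("some_column").distinct().count();

    System.out.println("values: " + valueCount + ", distinct: " + distinctCount);
    spark.stop();
  }
}
```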
Dongjoon Hyun / @dongjoon-hyun:
Hi, [~richiesgr]
Does this happen only with Parquet 1.11.0? Did you try Parquet 1.11.1?
Dongjoon Hyun / @dongjoon-hyun: BTW, Spark 3.0/2.4 use Parquet 1.10.1.
Micah Kornfield / @emkornfield: Are you using V2 datapages? BQ doesn't yet support them.
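If the files do turn out to use V2 data pages, a minimal sketch of forcing V1 pages with parquet-mr's Avro writer follows (the schema and output path are placeholders, and whether Secor exposes this setting in its own config is a separate question):

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.hadoop.ParquetWriter;

public class V1PageWriter {
  // Returns a writer that emits V1 data pages; schema and output path are placeholders.
  static ParquetWriter<GenericRecord> openWriter(Schema schema, String out) throws IOException {
    return AvroParquetWriter.<GenericRecord>builder(new Path(out))
        .withSchema(schema)
        .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_1_0) // force V1 data pages
        .build();
  }
}
```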
Richard Grossman: Hi
How can I tell which data page version was used? I get this from parquet-tools:
creator: parquet-mr version 1.11.1 (build 765bd5cd7fdef2af1cecd0755000694b992bfadd)
Does that mean it's V1 or V2 data pages?
Richard Grossman: Hi
Maybe you can help me.
I would like to provide a sample file to Google so they can check why they cannot read the parquet file. Unfortunately, the files contain PII information and cannot be shared as-is.
Is there any way to strip the PII fields from a parquet file so it can be shared with them? (See the sketch below.)
Thanks
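No PII-stripping step was suggested in this thread, but one possible approach, sketched here with Spark (the PII column names are hypothetical), is to rewrite the file without the sensitive columns. Note the caveat: rewriting re-encodes the pages and statistics, so the sanitized copy may no longer reproduce the exact condition BigQuery trips on.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StripPii {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("strip-pii").getOrCreate();

    // "email" and "full_name" are hypothetical PII column names; paths are placeholders.
    Dataset<Row> df = spark.read().parquet("gs://bucket/original.parquet");
    df.drop("email", "full_name")
      .write()
      .parquet("gs://bucket/shareable/");

    spark.stop();
  }
}
```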
Micah Kornfield / @emkornfield: I'm not an expert on the tool. Looking through the source code, I don't think it says anything explicitly. One way to get a hint is to look at the column encodings: if they include delta encodings, it is likely data page V2.
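A sketch of that check using parquet-mr's footer API (the file path comes from the command line): delta encodings such as DELTA_BINARY_PACKED or DELTA_BYTE_ARRAY in the output would be the hint.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class DumpEncodings {
  public static void main(String[] args) throws Exception {
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          // Delta encodings (e.g. DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY) suggest V2 data pages.
          System.out.println(column.getPath() + " -> " + column.getEncodings());
        }
      }
    }
  }
}
```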
Yuming Wang / @wangyum: Could you try disabling parquet.filter.columnindex.enabled?
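For context, that key is the read-side flag checked by parquet-mr's ParquetInputFormat (COLUMN_INDEX_FILTERING_ENABLED). A sketch of disabling it from Spark via the Hadoop configuration (the path is a placeholder):

```java
import org.apache.spark.sql.SparkSession;

public class DisableColumnIndexFiltering {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("no-colindex").getOrCreate();

    // "parquet.filter.columnindex.enabled" is ParquetInputFormat.COLUMN_INDEX_FILTERING_ENABLED.
    spark.sparkContext().hadoopConfiguration()
        .setBoolean("parquet.filter.columnindex.enabled", false);

    spark.read().parquet("gs://bucket/original.parquet").show();
    spark.stop();
  }
}
```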
Hi. I'm trying to write Avro messages to parquet on GCS. These parquet files should be queried by the BigQuery engine, which now supports parquet.
To do this I'm using Secor, a Kafka log persister tool from Pinterest.
At first I didn't notice any problem: using Spark, the same file can be read without any issue; everything works perfectly.
Now, using BigQuery brings an error like this: Error while reading table: , error message: Read less values than expected: Actual: 29333, Expected: 33827. Row group: 0, Column: , File:
After investigating with parquet-tools, I figured out that parquet contains metadata about the total number of unique values for each column, e.g. from parquet-tools: page 0: DLE:BIT_PACKED RLE:BIT_PACKED [more]... CRC:[PAGE CORRUPT] VC:547
So the VC value indicates that the total number of unique values in the file is 547.
Now, when I run a Spark SQL query like SELECT COUNT(DISTINCT column) FROM ... I get 421, meaning this number in the metadata is incorrect.
So what is not a problem for Spark to read is a blocking problem for BigQuery, because it relies on these values and finds them incorrect.
Is there any configuration of the writer that can prevent these errors in the metadata? Or is this normal behavior that shouldn't be a problem?
Thanks
Environment: secor (https://github.com/pinterest/secor)
GCP
Big Query google cloud
Parquet writer 1.11
Reporter: Richard Grossman
Note: This issue was originally created as PARQUET-1946. Please see the migration documentation for further details.