apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Expected distinct numbers is not parsed correctly #2451

Open asfimport opened 4 years ago

asfimport commented 4 years ago

With the bloom filter feature, when I pass the expected numbers of distinct values as below, I get null values instead of 1000 and 200.


import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();

conf.set("parquet.bloom.filter.column.names", "content,line");
conf.set("parquet.bloom.filter.expected.ndv", "1000,200");

The problem is that the expected distinct numbers are read with Long.getLong(expectedNDVs[i]) (https://github.com/apache/parquet-mr/blob/a737141a571e3cb6cee2c252dc4406e26e6c1177/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L251). Long.getLong looks up a system property with the given name instead of parsing the string itself, so it returns null here.

It's possible to fix it by parsing the string with Long.parseLong(expectedNDVs[i]).
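
To make the difference concrete, here is a minimal standalone sketch using only the JDK. The class and method names (NdvParsingDemo, viaGetLong, viaParseLong) are made up for illustration and are not part of Parquet. It assumes no system property named "1000" or "200" happens to be set.

```java
public class NdvParsingDemo {
    // Mirrors the reported behavior: Long.getLong treats its argument as the
    // NAME of a system property, so a plain numeric string yields null.
    static Long viaGetLong(String value) {
        return Long.getLong(value);
    }

    // The suggested fix: parse the string itself as a decimal long.
    static long viaParseLong(String value) {
        return Long.parseLong(value);
    }

    public static void main(String[] args) {
        // Same comma-separated format as parquet.bloom.filter.expected.ndv above.
        String[] expectedNDVs = "1000,200".split(",");
        for (String ndv : expectedNDVs) {
            System.out.println(ndv
                + " -> getLong: " + viaGetLong(ndv)
                + ", parseLong: " + viaParseLong(ndv));
        }
    }
}
```

Under that assumption, the getLong column should show null for both entries while parseLong returns the numbers.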

 

Reporter: Walid Gara / @garawalid
Assignee: Walid Gara / @garawalid

Note: This issue was originally created as PARQUET-1787. Please see the migration documentation for further details.

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: I'm working on a general concept of allowing configuration to be set for specific columns. See PARQUET-1784 for details. What do you think of having the mentioned configuration as follows?


conf.setBoolean("parquet.bloom.filter.enabled", false); // Might not be required as this is the default
conf.setBoolean("parquet.bloom.filter.enabled#content", true); // Might not be necessary, as by setting the expected ndv you explicitly set this one
conf.setBoolean("parquet.bloom.filter.enabled#line", true); // Might not be necessary, as by setting the expected ndv you explicitly set this one
conf.setLong("parquet.bloom.filter.expected.ndv#content", 1000);
conf.setLong("parquet.bloom.filter.expected.ndv#line", 200);

This might require more typing, but it is clearer and less error-prone.
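
One way to read the proposed key#column convention is as a lookup with a fallback: check the column-specific key first, then the plain key, then a default. The sketch below illustrates only that idea. It uses a plain java.util.Map in place of a Hadoop Configuration so that it is self-contained, and the class and method names (ColumnConfigSketch, resolve) are hypothetical. It is not the PARQUET-1784 implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnConfigSketch {
    // Hypothetical helper: prefer "key#column", then the plain key,
    // then the supplied default.
    static String resolve(Map<String, String> conf, String key,
                          String column, String defaultValue) {
        String specific = conf.get(key + "#" + column);
        if (specific != null) {
            return specific;
        }
        return conf.getOrDefault(key, defaultValue);
    }

    // Builds a map that mirrors the settings proposed in the comment above.
    static Map<String, String> sampleConf() {
        Map<String, String> conf = new HashMap<>();
        conf.put("parquet.bloom.filter.enabled", "false");
        conf.put("parquet.bloom.filter.enabled#content", "true");
        conf.put("parquet.bloom.filter.expected.ndv#content", "1000");
        conf.put("parquet.bloom.filter.expected.ndv#line", "200");
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = sampleConf();
        String enabledKey = "parquet.bloom.filter.enabled";
        String ndvKey = "parquet.bloom.filter.expected.ndv";

        // "content" has its own entry; "other" falls back to the plain key.
        System.out.println("content enabled: " + resolve(conf, enabledKey, "content", "false"));
        System.out.println("other enabled: " + resolve(conf, enabledKey, "other", "false"));
        // Each value is parsed on its own, so no comma-separated list is needed.
        System.out.println("line ndv: " + Long.parseLong(resolve(conf, ndvKey, "line", "0")));
    }
}
```

Because every column carries its own key, there is no positional pairing between a list of column names and a list of values to get wrong.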

asfimport commented 4 years ago

Walid Gara / @garawalid: I left you a comment inside PARQUET-1784. I think it's better to keep the discussion there.