Open hussein-awala opened 8 months ago
Thanks for reporting @hussein-awala, I'm taking a look. I forget if there's some limitation preventing us from supporting bloom filters on nested fields. At least the test you did shows there is no limitation on the Parquet side.
@amogh-jahagirdar I created https://github.com/apache/iceberg/pull/9902 to test if the bloom filters are added to the files, and they seem to be added for the nested field.
I will try to create my table with different catalogs to check whether it's related to the catalog. It could also be related to how the data is written: these tests use the FileAppender directly, so I will try to write the data through the Spark API in these tests.
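A minimal Spark SQL repro along those lines might look like the following sketch (the catalog, table, and column names here are hypothetical, not from the issue):

```sql
-- Hypothetical repro through the Spark SQL write path (names are made up).
CREATE TABLE my_catalog.db.bloom_test (
  id BIGINT,
  nested STRUCT<inner_id: BIGINT>
) USING iceberg;

-- Nested fields are addressed with a dotted path, as elsewhere in this thread.
ALTER TABLE my_catalog.db.bloom_test SET TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.id' = 'true',
  'write.parquet.bloom-filter-enabled.column.nested.inner_id' = 'true'
);

-- Write through Spark rather than the FileAppender used in the unit tests.
INSERT INTO my_catalog.db.bloom_test
SELECT id, named_struct('inner_id', id) FROM range(1000);
```

The resulting data files can then be inspected with parquet-cli as shown later in the thread.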
Sounds great! I'll take a look at the PR.
I spent some time debugging this today and added a test in the same class to try to repro, and I saw the same thing. In ColumnWriteStoreBase we are actually initializing the Parquet BloomFilterWriter with a valid bloom filter for the nested type, so the table property for nested types is being passed through correctly.
As you said, maybe the Spark API goes through a different path which somehow ends up losing the configuration. I doubt the catalog changes anything, since this is more about the write path, but feel free to go ahead and try it out.
This is promising in the sense that we already support this (our Parquet dependency supports it, etc.); we just need to identify why the Spark API (or whatever mechanism was used in the issue description) does not write the bloom filter for nested types.
Also, curious: which Spark version are you using? I just tested via Spark 3.4 and Spark 3.5, and bloom filters for a nested type appear to be written out based on the parquet-cli output (just a struct with a single integer field).
I use Spark 3.5, Iceberg 1.4.3, and Glue Catalog.
and bloom filters for nested type appear to be written out
Interesting, I will retest it on Monday morning.
@amogh-jahagirdar Today I found out that I have this issue on a single table only. I tried with nested and root fields, and with single and multiple bloom filters, and none of them worked. This table contains a large number of columns (over 100); I don't know yet whether this is related to the issue. I will continue my investigation and update the issue once I find its source.
I think #9902 is ready to merge.
Hello @hussein-awala , if you're testing with a relatively small table with a small number of distinct values, Spark may be using dictionary encoding for the values. We have discovered in our testing that if Spark is able to dictionary encode the values in the parquet file, it will not write the bloom filter (which is by design).
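The dictionary-encoding interaction can be illustrated with a toy model (plain Python, not Parquet's actual split-block Bloom filter implementation): once every distinct value fits in the dictionary, the dictionary already answers membership queries exactly, so a probabilistic filter would add nothing.

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter; NOT Parquet's split-block variant."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0
    def _positions(self, value):
        # Derive k hash positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits
    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos
    def might_contain(self, value):
        return all(self.bits >> pos & 1 for pos in self._positions(value))

# Low-cardinality column: the dictionary holds every distinct value,
# so it already gives an exact membership answer.
values = ["red", "green", "blue"] * 1000
dictionary = set(values)          # what dictionary encoding effectively stores
bf = BloomFilter()
for v in dictionary:
    bf.add(v)

assert "red" in dictionary and bf.might_contain("red")
assert "magenta" not in dictionary          # dictionary: exact "no"
bf.might_contain("magenta")                 # Bloom filter: only "maybe"/"no"
```

If you still want the bloom filter written during testing, one approach (an assumption worth verifying, not something confirmed in this thread) is to raise the column's cardinality, or shrink the dictionary page via Iceberg's `write.parquet.dict-size-bytes` table property so the writer falls back from dictionary encoding.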
@hussein-awala which tool are you using to inspect the parquet files? You mentioned parquet-cli, but a Google search leads to https://github.com/chhantyal/parquet-cli and/or https://pypi.org/project/parquet-cli/, which do not seem to offer the same API.
Plus, are you guys aware of any document that describes bloom filter support for Map/Struct types? It's just a question; I'm not in the loop on this issue.
I've tested on the latest main branch, and the Parquet bloom filter works with nested fields. Checked with parquet-cli:
> alias parquet="java -cp 'target/parquet-cli-1.13.1.jar:target/dependency/*' org.apache.parquet.cli.Main"
> parquet bloom-filter ~/Downloads/test.parquet -c id_nested.nested_id -v 30
Row group 0:
--------------------------------------------------------------------------------
value 30 maybe exists.
Row group 1:
--------------------------------------------------------------------------------
value 30 NOT exists.
Row group 2:
--------------------------------------------------------------------------------
value 30 NOT exists.
❯ parquet bloom-filter ~/Downloads/test.parquet -c id -v 30
Row group 0:
--------------------------------------------------------------------------------
value 30 maybe exists.
Row group 1:
--------------------------------------------------------------------------------
value 30 NOT exists.
Row group 2:
--------------------------------------------------------------------------------
value 30 NOT exists.
I've also been experimenting with Bloom filters, and managed to get it working fairly easily with a nested field:
ALTER TABLE glue_catalog.kafka_archive.test_topic
SET TBLPROPERTIES ('write.parquet.bloom-filter-enabled.column.kafka_metadata.key'='true')
Then, after downloading a sample data file:
$ parquet bloom-filter -c kafka_metadata.key -v foo,4519160c-7d7c-44ae,bar /tmp/00001.parquet
Row group 0:
--------------------------------------------------------------------------------
value foo NOT exists.
value 4519160c-7d7c-44ae maybe exists.
value bar NOT exists.
Apache Iceberg version
1.4.3 (latest release)
Query engine
Spark
Please describe the bug 🐞
I have an Iceberg table, and I want to create two bloom filters: one on a root string column and one on a nested string column in a struct. I've set the properties write.parquet.bloom-filter-enabled.column.a and write.parquet.bloom-filter-enabled.column.b.c to true, and I checked with parquet-cli. However, when I tried with plain Spark and Parquet, it worked without any issue.
Check with parquet-cli