bk-mz opened this issue 10 months ago
@bk-mz yes, MOR does not support the parquet native bloom filter, because log files are merged on read, so the native bloom filter is not up to date and therefore not accurate. Only COW, or MOR with read_optimized queries, can use it.
Also, in version 0.14.0 the bloom filter in Hudi is only used on the write path to tag records.
MOR read_optimized can use it.
Can I set spark-sql to use read_optimized to test it out?
@bk-mz yes, set hoodie.datasource.query.type = read_optimized
Okay, so let's compare. For a clean experiment, I created two separate sessions for the queries below.
scala> spark.time({
| val df = spark.read
| .format("org.apache.hudi")
| .option("hoodie.datasource.query.type", "read_optimized")
| .load("s3://path/table/")
|
| val count = df.filter(
| (df("year") === 2024) &&
| (df("month") === 1) &&
| (df("day") === 16) &&
| (df("account_id") === "id1")
| ).count()
|
| println(s"Count: $count")
| })
Count: 47
Time taken: 30477 ms
scala> spark.time({
| val df = spark.read
| .format("org.apache.hudi")
| .option("hoodie.datasource.query.type", "snapshot")
| .load("s3://path/table/")
|
| val count = df.filter(
| (df("year") === 2024) &&
| (df("month") === 1) &&
| (df("day") === 16) &&
| (df("account_id") === "id1")
| ).count()
|
| println(s"Count: $count")
| })
24/01/18 10:06:51 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Count: 47
Time taken: 22594 ms
It's just super confusing as it contradicts the logic: read_optimized actually takes more time to load the same data than snapshot does.
Can we say for sure that the use of parquet native bloom filters is simply not effective for Hudi?
@bk-mz The operating system cache may also have an impact. Can you provide detailed metrics from the Spark UI?
Sure, but anything specific you want to see?
@bk-mz You can check the number of output rows of the scan RDD in the SQL tab of the Spark UI.
For snapshot: 441,483,112 output rows, query time 28141 ms. For read-optimized: 22,887,045 output rows, query time 26054 ms.
scala> spark.time({
| val df = spark.read
| .format("org.apache.hudi")
| .option("hoodie.datasource.query.type", "read_optimized")
| .load("s3://table/")
|
| val count = df.filter(
| (df("year") === 2024) &&
| (df("month") === 1) &&
| (df("day") === 16) &&
| (df("account_id") === "id1")
| ).count()
|
| println(s"Count: $count")
| })
24/01/22 09:05:38 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Count: 47
Time taken: 26054 ms
scala> spark.time({
| val df = spark.read
| .format("org.apache.hudi")
| .option("hoodie.datasource.query.type", "snapshot")
| .load("s3://table/")
|
| val count = df.filter(
| (df("year") === 2024) &&
| (df("month") === 1) &&
| (df("day") === 16) &&
| (df("account_id") === "id1")
| ).count()
|
| println(s"Count: $count")
| })
24/01/22 09:09:03 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Count: 47
Time taken: 28141 ms
Okay, your point stands, the numbers of output rows are indeed different.
Though, how can we explain the similar query times?
@bk-mz Can you see the time cost at this point? We can only analyse the scan RDD; a query spends time in many places besides the scan. I think the result is normal.
For snapshot: WholeStageCodegen (1) duration: total (min, med, max) 13.4 m (79 ms, 1.5 s, 3.4 s).
For read-optimized: WholeStageCodegen (1) duration: total (min, med, max) 6.5 m (249 ms, 552 ms, 5.9 s).
@KnightChess Did I understand you correctly: are you claiming that the bloom filters actually work correctly?
@bk-mz Yes, according to these metrics, they are working.
How can we verify that the difference is not simply caused by the read-optimized vs snapshot read paths themselves, independent of any bloom filters or indexes? I.e. that it isn't caused by the RO reader just reading different files?
There are a variety of factors that lead to differences in query time: IO, CPU, disk load, and so on; on the Spark side, things like parallelism and executor start-up time; on the Hudi side, snapshot reads should theoretically be slower than read-optimized, and they use different readers on different files (RO reads base files, RT reads base + log files). And there is another question: is reading a parquet file with a bloom filter necessarily faster than reading one without? I don't think that is certain; you need to look at the actual effect in production. In a Spark query, a difference of about 2 s cannot explain a slowness problem. What do you think? This is my shallow understanding; maybe others have better opinions.
What do you think?
TBH, a bit of mixed emotions here.
With 0.14 there is practically no way to understand how indexing or statistics affect queries apart from the "number of output rows" metric on the Spark SQL scan node, i.e. whether they are used at all and, if they are, how effectively.
This issue could be closed. From our end we'll move forward with the assumption that indexing and statistical means in Hudi are ineffective, though we'll keep them enabled on our critical fields in case future Hudi releases bring performance improvements.
@bk-mz Why do you think "indexing and statistical means in Hudi are ineffective" when the number of output rows with bloom is clearly a lot less than the number of output rows without bloom? You can also try column stats indexing in this case. That will optimise your read queries.
when the number of output rows with bloom is clearly a lot less than the number of output rows without bloom
@ad1happy2go
The query performance is the same for both the RO and snapshot cases, which is why I'm making that statement. Just having one number smaller than another is cryptic.
You can also try column stats indexing in this case.
As you can see, they are enabled:
hoodie.metadata.index.bloom.filter.column.list=id,account_id
hoodie.metadata.index.bloom.filter.enable=true
hoodie.metadata.index.column.stats.column.list=id,account_id
hoodie.metadata.index.column.stats.enable=true
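For reference, a minimal sketch of how these options are passed on the Hudi write path; the dataframe, path and save mode are placeholders, only the option names and values above come from our setup:

// Write with the metadata bloom filter index and column stats index enabled
// for the id and account_id columns.
df.write
  .format("org.apache.hudi")
  .option("hoodie.metadata.index.bloom.filter.enable", "true")
  .option("hoodie.metadata.index.bloom.filter.column.list", "id,account_id")
  .option("hoodie.metadata.index.column.stats.enable", "true")
  .option("hoodie.metadata.index.column.stats.column.list", "id,account_id")
  .mode("append")
  .save("s3://path/table/")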
My concern with Hudi, and in this ticket specifically, is that today Hudi does not let you introspect and verify that any statistical or indexing feature is actually improving performance.
We can't tie Hudi configurations to actual results; they are logically disconnected, as seen from the queries above.
I.e. I can't say "OK, I removed that configuration and my query started to lag", nor, vice versa, "I added that column to the statistics config and my queries are faster now", because there are no metrics or practical evidence from anywhere to help understand the cause.
Hi @bk-mz. Wanted to add to this thread. Query latency may not be the only metric to measure, as explained in the threads above. The runs with parquet native bloom filters enabled still taking a similar time could be dominated by a few factors: the need to still open every file to load the parquet native bloom filter, S3 throttling, etc.
One way I would try testing this is to remove Hudi from the picture, take the same parquet dataset, and run the query with and without the parquet native bloom filter enabled. You should be able to see the output rows reduced, but the query time may not improve much because each of those files still has to be opened to read its bloom filter.
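For illustration, a minimal sketch of that experiment in the Spark shell; the paths and the account_id column are placeholders, and parquet.bloom.filter.enabled#&lt;column&gt; is the parquet-mr writer property that Spark forwards to the parquet writer:

// Take Hudi out of the picture: write the same dataset twice,
// once with a native bloom filter on account_id and once without.
val src = spark.read.parquet("s3://path/plain-source/")

src.write
  .option("parquet.bloom.filter.enabled#account_id", "true")
  .parquet("s3://path/with-bf/")

src.write
  .parquet("s3://path/without-bf/")

// Run the same point lookup against both copies; compare wall time and the
// "number of output rows" on the scan node in the Spark SQL UI.
def timeCount(path: String): Unit = spark.time {
  val df = spark.read.parquet(path)
  println(df.filter(df("account_id") === "id1").count())
}

timeCount("s3://path/with-bf/")
timeCount("s3://path/without-bf/")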
The column stats in Hudi's metadata table help to reduce the number of files scanned (unlike parquet native bloom filters). With data skipping enabled, Hudi uses the column stats stored in the metadata table instead of scanning the metadata in each parquet file, so it can plan the query better using those stats and the predicates, scanning/reading fewer files when possible (see this blog for more details on data skipping in Hudi). This is particularly helpful on cloud storage, where requests have a constant overhead and are subject to rate limiting.
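As a reference, a minimal hedged sketch of turning data skipping on for the read side (option names as used in Hudi 0.14; the path and predicate are placeholders):

// Read with the metadata table and data skipping enabled, so the column stats
// index can prune files before the parquet scan starts.
val df = spark.read
  .format("org.apache.hudi")
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.enable.data.skipping", "true")
  .load("s3://path/table/")

println(df.filter(df("account_id") === "id1").count())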
You bring valid feedback that we will take and work on: better showcasing the impact of using these indexes so users can easily spot it. Will update you here on how we are incorporating this shortly.
Hi @bk-mz thanks for the interest in parquet bloom filter. We have an open documentation PR about bloom filters which states:
So a bloom filter would be useful in either of these cases (at the parquet file level):
- the column has no duplicates
- the column's number of unique values is more than 40k
I would add that the benefit of a bloom filter comes when the predicate can actually be filtered out by it. If that is the case, then you could also tune the NDV (number of distinct values) to decrease the probability of a false positive match; see the sketch below.
If your column does not fall into one of these cases, then the parquet bloom filter only adds overhead and will slow down a given query.
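To make the NDV tuning concrete, here is a minimal sketch for a plain parquet write from Spark; the column name and the NDV value are placeholders, and parquet.bloom.filter.expected.ndv#&lt;column&gt; is the parquet-mr writer property:

// Enable the native bloom filter and set the expected number of distinct values
// so the filter is sized for a lower false-positive rate.
df.write
  .option("parquet.bloom.filter.enabled#account_id", "true")
  .option("parquet.bloom.filter.expected.ndv#account_id", "1000000") // placeholder NDV
  .parquet("s3://path/with-tuned-bf/")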
There are also benchmarks on the Spark side that could be of interest.
Describe the problem you faced
We encountered an issue with a MOR table that uses metadata bloom filters and Parquet native bloom filters, and has column statistics enabled. When querying data, the system does not seem to use these bloom filters effectively. Instead, every request results in a full partition scan, regardless of the applied filters.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The expected behavior is that querying the column with bloom filters (BF) should be significantly more efficient than querying the column without bloom filters (non-BF).
Environment Description
Additional context
Table write hudi params:
Hadoop parquet properties:
If I download a file from S3 and inspect it with the parquet CLI, it shows that the BF on the column is actually present:
Read part:
In this particular case, table_no_bfs does not have any bloom filters for this day, yet for some reason takes more time than the table with BFs.
Number of rows in the table for this partition:
Spark SQL UI for BF table:
Amount of parquet files in the partition:
Spark SQL UI for Non-BF table:
Amount of parquet files: