zhztheplayer opened this issue 3 weeks ago
cc @oerling @Yuhta @mbasmanova
Can you put the bloom filter in dynamic filters instead? It does not sound right to put a bloom filter in remaining filter. Where is the bloom filter generated?
> It does not sound right to put a bloom filter in remaining filter.
What's the ideal use case for the remaining filter? I am not very familiar with this part of Velox's design.
> Where is the bloom filter generated?
It's generated by the Spark query planner before the Velox task is created.
The remaining filter is the part of the WHERE clause that cannot be converted to tuple domains (key-value filters). In your case it is probably easier to wrap the bloom filter in a `Filter` object and pass it via `TableScan::addDynamicFilter` before you kick off the task.
A second way is to inherit the compiled expression in `HiveDataSource::setFromDataSource`.
Or make the bloom filter lazily initialized, something like below? Thanks.
```cpp
template <typename T>
struct BloomFilterMightContainFunction {
  VELOX_DEFINE_FUNCTION_TYPES(T);

  using Allocator = std::allocator<uint64_t>;

  void initialize(
      const std::vector<TypePtr>& /*inputTypes*/,
      const core::QueryConfig&,
      const arg_type<Varbinary>* serialized,
      const arg_type<int64_t>*) {
    if (serialized != nullptr) {
      serialized_ = serialized->str();
    }
  }

  FOLLY_ALWAYS_INLINE void
  call(bool& result, const arg_type<Varbinary>&, const int64_t& input) {
    // Deserialize the bloom filter on first call instead of at compile time.
    if (serialized_.has_value()) {
      bloomFilter_.merge(serialized_.value().c_str());
      serialized_ = std::nullopt;
    }
    result = bloomFilter_.isSet()
        ? bloomFilter_.mayContain(folly::hasher<int64_t>()(input))
        : false;
  }

 private:
  BloomFilter<Allocator> bloomFilter_;
  std::optional<std::string> serialized_;
};
```
@zhli1142015 If the expression is the same in all data sources, recompiling it is a waste of CPU.
I see, makes sense. Then how about storing the `DataSource` objects created in preload threads in some collection and reusing them? Thanks.
Reusing them will be tricky. The most straightforward way would be to wrap the bloom filter in a `Filter` object and push it into the dynamic filters instead of modeling it as an expression (that also seems more logically right). Otherwise, keeping it as an expression would require a fairly large change to the connector to factor out the remaining-filter compilation (the connector interface might be hard to change to accommodate this). We are not seeing compilation cost anywhere else, so it's a question whether it is worth doing just for the Spark bloom filter.
Bug description
It's observed in Gluten's use case that `HiveConnector::createDataSource` slows down the data scan when split preload is turned on. In this case a hotspot appeared in filter expression compilation (namely `SimpleExpressionEvaluator::compile`): the filter expression contains bloom filters, so compilation took much longer than usual, since a bloom filter can be slower to compile than other types of expressions.
When split preloading is turned off, the scan time is shortened by ~6x (~30s vs. ~5s). The estimated total split count is ~200K.
Related code:
https://github.com/facebookincubator/velox/blob/3eb9f011f4292f6365203d61b5442d34cab92182/velox/exec/TableScan.cpp#L290-L329
To solve the issue, perhaps the split-preloading procedure could adopt some kind of reuse logic to avoid compiling the expressions every time a split is preloaded.
System information