zhztheplayer opened this issue 3 weeks ago
cc @oerling @Yuhta @mbasmanova
Can you put the bloom filter in dynamic filters instead? It does not sound right to put a bloom filter in remaining filter. Where is the bloom filter generated?
> It does not sound right to put a bloom filter in remaining filter.
What's the ideal use case for the remaining filter? I am not very familiar with this part of Velox's design.
> Where is the bloom filter generated?
It's generated by the Spark query planner before the Velox task is created.
The remaining filter is the part of the WHERE clause that cannot be converted to tuple domains (key-value filters). In your case it is probably easier to wrap the bloom filter in a `Filter` object and pass it via `TableScan::addDynamicFilter` before you kick off the task.
A second way is to inherit the compiled expression in `HiveDataSource::setFromDataSource`.
Or make the bloom filter lazily initialized, something like below? Thanks.
```cpp
template <typename T>
struct BloomFilterMightContainFunction {
  VELOX_DEFINE_FUNCTION_TYPES(T);

  using Allocator = std::allocator<uint64_t>;

  void initialize(
      const std::vector<TypePtr>& /*inputTypes*/,
      const core::QueryConfig&,
      const arg_type<Varbinary>* serialized,
      const arg_type<int64_t>*) {
    if (serialized != nullptr) {
      serialized_ = serialized->str();
    }
  }

  FOLLY_ALWAYS_INLINE void
  call(bool& result, const arg_type<Varbinary>&, const int64_t& input) {
    // Deserialize the bloom filter on first call instead of at compile time.
    if (serialized_.has_value()) {
      bloomFilter_.merge(serialized_.value().c_str());
      serialized_ = std::nullopt;
    }
    result = bloomFilter_.isSet()
        ? bloomFilter_.mayContain(folly::hasher<int64_t>()(input))
        : false;
  }

 private:
  BloomFilter<Allocator> bloomFilter_;
  std::optional<std::string> serialized_;
};
```
@zhli1142015 If the expression is the same in all data sources, recompiling it is a waste of CPU.
I see, makes sense. Then how about storing the `DataSource` objects created in preload threads in some collection and reusing them? Thanks.
Reusing them will be tricky. The most straightforward way would be to wrap the bloom filter in a `Filter` object and push it into the dynamic filters instead of modeling it as an expression (that also seems more logically right). Otherwise, keeping it as an expression would require a fairly large change to the connector to factor out the remaining-filter compilation (the connector interface might be hard to change to accommodate this). We are not seeing compilation cost anywhere else, so it's a question whether it is worth doing just for the Spark bloom filter.
Bug description
It's observed in Gluten's use case that `HiveConnector::createDataSource` slows down the data scan when split preload is turned on. In this case a hotspot appeared in filter expression compilation (namely `SimpleExpressionEvaluator::compile`): the filter expression contains bloom filters, so compilation took much longer than usual, since a bloom filter can be slower to compile than other types of expressions.
When split preloading is turned off, the scan time is shortened by ~6x (~30s vs. ~5s). The estimated total split count is ~200K.
Related code:
https://github.com/facebookincubator/velox/blob/3eb9f011f4292f6365203d61b5442d34cab92182/velox/exec/TableScan.cpp#L290-L329
To solve the issue, perhaps the split-preloading procedure could adopt some kind of reuse logic to avoid compiling the expressions every time a split is preloaded.
System information