facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.53k stars 1.16k forks source link

Updating histogram metric causes performance issue when spilling as it is updated too frequently #11603

Open HolyLow opened 5 days ago

HolyLow commented 5 days ago

Description

When testing Velox spilling functionality, we found that the velox spilling would have performance issue. Although the spilling has a thread pool up to 8 threads, the actual CPU concurrency is only 1~2. image

Then we use tools like pstack to find what's wrong, and it turns out that all the spilling threads are bottlenecked by prometheusCpp's updateHistogramMetric calling, and the stack is as follows:

Thread 2242 (Thread 0x7f8aee23d640 (LWP 10173)):
#0  0x00007fa72a2a6dc0 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fa72a29f753 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007fa60f5c23e6 in std::__1::mutex::lock() () from /data/shared_libs/libnative_execution.so
#3  0x00007fa60fa31358 in prometheus::Summary::Observe(double) () from /data/shared_libs/libnative_execution.so
#4  0x00007fa60f9fc768 in facebook::presto::prometheus::PrometheusStatsReporter::addHistogramMetricValue(char const*, unsigned long) const () from /data/shared_libs/libnative_execution.so
#5  0x00007fa60f9fc91f in facebook::presto::prometheus::PrometheusStatsReporter::addHistogramMetricValue(folly::Range<char const*>, unsigned long) const () from /data/shared_libs/libnative_execution.so
#6  0x00007fa614234bca in facebook::velox::common::updateGlobalSpillAppendStats(unsigned long, unsigned long) () from /data/shared_libs/libnative_execution.so
#7  0x00007fa613c32177 in facebook::velox::exec::SpillWriter::updateAppendStats(unsigned long, unsigned long) () from /data/shared_libs/libnative_execution.so
#8  0x00007fa613c32083 in facebook::velox::exec::SpillWriter::write(std::__1::shared_ptr<facebook::velox::RowVector> const&, folly::Range<facebook::velox::IndexRange*> const&) () from /data/shared_libs/libnative_execution.so
#9  0x00007fa613c2c119 in facebook::velox::exec::SpillState::appendToPartition(unsigned int, std::__1::shared_ptr<facebook::velox::RowVector> const&) () from /data/shared_libs/libnative_execution.so
#10 0x00007fa613c3662e in facebook::velox::exec::Spiller::writeSpill(int) () from /data/shared_libs/libnative_execution.so
#11 0x00007fa613c3a494 in facebook::velox::AsyncSource<facebook::velox::exec::Spiller::SpillStatus>::runMake(std::__1::function<std::__1::unique_ptr<facebook::velox::exec::Spiller::SpillStatus, std::__1::default_delete<facebook::velox::exec::Spiller::SpillStatus> > ()>&) () from /data/shared_libs/libnative_execution.so
#12 0x00007fa613c39d66 in facebook::velox::AsyncSource<facebook::velox::exec::Spiller::SpillStatus>::move() () from /data/shared_libs/libnative_execution.so
#13 0x00007fa613c36bb2 in facebook::velox::exec::Spiller::runSpill(bool) () from /data/shared_libs/libnative_execution.so
#14 0x00007fa613c372eb in facebook::velox::exec::Spiller::spill(facebook::velox::exec::RowContainerIterator const*) () from /data/shared_libs/libnative_execution.so
#15 0x00007fa613d15a20 in facebook::velox::exec::GroupingSet::spill() () from /data/shared_libs/libnative_execution.so
#16 0x00007fa613d1e1a6 in facebook::velox::exec::HashAggregation::reclaim(unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&) () from /data/shared_libs/libnative_execution.so
#17 0x00007fa613c03534 in long std::__1::__function::__policy_invoker<long ()>::__call_impl[abi:ue170006]<std::__1::__function::__default_alloc_func<facebook::velox::exec::Operator::MemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::$_0, long ()> >(std::__1::__function::__policy_storage const*) () from /data/shared_libs/libnative_execution.so
#18 0x00007fa6142aa382 in facebook::velox::memory::MemoryReclaimer::run(std::__1::function<long ()> const&, facebook::velox::memory::MemoryReclaimer::Stats&) () from /data/shared_libs/libnative_execution.so
#19 0x00007fa613c03104 in facebook::velox::exec::Operator::MemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&) () from /data/shared_libs/libnative_execution.so
#20 0x00007fa613bfec58 in std::__1::unique_ptr<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult, std::__1::default_delete<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult> > std::__1::__function::__policy_invoker<std::__1::unique_ptr<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult, std::__1::default_delete<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult> > ()>::__call_impl[abi:ue170006]<std::__1::__function::__default_alloc_func<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::$_1, std::__1::unique_ptr<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult, std::__1::default_delete<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult> > ()> >(std::__1::__function::__policy_storage const*) () from /data/shared_libs/libnative_execution.so
#21 0x00007fa613bfea24 in facebook::velox::AsyncSource<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult>::runMake(std::__1::function<std::__1::unique_ptr<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult, std::__1::default_delete<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult> > ()>&) () from /data/shared_libs/libnative_execution.so

Turning off premetheusCpp metric collection would solve the problem for sure, but we want to see why this happens.

And after going through the code, we found that the Spiller's code is calling the updateHistogramMetric function too frequently. In Spiller.nextBatch, the code would call appendPartition() every 64 rows or every 256KB: https://github.com/facebookincubator/velox/blob/c13b8ed888aa785bff88d061f5b3bc00e9231961/velox/exec/Spiller.cpp#L459-L477 Function appendToPartition() would call finally try to update the histogramMetric as the stack above, which causes performance issue.

We also notice that the prestissimo also use prometheusCpp as the metricExporter, so we wonder if the prestissimo has encountered the same problem or not. Besides, to solve the problem, we want to know if it's possible for velox to use larger batch size in Spiller code, or to reduce the usage of updateHistogramMetric, especially in critical session (frequent metric updating zone).

I could submit related code patches if the velox community agrees the modification. Thanks a lot.

HolyLow commented 5 days ago

@xiaoxmeng @oerling Could you please take a look at the problem? Thanks a lot.

Yuhta commented 5 days ago

A better solution is to improve PrometheusStatsReporter to use a thread local state instead of global to eliminate contention.