StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0

query overloads machine #48590

Open eshishki opened 3 months ago

eshishki commented 3 months ago

I'm a heavy user of the Iceberg catalog, and I also use this patch https://github.com/StarRocks/starrocks/pull/47778, which can cache Iceberg delete files.

When I enabled enable_datacache_async_populate_mode, I got load average spikes that hung the machine. Without enable_datacache_async_populate_mode, StarRocks uses 1 thread per 8 CPUs for datacache writes and everything is fine.

I would love to get async behavior that goes through some thread pool instead.
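
For reference, here is a minimal sketch of how that flag might be toggled back off on a backend. `enable_datacache_async_populate_mode` is the config name from this thread; the `be.conf` path, the default BE HTTP port, and whether this particular key is mutable at runtime are assumptions and may vary by deployment and version.

```sh
# Sketch: turn the async populate mode off again on a BE.
# Persistent: append to the BE config and restart the BE
# (adjust the path to your deployment).
echo "enable_datacache_async_populate_mode = false" >> /opt/starrocks/be/conf/be.conf

# If the key happens to be mutable at runtime, the BE update_config
# endpoint avoids a restart (8040 is the default be_http_port).
curl -XPOST "http://127.0.0.1:8040/api/update_config?enable_datacache_async_populate_mode=false"
```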

kevincai commented 3 months ago

@GavinMar

eshishki commented 3 months ago

It seems that it still hangs my machine, with a load average of 200.

eshishki commented 3 months ago

I cannot reliably reproduce the issue.

Another observation is that we were at the node memory limit, and maybe that triggered cache writeback from memory to disk.

I tried setting the cache memory artificially low so that we write to disk early, but that did not trigger the problem.
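
For anyone trying to repeat that experiment, a rough sketch of the kind of BE setting involved is below. The `datacache_mem_size` key name and value format are assumptions (they differ between StarRocks versions), so verify them against the configuration reference for your release before using them.

```sh
# Sketch: cap the in-memory data cache (here ~1 GB) so writeback from
# memory to disk starts early. Key name and value format are assumptions;
# set in be.conf and restart the BE.
cat >> /opt/starrocks/be/conf/be.conf <<'EOF'
datacache_mem_size = 1073741824
EOF
```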

eshishki commented 3 months ago

Examining the logs, I see:

W0719 09:40:51.603530 151778 pipeline_driver.cpp:556] begin to cancel operators for query_id=9a47cfea-45b1-11ef-9c5e-02f604b9c98d fragment_id=9a47cfea-45b1-11ef-9c5e-02f604b9c994 driver=driver_9_21, status=READY, operator-chain: [exchange_source_9_0x71ba86565c10(O) -> chunk_accumulate_9_0x71ba86566110(O) -> spillable_hash_join_probe_12_0x71ba2e6c0c10(O)(HashJoiner=0x71ba861db910) -> chunk_accumulate_12_0x71ba86566390(O) -> spillable_hash_join_probe_15_0x71ba2e6c1110(O)(HashJoiner=0x71ba865cc310) -> project_16_0x71b9c3e44c10(O) -> spillable_hash_join_probe_19_0x71ba2e6c1610(O)(HashJoiner=0x71ba865ccd10) -> project_20_0x71b9c42c2b10(O) -> exchange_sink_21_0x71ba2e6c1b10(O)]

Maybe it is not the datacache at all; maybe it is the spill process.
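
Since that operator chain is full of `spillable_hash_join_probe` operators, one quick way to test the spill theory is to rerun the same query with spilling disabled for the session and watch whether the IO storm still shows up. `enable_spill` is the session variable I'd expect here, but treat the name and the connection details as assumptions for your version:

```sh
# Sketch: rerun the suspect Iceberg query in a session with spill disabled.
# 9030 is the default FE query port; the variable name is an assumption.
mysql -h <fe_host> -P 9030 -u <user> -p -e "
  SET enable_spill = false;
  -- <rerun the hanging query here>
"
```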

kevincai commented 3 months ago

It could be overloaded by disk r/w IOPS. If you have system-level monitoring set up, take a look at the disk IOPS and throughput around the hanging timestamp.
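
For the system-level view, the standard sysstat tools are usually enough to see whether the disks are saturated around the hang (the intervals below are just examples):

```sh
# Per-device IOPS, throughput, and utilization, refreshed every second.
iostat -x 1

# Per-process disk read/write rates, to attribute the IO to the BE process.
pidstat -d 1
```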

eshishki commented 3 months ago

It is definitely related to IO, but the only IO it could be doing is the data cache or spill, since I'm querying the Iceberg catalog.

I'm just trying to figure out how to keep StarRocks from doing IO so furiously that it hangs the machine.
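
If the goal is simply to stop cache/spill writes from starving the rest of the machine (rather than finding the exact knob inside StarRocks), one blunt but reliable option is to throttle the BE's disk IO at the OS level with cgroup v2. This is a sketch under the assumption that the host uses cgroup v2 with the io controller enabled; the device major:minor numbers, the limits, and the BE PID are placeholders:

```sh
# Sketch: cap write bandwidth/IOPS for the StarRocks BE process via cgroup v2.
# 259:0 is the MAJ:MIN of the data/spill disk (see `lsblk`); limits are examples
# (~200 MB/s writes, 2000 write IOPS). Requires the io controller to be enabled
# in the parent cgroup's cgroup.subtree_control.
sudo mkdir -p /sys/fs/cgroup/starrocks-be
echo "259:0 wbps=209715200 wiops=2000" | sudo tee /sys/fs/cgroup/starrocks-be/io.max

# Move the BE process into the throttled cgroup ("<be_pid>" is a placeholder).
echo "<be_pid>" | sudo tee /sys/fs/cgroup/starrocks-be/cgroup.procs
```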