eshishki opened 3 months ago
@GavinMar
it seems that it still hangs my machine with a load average of 200
i can't reliably reproduce the issue
another observation is that we were at the node memory limit, and maybe that triggered cache writeback from memory to disk
i tried setting cache memory artificially low so that we write to disk early, but that did not trigger the problem
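by "artificially low" i mean shrinking the memory tier of the datacache in be.conf to roughly the following (these are the datacache_* BE config options as i remember them from the docs, so treat the exact names and the placeholder sizes/paths as an assumption and double-check them for your version):

# keep the in-memory tier small (~2 GB) so blocks get flushed to the disk tier early
datacache_mem_size = 2147483648
# disk tier stays large (~500 GB); size and path here are just placeholders
datacache_disk_size = 536870912000
datacache_disk_path = /data/starrocks/datacache
datacache_enable = true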
examining the logs, i see
W0719 09:40:51.603530 151778 pipeline_driver.cpp:556] begin to cancel operators for query_id=9a47cfea-45b1-11ef-9c5e-02f604b9c98d fragment_id=9a47cfea-45b1-11ef-9c5e-02f604b9c994 driver=driver_9_21, status=READY, operator-chain: [exchange_source_9_0x71ba86565c10(O) -> chunk_accumulate_9_0x71ba86566110(O) -> spillable_hash_join_probe_12_0x71ba2e6c0c10(O)(HashJoiner=0x71ba861db910) -> chunk_accumulate_12_0x71ba86566390(O) -> spillable_hash_join_probe_15_0x71ba2e6c1110(O)(HashJoiner=0x71ba865cc310) -> project_16_0x71b9c3e44c10(O) -> spillable_hash_join_probe_19_0x71ba2e6c1610(O)(HashJoiner=0x71ba865ccd10) -> project_20_0x71b9c42c2b10(O) -> exchange_sink_21_0x71ba2e6c1b10(O)]
maybe it is not the datacache at all but the spill process; the operator chain above is all spillable hash join probes
the disk could be overloaded by read/write iops. if you have system-level monitoring set up, take a look at the disk iops and throughput around the hanging timestamp.
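for a quick check without a full monitoring stack, plain iostat from the sysstat package is enough:

# extended per-device stats every 5 seconds
iostat -x 5

if %util sits near 100% and the await/queue-size columns climb on the cache or spill disk while the query runs, the box is io-bound and the load average spike is mostly threads stuck waiting on io.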
it is definitely related to io, but the only io it could be doing is datacache or spill, since i'm querying an iceberg catalog
i'm just trying to figure out how to keep starrocks from doing io so furiously that it hangs the machine
I'm a heavy user of the iceberg catalog, and I also use this patch https://github.com/StarRocks/starrocks/pull/47778, which can cache iceberg delete files.
When I enabled enable_datacache_async_populate_mode, I got load average spikes that hung the machine. Without enable_datacache_async_populate_mode, starrocks uses 1 thread per 8 cpus for datacache writes and everything is fine.
I would love the async behavior to go through some bounded thread pool, maybe something like the sketch below.
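To make that concrete, here is a rough sketch of what I mean, in plain C++ and not based on the actual StarRocks datacache code (the name BoundedPopulatePool and everything else here is made up for illustration): a fixed-size pool with a bounded queue, so async population can never fan out beyond N writer threads, and when the queue is full the caller skips async population instead of piling up more work.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Bounded pool: at most `num_threads` concurrent cache writes and at most
// `max_queued` pending ones. try_submit() refuses work instead of queueing
// unboundedly, so callers can fall back to "don't populate the cache".
class BoundedPopulatePool {
public:
    BoundedPopulatePool(size_t num_threads, size_t max_queued)
        : max_queued_(max_queued) {
        for (size_t i = 0; i < num_threads; ++i) {
            workers_.emplace_back([this] { run(); });
        }
    }

    ~BoundedPopulatePool() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }

    // Returns false when the queue is full; the caller then skips async
    // population (or writes synchronously) instead of growing the backlog.
    bool try_submit(std::function<void()> populate_task) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            if (stop_ || tasks_.size() >= max_queued_) return false;
            tasks_.push(std::move(populate_task));
        }
        cv_.notify_one();
        return true;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // e.g. write one cached block to disk
        }
    }

    const size_t max_queued_;
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool stop_ = false;
};

The point is the back-pressure in try_submit(): under io pressure the cache population degrades gracefully instead of spawning enough concurrent writes to drive the load average to 200.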