ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0

Memory usage is larger than needed. #10818

Open alexey-milovidov opened 4 years ago

alexey-milovidov commented 4 years ago

The query

clickhouse-benchmark --max_threads 16 <<< "select WatchID from hits_100m_obfuscated group by WatchID limit 10"

will lead to sustained memory usage of 26 GiB, despite the fact that the query only needs 4..5 GiB.

alexey-milovidov commented 4 years ago
Alexey Milovidov, [11.05.20 20:07]
I don't know, otherwise I would have closed it right away :)

Most likely it is a peculiarity of the allocator (jemalloc) and the thread pool.

This query uses two-level aggregation: 256 hash tables, each with roughly 450,000 elements. So each hash table is an allocation that is neither small nor large (a large one would go straight to mmap). jemalloc has a thread cache (like any self-respecting allocator). Because of the thread pool, the query runs on different OS threads, and there end up being more thread caches than needed.
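A back-of-the-envelope sketch of why each of those 256 hash tables lands in this "medium" allocation range. The sizes below are assumptions for illustration (8-byte UInt64 keys, open addressing with a 0.5 max load factor), not measurements from ClickHouse:

```python
# Rough estimate of one of the 256 two-level hash tables described above.
# Assumptions (not measured): WatchID keys are 8-byte UInt64 values, the
# table uses open addressing with a max load factor of 0.5, and capacity
# grows in powers of two. MMAP_THRESHOLD (64 MiB) is quoted later in this
# thread.
ELEMENTS_PER_TABLE = 450_000
KEY_BYTES = 8
MAX_LOAD_FACTOR = 0.5
MMAP_THRESHOLD = 64 << 20  # 64 MiB

capacity = 1
while capacity * MAX_LOAD_FACTOR < ELEMENTS_PER_TABLE:
    capacity *= 2  # each resize doubles the table

per_table_bytes = capacity * KEY_BYTES
print(per_table_bytes >> 20, "MiB per table")  # 8 MiB under these assumptions
# Too big to be a "small" allocation, too small to bypass the allocator
# via mmap -- so it is served from (and cached in) per-thread caches.
```

Under these assumptions each table is about 8 MiB, well below the 64 MiB mmap threshold, so every OS thread that ever touched such a table can keep one cached.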
alexey-milovidov commented 4 years ago

Yes, the hypothesis is correct:

SELECT uniq(arrayJoin(thread_ids))
FROM system.query_log
WHERE (event_time > (now() - 300)) AND (query LIKE '%WatchID%') AND (event_date = today()) AND (type = 2)

┌─uniq(arrayJoin(thread_ids))─┐
│                          92 │
└─────────────────────────────┘
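The 92 distinct thread IDs are consistent with a toy model in which each benchmark iteration grabs 16 arbitrary idle threads from a large pool. The pool size and iteration count below are made-up illustrative numbers, not ClickHouse internals:

```python
import random

random.seed(0)            # deterministic toy model
POOL_SIZE = 1000          # assumed ThreadPool size (illustrative)
THREADS_PER_QUERY = 16    # max_threads from the benchmark command
ITERATIONS = 10           # assumed number of benchmark iterations

# OS threads that ran at least one job; each one ends up holding
# a jemalloc thread cache.
touched = set()
for _ in range(ITERATIONS):
    # each iteration runs on 16 threads picked with no preference
    touched.update(random.sample(range(POOL_SIZE), THREADS_PER_QUERY))

print(len(touched))  # far more than 16 threads accumulate caches
```

With no preference for recently used threads, the number of cache-holding threads grows with the number of iterations rather than staying at 16.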
alexey-milovidov commented 4 years ago

Threads in ThreadPool should take jobs in preferred order, not randomly. @azat Are you interested in this issue?

azat commented 4 years ago

@alexey-milovidov Yes, I will take a look

genzgd commented 4 years ago

Is this issue present in the current 2.3 release with experimental_processors turned off? Is there a current workaround?

alexey-milovidov commented 4 years ago

This is expected to be present in old releases as well and to be independent of processors.

azat commented 4 years ago

Is there a current workaround?

@genzgd A workaround is to disable two-level aggregation (but note that this means external GROUP BY, i.e. max_bytes_before_external_group_by, will not work).

alexey-milovidov commented 4 years ago

@azat That will slow down GROUP BY by approximately 10..20 times.

genzgd commented 4 years ago

Thanks for the details.

azat commented 4 years ago

Threads in ThreadPool should take jobs in preferred order, not randomly.

Choosing a thread based on the last executed callbacks (a queue of a few elements, to cope with periodic job scheduling) should work; do you have better ideas?
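A toy model (not ClickHouse's actual ThreadPool code) of the difference between preferring the most recently idle threads and rotating through all idle threads with no preference:

```python
from collections import deque

def distinct_threads(policy, pool_size=1000, per_query=16, queries=100):
    """Count how many OS threads (and hence thread caches) get touched.

    Illustrative parameters only; they do not reflect real server config.
    """
    idle = deque(range(pool_size))
    touched = set()
    for _ in range(queries):
        busy = []
        for _ in range(per_query):
            # "lifo" reuses the most recently returned thread;
            # "fifo" rotates through the whole pool
            t = idle.pop() if policy == "lifo" else idle.popleft()
            busy.append(t)
            touched.add(t)
        idle.extend(busy)  # finished threads return to the tail
    return len(touched)

print(distinct_threads("lifo"))  # 16: the same hot threads every time
print(distinct_threads("fifo"))  # 1000: every thread accumulates a cache
```

Preferring recently active threads keeps the working set of thread caches at exactly the query's concurrency, while unordered pickup eventually touches the whole pool.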

Also, for queries like the one in your example (SELECT ... GROUP BY ... LIMIT <L>, i.e. no ORDER BY and no GROUP BY specials), group_by_overflow_mode=any + max_rows_to_group_by=<L> could be forced as a kind of optimization. Thoughts? (These are just thoughts about such queries in general, not about this particular issue.)

I also tried #10956 here and it does not help; I need to take a closer look. (Yes, it shares the arena, but it looks like the allocations are not big enough to use mmap directly? And yes, the build is not a debug build, so MMAP_THRESHOLD is 64<<20.)

alexey-milovidov commented 4 years ago

Also for queries like in your example (SELECT ... GROUP BY ... LIMIT <L>, i.e. no ORDER BY and GROUP BY specials) group_by_overflow_mode=any + max_rows_to_group_by=<L> can be forced (some kind of optimization), thoughts?

Only if it's executed in a single thread on a single server.

azat commented 4 years ago

Is there a current workaround?

@genzgd does your original query have the same pattern? i.e.

If so, then as a workaround you can use one of the following:

alexey-milovidov commented 4 years ago

@azat It will correct the tracked memory amount, but I doubt that RSS will be lower.

azat commented 4 years ago

@azat It will correct the tracked memory amount, but I doubt that RSS will be lower.

You are right (I'm writing a long reply in the PR)

alexey-milovidov commented 4 years ago

will lead to sustained memory usage of 26 GiB despite the fact that the query only takes 4..5 GiB.

It was lowered to 13 GiB on current master. Still higher than needed.

alexey-milovidov commented 8 months ago

RSS is 27 GB on my machine:

clickhouse-benchmark --max_threads 16 <<< "select WatchID from hits group by WatchID limit 10"