alexey-milovidov opened 4 years ago
Alexey Milovidov, [11.05.20 20:07]
I don't know, otherwise I would have closed it right away :)
Most likely it is a peculiarity of the allocator (jemalloc) and the thread pool.
This query uses two-level aggregation: 256 hash tables, each with roughly 450,000 elements. So each hash table is an allocation that is neither small nor large (a large one would go straight to mmap). jemalloc has a thread cache (like any self-respecting allocator). Because of the thread pool, the query runs on different OS threads, and there end up being more thread caches than needed.
Yes, the hypothesis is correct:
SELECT uniq(arrayJoin(thread_ids))
FROM system.query_log
WHERE (event_time > (now() - 300)) AND (query LIKE '%WatchID%') AND (event_date = today()) AND (type = 2)
┌─uniq(arrayJoin(thread_ids))─┐
│                          92 │
└─────────────────────────────┘
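For reference, the allocator-side overhead (thread caches and other memory jemalloc holds on to) can be observed by comparing jemalloc's resident and allocated numbers. A minimal sketch, assuming the build exposes jemalloc metrics in system.asynchronous_metrics (metric names can differ between versions):
-- resident being much larger than allocated points at memory held by the allocator itself
SELECT metric, formatReadableSize(value) AS size
FROM system.asynchronous_metrics
WHERE metric IN ('jemalloc.allocated', 'jemalloc.resident')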
Threads in ThreadPool should take jobs in preferred order, not randomly. @azat Are you interested in this issue?
@alexey-milovidov Yes, will take a look
Is this issue present on the current 20.3 release with experimental processors turned off? Is there a current workaround?
This is expected to be present in old releases as well, and to be independent of processors.
Is there a current workaround?
@genzgd a workaround would be to disable two-level aggregation with:
group_by_two_level_threshold=0
group_by_two_level_threshold_bytes=0
(but note that this means external GROUP BY, i.e. max_bytes_before_external_group_by, will not work)
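These settings can also be applied per query via a SETTINGS clause; a minimal sketch using the query shape from this issue (table and column names as in the example above):
-- disables two-level aggregation for this query only; expect GROUP BY to be noticeably slower
SELECT WatchID
FROM hits
GROUP BY WatchID
LIMIT 10
SETTINGS group_by_two_level_threshold = 0, group_by_two_level_threshold_bytes = 0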
@azat It will slow down GROUP BY by approximately 10..20 times.
Thanks for the details.
Threads in ThreadPool should take jobs in preferred order, not randomly.
Picking the thread based on the last executed callbacks (a queue of a few elements, to overcome periodic job scheduling) should work; do you have better ideas?
Also, for queries like the one in your example (SELECT ... GROUP BY ... LIMIT <L>, i.e. no ORDER BY and no GROUP BY specials), group_by_overflow_mode=any + max_rows_to_group_by=<L> can be forced (some kind of optimization). Thoughts? (These are just thoughts about such queries in general, not about this particular issue.)
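To make that concrete, a sketch of what forcing those settings could look like for the query from this issue (the limit value 10 mirrors the example; whether this is acceptable depends on the caveat discussed below):
-- with group_by_overflow_mode='any', no new keys are added once max_rows_to_group_by distinct keys are reached,
-- so this only matches the original query under the conditions discussed below
SELECT WatchID
FROM hits
GROUP BY WatchID
LIMIT 10
SETTINGS max_rows_to_group_by = 10, group_by_overflow_mode = 'any'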
I also tried #10956 here and it does not help; I need to take a closer look (yes, it shares the arena, but it looks like the allocations are not big enough to go to mmap directly? And yes, the build is not a debug build of course, so MMAP_THRESHOLD is 64<<20).
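As a rough back-of-the-envelope check of why a single second-level hash table stays under that threshold (the ~16-byte cell size and the power-of-two bucket count are assumptions, not measured values):
-- ~450 000 keys -> roughly 1 048 576 buckets at ~16 bytes each is ~16 MiB, well below MMAP_THRESHOLD (64<<20 = 64 MiB)
SELECT
    formatReadableSize(1048576 * 16) AS approx_hash_table_size,
    formatReadableSize(64 * 1024 * 1024) AS mmap_threshold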
Also, for queries like the one in your example (SELECT ... GROUP BY ... LIMIT <L>, i.e. no ORDER BY and no GROUP BY specials), group_by_overflow_mode=any + max_rows_to_group_by=<L> can be forced (some kind of optimization). Thoughts?
Only if it's executed in a single thread on a single server.
Is there a current workaround?
@genzgd does your original query have the same pattern, i.e. GROUP BY ... LIMIT with no ORDER BY?
If so, then as a workaround you can use one of the following:
group_by_overflow_mode=any + max_rows_to_group_by=<limit> (only if the query is executed in a single thread on a single server)
group_by_two_level_threshold=0 + group_by_two_level_threshold_bytes=0 (disables two-level aggregation, which makes GROUP BY roughly 10..20 times slower)
@azat It will correct the tracked memory amount, but I doubt that RSS will be lower.
@azat It will correct the tracked memory amount, but I doubt that RSS will be lower.
You are right (I'm writing a long reply in the PR).
will lead to sustained memory usage of 26 GiB despite the fact that the query only takes 4..5 GiB.
It was lowered to 13 GiB on current master. Still higher than needed.
RSS is 27 GB on my machine:
clickhouse-benchmark --max_threads 16 <<< "select WatchID from hits group by WatchID limit 10"
The query select WatchID from hits group by WatchID limit 10 will lead to sustained memory usage of 26 GiB despite the fact that the query only takes 4..5 GiB.