victorialogs: query performance for non-existent substring

VictoriaMetrics / VictoriaMetrics

VictoriaMetrics: fast, cost-effective monitoring solution and time series database

https://victoriametrics.com/

Apache License 2.0

victorialogs: query performance for non-existent substring #7233

Open peonqi opened 2 days ago

peonqi commented 2 days ago

Is your question request related to a specific component?

victorialogs

Describe the question in detail

I query 80GB of logs (20 million entries) within one hour, searching for a substring that does not exist. With the following query conditions, it returns 0 records:

 _stream:{stream="normal.target"}  log.namespace:="Production" _msg:~"cc641faf212b6xddgdews26d78" _time:[2024-10-08T13:34:42Z,2024-10-08T13:49:42Z) | sort by (_time desc ) | limit 30

It takes 20 seconds. For the same query, ClickHouse takes only 2 seconds. Both VictoriaLogs and ClickHouse are configured with 64-core CPUs and 256GB of memory.

Troubleshooting docs

[ ] General - https://docs.victoriametrics.com/troubleshooting/
[ ] vmagent - https://docs.victoriametrics.com/vmagent/#troubleshooting
[ ] vmalert - https://docs.victoriametrics.com/vmalert/#troubleshooting

Haleygo commented 2 days ago

_stream:{stream="normal.target"} log.namespace:="Production" _msg:~"cc641faf212b6xddgdews26d78" _time:[2024-10-08T13:34:42Z,2024-10-08T13:49:42Z) | sort by (_time desc ) | limit 30

There are two expensive subqueries in this particular expression: the substring filter _msg:~"cc641faf212b6xddgdews26d78" and the sort pipe sort by (_time desc ). Both of them slow down the query. It would be helpful to see the time spent on each subquery or pipe, similar to query tracing in VictoriaMetrics. cc @valyala
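Until per-pipe timing is available, a rough way to see where the time goes is to run variants of the query with one expensive part removed at a time and compare wall-clock times. The sketch below is only an illustration, not an official tool: it assumes a single-node VictoriaLogs instance reachable at localhost on the default port 9428, and the helper names (build_variants, run_query, timed, main) are made up for this example.

```python
import time
import urllib.parse
import urllib.request

# Assumption: single-node VictoriaLogs listening on its default port.
VLOGS_URL = "http://localhost:9428/select/logsql/query"

# Filters copied from the query in this issue, kept in the original order.
STREAM_NS = '_stream:{stream="normal.target"} log.namespace:="Production"'
MSG_FILTER = '_msg:~"cc641faf212b6xddgdews26d78"'
TIME_RANGE = '_time:[2024-10-08T13:34:42Z,2024-10-08T13:49:42Z)'


def build_variants():
    """Return query variants that drop one expensive part at a time."""
    return {
        # The original query.
        "full": f"{STREAM_NS} {MSG_FILTER} {TIME_RANGE} | sort by (_time desc) | limit 30",
        # Same filters, but without the sort pipe.
        "no_sort": f"{STREAM_NS} {MSG_FILTER} {TIME_RANGE} | limit 30",
        # Stream, namespace and time filters only, without the _msg filter.
        "no_msg_filter": f"{STREAM_NS} {TIME_RANGE} | limit 30",
    }


def run_query(query, url=VLOGS_URL):
    """POST the query and count the non-empty lines in the response."""
    data = urllib.parse.urlencode({"query": query}).encode()
    with urllib.request.urlopen(url, data=data) as resp:
        return sum(1 for line in resp if line.strip())


def timed(fn, *args):
    """Call fn(*args) and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result


def main():
    """Run every variant once and print its wall-clock time."""
    for name, query in build_variants().items():
        elapsed, rows = timed(run_query, query)
        print(f"{name}: {rows} rows in {elapsed:.2f}s")
```

Calling main() against a live instance prints one line per variant. Repeating each run a few times helps separate cold-cache from warm-cache timings, and the difference between "full" and "no_sort" gives a rough idea of what the sort pipe costs when the filters match nothing.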

peonqi commented 2 days ago

Yes, the above query is indeed very resource-intensive, but with the same amount of data and the same query logic, ClickHouse takes much less time than VictoriaLogs. Is there still a lot of room for optimizing VictoriaLogs for this kind of query?