elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Log queries before execution #9172

Closed · avleen closed this issue 7 years ago

avleen commented 9 years ago

Currently the best way to emulate a full query log is to set the threshold on the slow query log to 0s. The downside to this is that queries aren't logged until they finish running. If a query takes a really long time (we've seen queries that can overload a box, cause GC problems, and basically never end), you never find out what the query was.
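For reference, the workaround described above amounts to zeroing out the slow-log thresholds via the index settings API so every query gets captured; a minimal sketch (the index name `my-index` is a placeholder):

```shell
# Set the search slow-log thresholds to 0s so every query is logged
# once it completes ("my-index" is a placeholder index name).
curl -XPUT 'localhost:9200/my-index/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "0s",
  "index.search.slowlog.threshold.fetch.warn": "0s"
}'
```

This still only logs after the query (or fetch) phase finishes on each shard, which is exactly the limitation this issue is about.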

Can we have a separate log which records all queries before they are run?

souravmitra commented 9 years ago

@clintongormley: Do you have any concerns regarding the implementation of this request? What factors should somebody take into account if they were to implement it? Thanks.

avleen commented 9 years ago

From an operational perspective, it would be great to see two lines per query:

  1. The first, just before the query is run.
  2. The second, after it finishes, including the time it took.

The lines would need to share some kind of common UUID (or similar) so they can be easily paired up. Specifically for the use case I mentioned, being able to grep for UUIDs and see the ones that only appear once would immediately make problematic queries obvious!
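The grep-for-unpaired-UUIDs idea can be sketched in a few lines of log analysis. The log format here is hypothetical (UUID as the third whitespace-separated field); the point is just that a start/end pair collapses to a count of two, and a query that never finished shows up with a count of one:

```python
from collections import Counter

def unpaired_uuids(log_lines):
    """Return UUIDs that appear exactly once, i.e. queries that logged
    a start line but never logged a completion line."""
    counts = Counter(line.split()[2] for line in log_lines if line.strip())
    return [uuid for uuid, n in counts.items() if n == 1]

log = [
    "2015-01-06T10:00:00 START 9f1c-a (query: {...})",
    "2015-01-06T10:00:02 END   9f1c-a took 2s",
    "2015-01-06T10:00:01 START 77d2-b (query: {...})",  # never finished
]
print(unpaired_uuids(log))  # -> ['77d2-b']
```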

vvcephei commented 9 years ago

This is something that we'd dearly love to see as well. In fact, we would be satisfied with just a regular full query log. We almost took a crack at implementing it and sending a PR, but decided just to update and use the Jetty plugin instead.

We were a little apprehensive about using the 0ms slow query log, as we have pretty high query volume (8k requests/sec), and we weren't sure we could trust the slow query logger not to bottleneck performance (and also not to fill the disk).
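To put the disk concern above in rough numbers: at the 8k requests/sec mentioned, even a modest log line adds up fast. The 500-byte average line size below is an assumption for illustration, not a measured figure:

```python
qps = 8_000            # query volume cited in the comment above
bytes_per_line = 500   # assumed average size of one logged query line
per_day = qps * bytes_per_line * 86_400  # seconds per day
print(f"{per_day / 1e9:.0f} GB/day")     # ~346 GB/day
```

At that rate the log either needs aggressive rotation, sampling, or an async appender that can drop lines under pressure rather than block the search path.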

The other big problem with the slow query log is that it logs fetch and query separately at the shard level, which I agree is useful for analyzing slow queries, but it's not what you want if you're trying to measure the full query execution time.

The Jetty plugin works pretty well for query logging, but:

nirmalc commented 9 years ago

+1. @clintongormley, is this an "adoptme" issue? We need this feature too and are open to working on it.

clintongormley commented 9 years ago

It looks like this has been marked as discuss for a while, but hasn't actually been discussed yet :)

Just some thoughts to get the discussion going:

nik9000 commented 9 years ago

I'm a fan of this request. In a previous life we did this logging on the client side and used it to find a few bugs.

> Often slow GCs are the compound result of a number of requests, rather than a single bad request (although one bad request can be responsible)

But it'll still be visible in the logs. The logs might not be the right tool for identifying what is causing them, but this could be helpful.

> The top-searches feature (#12187) will help to identify current long queries

It's more black-box-ish than a log at start and stop. Logs around the query starting and stopping are more bulletproof, I think.

> I think not all requests using the Java API can be rendered as JSON currently (#12992)

This is probably worth fixing.

> Would we log on the coordinating node or the data node? If the latter, per shard or per node?

Probably all, and make turning it on and off an index-level dynamic setting. I'd settle for just doing it on the coordinating node, like SearchSlowLog, as a first cut. That is only slightly better than what clients can do.

paullovessearch commented 8 years ago

+1

evanvolgas commented 8 years ago

This should probably xref Etsy's ES Restlog (https://github.com/etsy/es-restlog#overview). The points they bring up are very good, especially the point about

> [the slow query log] operates at the shard-request level so you end up with lots of lines logged in case there are multiple shards or query phases involved.

I'm also a little unclear on the state of this discussion. It seems like this got de-prioritized in favor of https://github.com/elastic/elasticsearch/issues/12187, which then maybe got rolled into https://github.com/elastic/elasticsearch/issues/15117 (I'm actually not sure how 15117 addresses this issue or 12187, unless searches, including failed searches, are tasks too... or maybe it's just Friday afternoon and I'm overlooking something very obvious?).

The original discussion seemed to focus on troubleshooting GC, for example. For me, with ES 2.x, my experience has been exactly what @clintongormley suspected it would be, e.g.:

  • Doc values by default in 2.0 will help with many cases of OOM/slow GC (but not all)
  • Often slow GCs are the compound result of a number of requests, rather than a single bad request (although one bad request can be responsible)

That being said, I still think this idea has a lot of merit. One of the things the MySQL community got very right was the work they did with the Percona query digest (https://www.percona.com/doc/percona-toolkit/2.1/pt-query-digest.html). A DBA with the output of PT Query Digest can usually zoom in on a handful of bad query "fingerprints" that are causing an inordinate amount of work on the database and clean them up to great effect.

This kind of analysis -- figuring out what queries caused the database to do the most work -- fundamentally requires the ability to review and analyze the queries that the cluster is responding to -- on a per query (as opposed to a per shard) level. That's not really feasible with the slow query log.
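As a rough illustration of what a pt-query-digest-style "fingerprint" could look like for Elasticsearch queries (the normalization rule here is an assumption made for the sketch, not an established format): strip every literal value so that structurally identical queries collapse to one fingerprint, then count or time queries per fingerprint:

```python
import hashlib
import json

def fingerprint(query):
    """Collapse literal values so structurally identical queries
    share one fingerprint (a sketch, not a standard scheme)."""
    def strip(node):
        if isinstance(node, dict):
            return {k: strip(v) for k, v in node.items()}
        if isinstance(node, list):
            return [strip(v) for v in node]
        return "?"  # replace every literal value with a placeholder
    normalized = json.dumps(strip(query), sort_keys=True)
    return hashlib.sha1(normalized.encode()).hexdigest()[:12]

q1 = {"query": {"term": {"user": "alice"}}}
q2 = {"query": {"term": {"user": "bob"}}}    # same shape as q1
q3 = {"query": {"match": {"title": "bob"}}}  # different shape
print(fingerprint(q1) == fingerprint(q2))    # -> True
print(fingerprint(q1) == fingerprint(q3))    # -> False
```

With a per-query (not per-shard) log, aggregating time and failure counts by fingerprint is a straightforward postprocessing step.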

And, specifically where that overlaps with what this ticket is talking about, it's also not possible to identify query fingerprints that have a tendency to result in timeouts or failures and might need some help. In some cases, being unable to identify these fingerprints could be devastating. Suppose you have a query on a cron job that's set to alert you if a certain event threshold is exceeded (or suppose you set up the same thing with Watcher or Elastalert). What if your data grows or your keys get skewed, and a query that started out just fine develops a tendency to time out every time you run it? Will this cron job / Watcher / Elastalert alert you to the fact that the query it's running keeps failing? The cron job might... and Elastalert and Watcher will certainly complain in their log files (maybe there's a way to push alerts on query failure, but if there is, I haven't noticed it). But will the developer / administrator of the cluster see the failures in the logs? It's not difficult to imagine that they wouldn't.

Personally, I think logging queries before execution is a great idea. Further, to the point @avleen made, I think it would also be a great idea to log queries after they execute, and to log them on the query (as opposed to shard) level.

Logging slow queries at the shard level makes great sense. But when you're trying to figure out which queries fail most often and/or which queries your cluster spends the most time answering, the shard level is not the right place to keep track of that information. As a datastore administrator, it is considerably easier for me to help developers understand which of their queries are problematic and/or failing when they're working with MySQL. With Elasticsearch... it can be done. But honestly it's a lot harder than it probably should be, because I don't have a cluster-level record of:

  1. What queries users sent to the cluster
  2. Whether or not the query failed
  3. How long the query took from the time the cluster accepted it to the time it issued a response to the client

clintongormley commented 7 years ago

Logging queries at the start of execution means that you need a second log line for the execution time, which then complicates log parsing. You can now check the task manager for long running queries, and optionally kill them.
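For anyone landing here, checking and killing long-running searches through the task management API looks roughly like this (the task id in the cancel call is a placeholder):

```shell
# List currently running search tasks with their descriptions
curl -XGET 'localhost:9200/_tasks?actions=*search&detailed=true'

# Cancel a long-running task by its id (placeholder id shown)
curl -XPOST 'localhost:9200/_tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel'
```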