ExpediaGroup / waggle-dance

Hive federation service. Enables disparate tables to be concurrently accessed across multiple Hive deployments.
Apache License 2.0

HiveServer hung while executing Hive query #126

Closed ambrish29 closed 6 years ago

ambrish29 commented 6 years ago

We have set up Waggle Dance with one primary metastore and one federated metastore.

We have a table with roughly 100K partitions in the primary metastore.

One of our users tried to execute a query and forgot to add a partition filter.

Case when using the Waggle Dance Thrift URI: the HiveServer tried to read all of the Hive partitions and after some time the query failed with a "Query timeout error". Moreover, all subsequent queries also started failing with "Query timeout error", which eventually led to an outage of the HiveServer.

Case when using the primary metastore directly, without Waggle Dance: the HiveServer tried to read all of the Hive partitions and after some time the query failed with a "Query timeout error", but the HiveServer kept responding. We executed the same query over and over and the HiveServer did not go into a hung state.

Example:

tableA: partition keys = local_date, site_name; partition count: roughly 100K (a hypothetical DDL sketch follows the example below)

Sample query: select * from tableA where local_date = "random value"

Result with Waggle Dance: "Query timeout error", with the HiveServer in a hung/non-responding state.

Result without Waggle Dance: "Query timeout error", with the HiveServer still in an active state.
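
For reference only, this is a minimal sketch of what a table definition matching the description above might look like; the column names and types are assumptions, only the partition keys and the rough partition count come from the report:

-- Hypothetical definition for tableA; non-partition columns are assumed
CREATE TABLE tableA (
  event_id STRING,
  payload STRING
)
PARTITIONED BY (
  local_date STRING,
  site_name STRING
);

-- With ~100K (local_date, site_name) partitions, any query the planner cannot
-- prune forces the metastore to return metadata for every partition.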

massdosage commented 6 years ago

When the metastore service is asked to fetch a huge number of partitions this can result in odd behaviour, usually caused by OutOfMemoryErrors in its JVM. Waggle Dance implements the same Thrift API and can also run out of memory on the same query. So firstly I would suggest putting some things in place to prevent users from forgetting to specify a partition filter (i.e. set hive.mapred.mode=strict; in the config). Then I would look at the memory settings for the metastore service and for Waggle Dance; Waggle Dance should have at least as much memory as the metastore service, possibly a bit more. You should add monitoring on the JVM heap, and possibly alerting when it goes over a certain threshold, at which point you could trigger a restart of the service. It would also be advisable to set up a health check so that if an instance runs out of memory and dies you can restart it.
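
As a minimal sketch (using the hypothetical tableA from the example above), strict mode can also be enabled per session with the same setting, after which a query on a partitioned table that has no partition-key predicate is rejected at compile time instead of being sent to the metastore:

-- Reject queries on partitioned tables with no partition-key predicate
set hive.mapred.mode=strict;

-- Fails in strict mode: no filter on local_date or site_name
select * from tableA where event_id = 'x';

-- Still allowed: prunes to the partitions for a single local_date
select * from tableA where local_date = '2018-01-01';

Setting the same property in the Hive configuration makes it the default for all sessions rather than having to set it per session.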

If you really want to be able to survive something like this happening, we'd need to see the log output of both Waggle Dance and the metastore service when it occurs, as well as the memory usage of both, to try to determine the cause.

saikrishnaburugula commented 6 years ago

Hi, these are the logs for the failed queries. These logs are from when we were using Waggle Dance in Qubole: log_174439254.txt. These logs are from when the Qubole cluster was pointed directly at the Hive metastore: log_174446933.txt

massdosage commented 6 years ago

OK, these are the Hive CLI logs; I was referring to the Hive Metastore service's server logs for the same period in time when these errors happen on the client. Both of these seem to indicate that something has gone wrong on the server, so I'd like to know what the error on the server is. Also, can you get memory usage graphs of the server? You can configure the metastore service to output standard JMX stats and then graph them in something like Grafana.

massdosage commented 6 years ago

Closing as we don't currently have enough information to determine whether there are issues with the underlying Hive metastores and if so, what knock-on effects they are having in Waggle Dance. We need the logs and metrics mentioned in the previous comment.