apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

CuratorInventoryManager may not report inventory as initialized #6176

Open gianm opened 6 years ago

gianm commented 6 years ago

The reason is unclear, but we've seen brokers running 0.12.1 fail to report "Inventory Initialized" events even after running for long periods of time. Nevertheless, the broker seems to be able to do native queries normally, and seems to be properly aware of all historicals. Restarting the broker generally clears this up. It seems to be somewhat random whether or not this will occur on any given broker startup.

This has one knock-on effect: DruidSchema (the SQL layer's metadata cache) doesn't activate until after inventory is initialized, and so this blocks the SQL layer from working. It just reports that there are no tables.

Checking /druid/broker/v1/loadstatus returns false when the broker is in this state.
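
For anyone who wants to detect the stuck state programmatically, polling that endpoint is enough. A minimal sketch, assuming a broker at localhost:8082 and a Java 11+ built-in HTTP client (adjust the host/port for your deployment; the exact response payload may vary by version, so this just prints the body):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BrokerLoadStatusCheck
{
  public static void main(String[] args) throws Exception
  {
    // Hypothetical default; pass the broker base URL as the first argument to override.
    String broker = args.length > 0 ? args[0] : "http://localhost:8082";

    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(broker + "/druid/broker/v1/loadstatus"))
        .GET()
        .build();

    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println("HTTP " + response.statusCode() + ": " + response.body());
    // A body reporting "false" corresponds to the stuck state described above,
    // even though native queries against the broker still work.
  }
}
```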

himanshug commented 6 years ago

It might be a good time to retire CuratorInventoryManager and switch segment management and discovery to the HTTP versions. I tested them for a long while, and I think @jihoonson mentioned that you have some clusters running the HTTP version. How do we feel about making the HTTP version the default from the next release? It is fully backward compatible and users shouldn't see any behavioral change. Once HTTP is the default, we can remove the ZK version after 1-2 releases.

jihoonson commented 6 years ago

@himanshug would you give us more details about your test environments?

In our company, we are still testing it in our test cluster, but not in production yet. One thing I'm concerned about is the increase in HTTP connections. The other day, I saw the Kafka indexing service using too many HTTP connections relative to the number of worker threads, even though the cluster was not using HTTP-based overlords or coordinators. The number of HTTP connections was a few thousand, which is not that high, but I'm not sure what the proper default configuration for the number of worker threads is.

Also, there is at least one critical bug which causes a deadlock during startup (https://github.com/apache/incubator-druid/issues/6201).

jihoonson commented 6 years ago

Also, I would say we should remove all remaining code that writes data to ZK in HTTP-based overlords/coordinators before making it the default.

himanshug commented 6 years ago

@jihoonson My test environment had 3 brokers, 2 coordinators, 2 overlords, ~40 MiddleManagers (each running about 6 Kafka indexing tasks created by the Kafka supervisor), and about 15-20 Historicals.

Some background and information is noted in https://groups.google.com/forum/#!msg/druid-development/eIWDPfhpM_U/AzMRxSQGAgAJ but there are 3 completely independent things here:

  1. switching the coordinator to use HTTP (using HttpLoadQueuePeon) for segment assignment (load/drop)
  2. switching the broker/coordinator to use HTTP (using HttpServerInventoryView) for discovering which segments are served by queryable nodes (historicals, and peons doing indexing)
  3. switching the overlord to use HTTP for task management (using HttpRemoteTaskRunner)
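
For reference, opting into these today is just a matter of runtime properties; if I remember the property names right (worth double-checking against the docs for your version), roughly:

```properties
# (1) coordinator: assign segment loads/drops over HTTP (HttpLoadQueuePeon)
druid.coordinator.loadqueuepeon.type=http

# (2) broker/coordinator: discover served segments over HTTP (HttpServerInventoryView) instead of ZK
druid.serverview.type=http

# (3) overlord: manage tasks over HTTP (HttpRemoteTaskRunner)
druid.indexer.runner.type=httpRemote
```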

In my comment above, I was talking about trying to make (1) and (2) the default after a bit more testing on some of your clusters.

Looks like #6201 pertains to (3), so let's not consider enabling (3) by default until we get to the bottom of #6201.

However, once (1), (2), and (3) are done and Druid clusters are using HTTP, and we remove the coordinator/overlord service announcement that is always done in ZK to support Tranquility, then it technically becomes possible to write discovery extensions that don't use ZooKeeper and use, say, etcd instead. However, this is an independent activity that will take its own time, so I don't want to make it a prerequisite for trying out HTTP or defaulting to it as we gain more confidence with those features. We can then remove the ZooKeeper code that is no longer needed in phases (i.e. after, say, 4-6 months from the release where a specific thing was made the default).

Each of (1) and (2) leads to one additional connection from each broker/coordinator to each queryable node. (3) leads to one additional connection from each overlord to each MiddleManager node.
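
As a rough back-of-the-envelope using my test environment above, purely for illustration: with roughly 255-260 queryable nodes (15-20 historicals plus ~240 indexing peons), (2) means each broker or coordinator holds on the order of 260 extra connections, and (3) means each overlord holds about 40 extra connections (one per MiddleManager), which is small compared to normal query traffic.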

On the broker/coordinator/overlord side, the EscalatedGlobal httpClient is used for making requests, so connections come from its existing pool; no new connection pools are created.

One thing I'm concerned about is the increase in HTTP connections.

Theoretically it should be OK, and in the testing above I haven't seen any connection issues pop up due to these features. But the concern is valid, and we can only become more confident as we roll it out on more clusters.

The other day, I saw the Kafka indexing service using too many HTTP connections relative to the number of worker threads, even though the cluster was not using HTTP-based overlords or coordinators. The number of HTTP connections was a few thousand, which is not that high, but I'm not sure what the proper default configuration for the number of worker threads is.

I am assuming you mean the overlord's HTTP client [worker threads] had thousands of outbound open connections. For the EscalatedGlobal client, also used by KIS, the number of connections is set at https://github.com/apache/incubator-druid/blob/master/server/src/main/java/io/druid/guice/http/HttpClientModule.java#L140 (the default value is 20). So, at the overlord, the maximum possible number of connections from that httpClient = 20 (or whatever is configured) X (the number of KIS task peons, plus any other processes that the overlord could talk to over HTTP using this client).

From https://github.com/apache/incubator-druid/blob/master/server/src/main/java/io/druid/initialization/Initialization.java#L377, I see there are at least 3 other HttpClient instances created with their own connection pools, so check whether those are holding some of the connections. If the above accounts for thousands of connections, then it is explained; otherwise there is a bug in the HttpClient code and it creates more connections than it is told to. It would be good if you could take a look at what host:port those connections are going to and see whether the connection counts make sense given the expectations above.
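
As a sanity check using that formula with the test-environment sizes I mentioned earlier (purely for illustration): ~40 MMs x ~6 KIS tasks each is roughly 240 peons, and 20 connections x 240 peons ≈ 4,800 possible outbound connections from the EscalatedGlobal client alone, so a few thousand open connections at the overlord is within the expected envelope before suspecting a connection leak.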

That said, the features in (1), (2), and (3) don't necessarily worsen the situation, because we already have far more HTTP requests going on all around due to other features. I may be proven wrong in the end, but we won't know till we try :).

jihoonson commented 6 years ago

@himanshug thanks for the details.

  1. switching the coordinator to use HTTP (using HttpLoadQueuePeon) for segment assignment (load/drop)
  2. switching the broker/coordinator to use HTTP (using HttpServerInventoryView) for discovering which segments are served by queryable nodes (historicals, and peons doing indexing)
  3. switching the overlord to use HTTP for task management (using HttpRemoteTaskRunner)

In my comment above, I was talking about trying to make (1) and (2) the default after a bit more testing on some of your clusters.

Looks like #6201 pertains to (3), so let's not consider enabling (3) by default until we get to the bottom of #6201.

Thanks. It sounds good to me.

However, once (1), (2), and (3) are done and Druid clusters are using HTTP, and we remove the coordinator/overlord service announcement that is always done in ZK to support Tranquility, then it technically becomes possible to write discovery extensions that don't use ZooKeeper and use, say, etcd instead. However, this is an independent activity that will take its own time, so I don't want to make it a prerequisite for trying out HTTP or defaulting to it as we gain more confidence with those features. We can then remove the ZooKeeper code that is no longer needed in phases (i.e. after, say, 4-6 months from the release where a specific thing was made the default).

This sounds good to me. What I meant by removing the remaining code that writes data to ZK is this kind of thing. Are there reasons that we can't remove these things right now? If not, I think we should clear them out before making it the default. It looks like these things only remain for HTTP-based task allocation, so it might not be an issue for (1) and (2). For Tranquility, I would say we don't have to improve it to support HTTP-based overlords at this point.

Regarding the number of HTTP connections, I have checked those connections and they were all valid. The issue was the small number of HTTP server/client worker threads, not the large number of HTTP connections. (For the worker thread configurations, see https://github.com/apache/incubator-druid/blob/master/server/src/main/java/io/druid/guice/http/DruidHttpClientConfig.java#L46 and https://github.com/apache/incubator-druid/blob/master/extensions-core/kafka-indexing-service/src/main/java/io/druid/indexing/kafka/supervisor/KafkaSupervisor.java#L281-L283.) So, I'm not worried about a few additional HTTP connections, but we may need to change the default configurations, because the defaults should not confuse users. If there is only one additional connection per master/worker, I guess it would be fine.
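
For anyone else tuning this, the knobs involved (to the best of my memory; please verify against the docs for your version) are roughly the global HTTP client pool on the client side and the Jetty thread pool on the server side, plus `workerThreads`/`chatThreads` in the Kafka supervisor tuningConfig, which is where the KafkaSupervisor code linked above derives its defaults:

```properties
# client side: connection pool of the global (escalated) HttpClient (20 is, I believe, the default)
druid.global.http.numConnections=20

# server side: Jetty threads available to answer those requests (example value)
druid.server.http.numThreads=40
```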

himanshug commented 6 years ago

What I meant by removing the remaining code that writes data to ZK is this kind of thing.

I see. Well, that specific code snippet has to exist to allow users to change from RemoteTaskRunner to HttpRemoteTaskRunner easily. MiddleManagers are configured to keep announcing task status in ZK even after you switch the overlord to use HTTP, so the ZK cleanup still needs to happen. Once everyone is using the HTTP remote task runner, we can stop MiddleManagers from announcing task status in ZK and remove the code you mentioned.

jihoonson commented 6 years ago

Thanks. It sounds like this is definitely needed to support rolling updates if the HTTP task runner becomes the default.

waixiaoyu commented 5 years ago

It looks like this bug: https://issues.apache.org/jira/browse/CURATOR-476

gianm commented 5 years ago

It sounds like we should try upgrading to Curator 4.1.0 and see what that does.

clintropolis commented 5 years ago

Reopening until the issue is confirmed resolved by #6862.

github-actions[bot] commented 1 year ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.