apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0
13.53k stars 3.71k forks source link

Optimize ReservoirSegmentSampler.getRandomBalancerSegmentHolder() #7147

Open leventov opened 5 years ago

leventov commented 5 years ago

The current algorithm is O(N_SEGMENTS_IN_CLUSTER), it can be O(N_SERVERS_IN_CLUSTER) + O(log(N_MAX_SEGMENTS_ON_A_SERVER) via reservoir choice of a server first using populations of servers, then a random segment can be chosen on a server using this hack.

shivtools commented 5 years ago

@leventov why would you choose to use reservoir choice of a server as opposed to randomly choosing a server since we already know the size of servers? Given this random server, we could then choose a random segment using the hack you've shown (very cool btw!).

I've put up https://github.com/apache/incubator-druid/pull/7174 with the approach that you suggested, but chose a server at random instead of using reservoir sampling (I can change this if required)

leventov commented 5 years ago

@shivtools the previous implementation was choosing a segment uniformly at random in a cluster. If you just choose a random server first, segments that are served by less populated servers have a higher chance of being chosen overall. Or am I wrong?

shivtools commented 5 years ago

Hey @leventov, you're right! I tuned the server selection to use reservoir sampling as you suggested.