KevinStern / software-and-algorithms

Neat algorithm implementations in Java.
MIT License

Change the searching code in DynamicIntervalTree to make it more efficient #3

Closed · Gigatron closed this issue 11 years ago

Gigatron commented 11 years ago

Using a binary search instead of the one-by-one approach: this reduces the time cost of searching for all overlapping intervals to O(log n + k).

KevinStern commented 11 years ago

Boyu,

Thank you for the message; unfortunately, I don't believe that your algorithm gives the stated runtime property. In fact, I don't believe that any correct algorithm that chooses paths through a binary tree containing intervals ordered by their low endpoint would be able to achieve the stated runtime property without some additional trickery.

As proof, consider such a balanced binary tree containing n intervals ordered by their low endpoint, and a query interval that is to the right of all such intervals except for the interval with the lowest low endpoint among those in the tree, the interval with the lowest low endpoint among those in the right subtree of the root, the interval with the lowest low endpoint among those in the right subtree of the right subtree of the root, and so on. Any correct algorithm that simply chooses paths through the tree would need to traverse at least log_2(n) nodes, then log_2(n) - 1 nodes, and so on, for a total of sum[i=0..log_2(n)](log_2(n) - i) = (1/2)(log_2(n)^2 + log_2(n)) nodes, which is O(log_2(n)^2) (or O(k * log_2(n)) when viewed from the k = log_2(n) perspective).

Note that this is simply a counter-example showing that your approach is not O(log(n) + k); it is not a proof that your approach is O(k * log_2(n)). Please let me know if you think that I've misunderstood your approach or that my analysis is incorrect; otherwise, I will leave the code as-is.
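In display form, writing m = log_2(n) for brevity, the traversal count in this counter-example is:

    \sum_{i=0}^{m} (m - i) \;=\; \sum_{j=0}^{m} j \;=\; \frac{m(m+1)}{2} \;=\; \frac{1}{2}\bigl(\log_2(n)^2 + \log_2(n)\bigr) \;=\; O\bigl(\log_2(n)^2\bigr)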

Warm Regards,

Kevin

Gigatron commented 11 years ago

Hi Kevin,

Yeah, your counter-example is correct. But I think the average cost might still be better, considering that it doesn't have to perform the insert/delete operation for each query. For example, if the query interval is so large that all intervals are going to be returned, the previous algorithm will cost O(n log n) time, since it is almost rebuilding the tree, but the modified one will only take O(n) time, since it just visits each node once and adds it to the query result.

KevinStern commented 11 years ago

Hi Boyu,

Your algorithm is not O(k * log_2(n)); in fact, it can consume log_2(n)^2 time finding a single interval. Consider an interval tree containing n intervals and a query interval that is to the right of all such intervals except for the interval with the lowest low endpoint among those in the tree. The algorithm will traverse the entire ~log_2(n)-sized right-subtree path originating from each node on the left-subtree path leading to the single included interval.

On a different note: GitHub tells me that this thread was closed by me, though I don't recall ever closing it. I'll go ahead and re-open it.

Regards,

Kevin

Gigatron commented 11 years ago

Hi Kevin,

In your example it will only check the root node of each right subtree at each level, since the following condition is true when visiting them:

    if (queryInterval.getLow().compareTo(root.getMaximumHighEndpoint()) > 0)
      return;

therefore it will visit 2 * log_2(n) nodes instead of log_2(n)^2 as you said.
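
For illustration, here is a minimal sketch of how that prune might sit inside a recursive overlap search. This is not the repository's actual code; the Node/Interval accessors (getLeft, getRight, getInterval, getLow, getHigh, overlaps) are assumed names, with getMaximumHighEndpoint taken from the snippet above:

    // Sketch only: assumed accessor names, not the actual DynamicIntervalTree API.
    static <U extends Comparable<U>> void search(Node<U> root, Interval<U> queryInterval,
                                                 List<Interval<U>> result) {
      if (root == null) {
        return;
      }
      // The quoted prune: every interval in this subtree ends before the query begins.
      if (queryInterval.getLow().compareTo(root.getMaximumHighEndpoint()) > 0) {
        return;
      }
      search(root.getLeft(), queryInterval, result);
      if (root.getInterval().overlaps(queryInterval)) {
        result.add(root.getInterval());
      }
      // Intervals are ordered by low endpoint, so once this node's low endpoint passes
      // the query's high endpoint, the right subtree cannot contain an overlap.
      if (root.getInterval().getLow().compareTo(queryInterval.getHigh()) <= 0) {
        search(root.getRight(), queryInterval, result);
      }
    }

In the counter-example above, each right-subtree root along the descent is entered once and immediately cut off by the first prune, which is where the 2 * log_2(n) visit count comes from.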

Thanks, Boyu

KevinStern commented 11 years ago

Hi Boyu,

You are right; I had misread your code and was analyzing an algorithm that decides whether to traverse the right subtree independently of whether the current subtree falls to the left of the query interval. I have considered a variant of your approach that maintains both a maximum high endpoint and a minimum low endpoint value for the subtree rooted at each node; then, the search algorithm visits a node iff the node's subtree falls neither to the right nor to the left of the query interval. Unless I've made a mistake in my analysis, this variant is O(k * log_2(n)) time:
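
As an illustration of that visit criterion, here is a small self-contained sketch; the class and method names (and the integer endpoints) are made up for the example and do not reflect the actual DynamicIntervalTree code:

    import java.util.ArrayList;
    import java.util.List;

    class IntervalSearchSketch {
      // A closed interval [low, high] over ints, for illustration only.
      static final class Interval {
        final int low, high;
        Interval(int low, int high) { this.low = low; this.high = high; }
        boolean overlaps(Interval o) { return low <= o.high && o.low <= high; }
      }

      // A BST node ordered by interval low endpoint, augmented with the maximum
      // high endpoint and minimum low endpoint over its subtree.
      static final class Node {
        final Interval interval;
        final Node left, right;
        final int maxHigh, minLow;
        Node(Interval interval, Node left, Node right) {
          this.interval = interval;
          this.left = left;
          this.right = right;
          int hi = interval.high, lo = interval.low;
          if (left != null) { hi = Math.max(hi, left.maxHigh); lo = Math.min(lo, left.minLow); }
          if (right != null) { hi = Math.max(hi, right.maxHigh); lo = Math.min(lo, right.minLow); }
          this.maxHigh = hi;
          this.minLow = lo;
        }
      }

      // Visit a node iff its subtree lies neither entirely to the left of the query
      // (subtree max high < query low) nor entirely to the right (subtree min low > query high).
      static void search(Node node, Interval query, List<Interval> result) {
        if (node == null || node.maxHigh < query.low || node.minLow > query.high) {
          return;
        }
        if (node.interval.overlaps(query)) {
          result.add(node.interval);
        }
        search(node.left, query, result);
        search(node.right, query, result);
      }

      public static void main(String[] args) {
        // Tiny hand-built tree, ordered by low endpoint: [1,2] < [4,9] < [6,7].
        Node root = new Node(new Interval(4, 9),
                             new Node(new Interval(1, 2), null, null),
                             new Node(new Interval(6, 7), null, null));
        List<Interval> result = new ArrayList<>();
        search(root, new Interval(5, 6), result);
        // Prints [4, 9] and [6, 7]; the left subtree is pruned because its max high (2) < 5.
        result.forEach(i -> System.out.println("[" + i.low + ", " + i.high + "]"));
      }
    }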

Given a dynamic interval tree with comparator < (which induces an ordering by low endpoint and deterministically resolves order among intervals with equal low endpoints) containing n nodes:

As discussed earlier in the thread, in the worst case each of the k intervals overlapping the query interval requires log_2(n) time on average to find, even with perfect decision making. If the search algorithm makes perfect decisions, then we're done; otherwise, consider a node q visited by the search algorithm that does not root a subtree containing an interval that overlaps with the query interval. By the visit criterion, the subtree rooted at q must lie neither to the right nor to the left of the query interval; therefore, the query interval must fall between two intervals contained within the subtree. Select, from the subtree rooted at q, the greatest interval (with respect to <) whose low endpoint does not exceed the query interval's low endpoint, call it i1, and the least interval (with respect to <) whose low endpoint does not precede the query interval's high endpoint, call it i2. There is no interval i in the full tree such that i1 < i < i2: towards a contradiction, assume otherwise; i could not fall within the subtree rooted at q, else it would contradict either the selection of i1 and i2 or the assumption that no interval in the subtree rooted at q overlaps with the query interval, but then the interval tree would not be a binary search tree with comparator < (a contradiction). Since i1 and i2 are adjacent in the ordering and straddle the query interval, every such node q selects the same pair; therefore, each such q is an ancestor of both i1 and i2, and the algorithm traverses at most log_2(n) such nodes.

I've given your variant of the algorithm only cursory thought. I suspect that it exhibits the same worst-case runtime performance, but demonstrating that would require a bit more work.

KevinStern commented 11 years ago

To complete this thread: I've pushed the modifications that I discussed in my previous comment to DynamicIntervalTree.