chaimleib / intervaltree

A mutable, self-balancing interval tree. Queries may be by point, by range overlap, or by range containment.
Apache License 2.0
636 stars 108 forks source link

Chop datafunc needs chop interval #112

Open jeffwu78 opened 2 years ago

jeffwu78 commented 2 years ago

The chop datafunc has access to the old interval and whether it is a lower or upper chop, but it does not have access to the requested chop interval. This seems like a major oversight. Am I missing something?

To clarify, in the example given the documentation

def datafunc(iv, islower): ... oldlimit = iv[islower] ... return "oldlimit: {0}, islower: {1}".format(oldlimit, islower) t = IntervalTree([Interval(0, 10)]) t.chop(3, 7, datafunc)

datafunc is given access to the old interval (0, 10), but not the actual chop interval (3, 7).

chaimleib commented 2 years ago

The datafunc exists to return what the data field of the resulting interval(s) should be. The result interval has not been finalized yet at the time the datafunc is called, and so it is not provided. However, its bounds can be determined by combining the arguments given with knowledge of the boundary values provided to chop (via lexical scoping).

iv as provided in your quoted snippet is one of the original intervals that are getting sliced, in case this is needed to build the return value.

islower is a boolean that says whether the resulting interval is less than lower bound provided to chop. If so, it is True; otherwise, islower is False and the resulting interval is greater than the upper bound provided to chop. In the snippet, I used it to help select the boundary value of iv that will be obliterated in the chop in oldlimit = iv[islower].

Note that if the chop boundaries both lie within an interval to be chopped, datafunc will be called twice with the same iv: once with islower = True and once with islower = False.

jeffwu78 commented 2 years ago

The interval data field in my application is a time series, and in my datafunc function, I want to cut out the part of the time series not belonging to chop interval. The problem is the datafunc function does not have access to the endpoints of the chop interval needed to do this.

Although now that I think of it, there is a work-around to the problem. I have an "extended" datafunc function also takes in the chop interval endpoints, and then use a lambda function to pass in these endpoints and create the desired datafunc for the chop function.

johentsch commented 1 year ago

Found this when I had the same question for .slice(), so I'm spelling out the above solutions by @chaimleib and @jeffwu78 which work the same way for .chop() by integrating the second parameter.

Problem

def datafunc(sliced_interval, islower):
    a, b, _  = sliced_interval
    if islower:
        return f"new data = remove ['point', {b}) from [{a}, {b})"
    return f"new data = remove [{a}, 'point') from [{a}, {b})"

t = IntervalTree([Interval(0, 10), Interval(5, 15)])
t.slice(6, datafunc)
Result where datafunc does not have access to slice point 6

```python [Interval(0, 6, "new data = remove ['point', 10) from [0, 10)"), Interval(5, 6, "new data = remove ['point', 15) from [5, 15)"), Interval(6, 10, "new data = remove [0, 'point') from [0, 10)"), Interval(6, 15, "new data = remove [5, 'point') from [5, 15)")] ```

Solution inspired by @chaimleib

def slice_the_tree(t, point):

    def datafunc(sliced_interval, islower):
        a, b, _  = sliced_interval
        if islower:
            return f"new data = remove [{point}, {b}) from [{a}, {b})"
        return f"new data = remove [{a}, {point}) from [{a}, {b})"

    t.slice(point, datafunc)

t = IntervalTree([Interval(0, 10), Interval(5, 15)])
slice_the_tree(t, 6)
Result where datafunc has access to slice point 6

```python [Interval(0, 6, 'new data = remove [6, 10) from [0, 10)'), Interval(5, 6, 'new data = remove [6, 15) from [5, 15)'), Interval(6, 10, 'new data = remove [0, 6) from [0, 10)'), Interval(6, 15, 'new data = remove [5, 6) from [5, 15)')] ```

Solution inspired by @jeffwu78

def extended_datafunc(sliced_interval, islower, point):
    a, b, _  = sliced_interval
    if islower:
        return f"new data = remove [{point}, {b}) from [{a}, {b})"
    return f"new data = remove [{a}, {point}) from [{a}, {b})"

t = IntervalTree([Interval(0, 10), Interval(5, 15)])
t.slice(6, lambda sliced_interval, islower: extended_datafunc(sliced_interval, islower, 6))
Result as above

```python [Interval(0, 6, 'new data = remove [6, 10) from [0, 10)'), Interval(5, 6, 'new data = remove [6, 15) from [5, 15)'), Interval(6, 10, 'new data = remove [0, 6) from [0, 10)'), Interval(6, 15, 'new data = remove [5, 6) from [5, 15)')] ```

More legible adaptation

This solution uses the extended_datafunc() from the previous solution but uses partial initialization instead of the lambda:

from functools import partial
t = IntervalTree([Interval(0, 10), Interval(5, 15)])
t.slice(6, partial(extended_datafunc, point=6))