cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
287 stars 49 forks source link

windowing: weird(?) behavior with the last window #363

Open mufernando opened 3 years ago

mufernando commented 3 years ago

In the example below, we have four sites and I ask scikit-allel to output windowed diversity with windows of length 1.

Screen Shot 2021-06-25 at 15 30 33

Note, however, that I only get 3 values in the diversity array, and the last window ends up including sites 3 and 4.

This seems a bit odd to me, so I was trying to understand why scikit-allel handles it like this. Just to spell it out: I find it odd that the windows end up being 1,1], [2,2] and [3,4] instead of [1,1], [2,2], [3,3], [4,4].

cc @petrelharp

petrelharp commented 3 years ago

Here's what's happening:

  1. windows are specified as a sequence of (1-based) nonoverlapping [start, stop] values, with both ends inclusive.
  2. specifying start, stop, step uses range(start, stop, step) here to produce the starting points.
  3. range(1, 4, 1) produces [1,2,3], for the starting points, and then 4 is reserved for the last stop value, which makes sense, as you did say stop=4.

@mufernando, it's not clear to me what the right thing is here. At least this should be documented more (eg. say that endpoints of windows are both inclusive), but I'd have to work some more examples to figure out the consequences of different ways of doing this.

alimanfoo commented 3 years ago

Looking at the implementation of position_windows again, I'm not sure why I did it like that. Perhaps this would be better:

    for window_start in range(start, stop, step):

        # determine window stop (inclusive)
        window_stop = window_start + size - 1
        if window_stop > stop:
            # last window
            window_stop = stop
            last = True

        windows.append([window_start, window_stop])

        if last:
            break
petrelharp commented 3 years ago

That seems like it'd be better...

mufernando commented 3 years ago

That looks good @alimanfoo. The only thing to change there would be to do the range up until stop+1, because in general 1-based systems are left and right inclusive (see this).