I found this behavior to be a bit unintuitive... I would be happy to submit a PR to address it, but it's not necessarily clear what the right behavior is. Here is an MRE; the issue is that calculating diversity on a range between variant raises an exception, rather than just returning zero pairwise diversity:
Because the selected range resides between two variants, pos.locate_range() errors out with a KeyError. Here is the full traceback:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-2-0cac1b30194e> in <module>
12 pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
13
---> 14 al.sequence_diversity(pos, ac, start=21, stop=24)
~/miniconda3/envs/simple/lib/python3.8/site-packages/allel/stats/diversity.py in sequence_diversity(pos, ac, start, stop, is_accessible)
266 # deal with subregion
267 if start is not None or stop is not None:
--> 268 loc = pos.locate_range(start, stop)
269 pos = pos[loc]
270 ac = ac[loc]
~/miniconda3/envs/simple/lib/python3.8/site-packages/allel/model/ndarray.py in locate_range(self, start, stop)
3609
3610 if stop_index - start_index == 0:
-> 3611 raise KeyError(start, stop)
3612
3613 loc = slice(start_index, stop_index)
KeyError: (21, 24)
Generally, I think addressing this by modifying the behavior through pos.locate_range() is probably the wrong approach. Instead, we could wrap this in a try/except block for sequence_diversity() and like functions, and immediately return 0 if there are no variants within the region of interest (though, perhaps if no bases are accessible there, it should be np.nan instead? With the usual caveats this is a bit of a hack to deal with missingness values....).
Additionally, it's worth pointing out that this behavior is similar if no bases are accessible, but the range does include variants:
The issue here is that mask_inaccessible() is removing all positions that are inaccessible. In this case, I think a np.nan might be a more desired return value too?
Hello,
I found this behavior to be a bit unintuitive... I would be happy to submit a PR to address it, but it's not necessarily clear what the right behavior is. Here is an MRE; the issue is that calculating diversity on a range between variant raises an exception, rather than just returning zero pairwise diversity:
Because the selected range resides between two variants,
pos.locate_range()
errors out with aKeyError
. Here is the full traceback:Generally, I think addressing this by modifying the behavior through
pos.locate_range()
is probably the wrong approach. Instead, we could wrap this in atry/except
block forsequence_diversity()
and like functions, and immediately return 0 if there are no variants within the region of interest (though, perhaps if no bases are accessible there, it should benp.nan
instead? With the usual caveats this is a bit of a hack to deal with missingness values....).Additionally, it's worth pointing out that this behavior is similar if no bases are accessible, but the range does include variants:
The issue here is that mask_inaccessible() is removing all positions that are inaccessible. In this case, I think a
np.nan
might be a more desired return value too?I am curious what you think.
thanks, Vince