cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
287 stars 49 forks source link

Calculating the sequence diversity between two variants returns a key error #356

Open vsbuffalo opened 3 years ago

vsbuffalo commented 3 years ago

Hello,

I found this behavior to be a bit unintuitive... I would be happy to submit a PR to address it, but it's not necessarily clear what the right behavior is. Here is an MRE; the issue is that calculating diversity on a range between variant raises an exception, rather than just returning zero pairwise diversity:

import allel as al
g = al.GenotypeArray([[[0, 0], [0, 0]],
                          [[0, 0], [0, 1]],
                          [[0, 0], [1, 1]],
                          [[0, 1], [1, 1]],
                          [[1, 1], [1, 1]],
                          [[0, 0], [1, 2]],
                          [[0, 1], [1, 2]],
                          [[0, 1], [-1, -1]],
                          [[-1, -1], [-1, -1]]])

ac = g.count_alleles()
pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]

al.sequence_diversity(pos, ac, start=21, stop=24)

Because the selected range resides between two variants, pos.locate_range() errors out with a KeyError. Here is the full traceback:


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-0cac1b30194e> in <module>
     12 pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
     13 
---> 14 al.sequence_diversity(pos, ac, start=21, stop=24)

~/miniconda3/envs/simple/lib/python3.8/site-packages/allel/stats/diversity.py in sequence_diversity(pos, ac, start, stop, is_accessible)
    266     # deal with subregion
    267     if start is not None or stop is not None:
--> 268         loc = pos.locate_range(start, stop)
    269         pos = pos[loc]
    270         ac = ac[loc]

~/miniconda3/envs/simple/lib/python3.8/site-packages/allel/model/ndarray.py in locate_range(self, start, stop)
   3609 
   3610         if stop_index - start_index == 0:
-> 3611             raise KeyError(start, stop)
   3612 
   3613         loc = slice(start_index, stop_index)

KeyError: (21, 24)

Generally, I think addressing this by modifying the behavior through pos.locate_range() is probably the wrong approach. Instead, we could wrap this in a try/except block for sequence_diversity() and like functions, and immediately return 0 if there are no variants within the region of interest (though, perhaps if no bases are accessible there, it should be np.nan instead? With the usual caveats this is a bit of a hack to deal with missingness values....).

Additionally, it's worth pointing out that this behavior is similar if no bases are accessible, but the range does include variants:

import allel as al
import numpy as np
g = al.GenotypeArray([[[0, 0], [0, 0]],
                          [[0, 0], [0, 1]],
                          [[0, 0], [1, 1]],
                          [[0, 1], [1, 1]],
                          [[1, 1], [1, 1]],
                          [[0, 0], [1, 2]],
                          [[0, 1], [1, 2]],
                          [[0, 1], [-1, -1]],
                          [[-1, -1], [-1, -1]]])

ac = g.count_alleles()
pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]

al.sequence_diversity(pos, ac, start=12, stop=24, is_accessible=np.repeat(False, 30))

The issue here is that mask_inaccessible() is removing all positions that are inaccessible. In this case, I think a np.nan might be a more desired return value too?

I am curious what you think.

thanks, Vince