cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
283 stars 49 forks source link

Feature Request - unphased IHS/XPEHH #374

Open kullrich opened 2 years ago

kullrich commented 2 years ago

Dear scikit-allel team,

would it be possible to add an unphased version of the IHS and XPEHH statistic to be directly used with a GenotypeArray?

The developer of selscan has just published an unphased version of both methods:

https://www.biorxiv.org/content/10.1101/2021.10.22.465497v1.full.pdf

Best regards and thank you in anticipation

Kristian

kullrich commented 2 years ago

Might be a starting point, since only biallelic sites are used and converted into a so-called multi-allelic representation (ref-hom (0), het (1), alt-hom (2), I guess I need to create a new ihh01_scan_unphased function in allel/opt/stats.pyx, a new ssl2ihh_unphased function in allel/opt/stats.pyx and create a new ihs_unphased function in allel/stats/selection.py

import allel
from allel.util import asarray_ndim
from allel.util import check_integer_dtype
from allel.util import check_dim0_aligned
from allel.compat import memoryview_safe

def genotype_to_multilocus(g, pos):
    """Converts a Genotype array into a Multilocus array,
    only retaining biallelic sites and corresponding positions.

    Parameters
    ----------
    g : array_like, int, shape (n_variants, n_samples, ploidy)
        Genotype array.
    pos : array_like, int, shape (n_variants,)
        Variant positions (physical distance).

    Returns
    -------
    ms : array_like, int, shape (n_variants, n_loci)
        Multilocus array.
    mspos : array_like, int, shape (n_variants,)
        Biallelic positions (physical distance).

    """

    # convert genotype array into multi-locus array
    ac = g.count_alleles()
    bipos = [x for x,y in enumerate(ac.allelism()) if y in [1,2]]
    gs = g.subset(bipos)
    # get multi-locus array
    ms = gs.is_het().astype(int)
    ms[gs.is_hom_alt()] = 2
    ms = allel.HaplotypeArray(ms)
    mspos = [y for x,y in enumerate(pos) if x in bipos]

    # check inputs
    ms = asarray_ndim(ms, 2)
    check_integer_dtype(ms)
    mspos = asarray_ndim(mspos, 1)
    check_dim0_aligned(ms, mspos)
    ms = memoryview_safe(ms)
    mspos = memoryview_safe(mspos)

    return [ms, mspos]

g = allel.GenotypeArray([[[0, 0], [0, 0], [0, 0]],
                         [[0, 0], [0, 1], [0, 1]],
                         [[0, 0], [1, 1], [0, 1]],
                         [[0, 1], [1, 1], [1, 1]],
                         [[1, 1], [1, 1], [-1, -1]],
                         [[0, 0], [1, 2], [0, 0]],
                         [[0, 1], [1, 2], [1, 2]],
                         [[0, 1], [-1, -1], [0, 0]],
                         [[-1, -1], [-1, -1], [-1, -1]]])

pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]

ms, mspos = genotype_to_multilocus(g, pos)