allel.pairwise_distance treats missing calls as litteral -1

feilchenfeldt commented 2 years ago

Hello,

I am trying to calculate pairwise differences between sequences from a vcf with missing genotypes. However, allel.pairwise_distance seems to treat missing genotypes at litteral -1.

Here a minimal example of three (haploid) samples with two snps.

ht = allel.HaplotypeArray([[0, 0, -1],
                          [0, 1, -1]], dtype='i1')
d = allel.pairwise_distance(ht, metric='cityblock')
d_array = scipy.spatial.distance.squareform(d)
print(d_array)

[[0. 1. 2.]
 [1. 0. 3.]
 [2. 3. 0.]]

The last sample has all missing data. Still it has the largest distance to both other samples and is more distant to sample 2 than to sample 1. While it is not totally obvious how to account for missing data, just treating missing data as litteral -1 can't be the right approach and might lead to false conclusing if people are not aware of this unexpected behaviour.

In the absence of phasing or priors, I think the most sensible approach would be to count comparisons between two samples with one or both calls missing as zero (no difference) and then rescale each pairwise difference value by number_of_comparisons/number_of_comparisons_with_non_missing_data. I think this is also what plink does: https://www.cog-genomics.org/plink/1.9/distance#distance

Best wishes, Hannes

alimanfoo commented 2 years ago

Hi @feilchenfeldt, thanks for raising, yes absolutely the approach you suggest would be much better as the default. The scikit-allel pairwise_distance function is really just a thin wrapper around scipy pdist, and so any metrics like "cityblock" get passed through to scipy, which has no awareness of missingness.

With scikit-allel you could work around this by passing your own distance function as the value of the metric parameter.

I should also mention that scikit-allel is in maintenance mode and development focus has shifted to the sgkit project. I believe sgkit has a pairwise distance implementation, but it may not be complete or handling missing values in the best way yet, so might be worth starting a discussion to enquire.

feilchenfeldt commented 2 years ago

Thanks for the info @alimanfoo! A pity that allel is not in active development anymore.

For the records, I have written a function around np.einsum, that I think treats missing data correctly (see below). Maybe it is useful for someone.

Also, it treats comparsions between two hets correctly in case of diploid data. (allel.pairwise_distance counts 0 distance between two hets, while it should be half the distance between different homozygotes).

The performance of the function worse than allel.pairwise distance though.

def pairwise_differences(ac, an, correct_for_missing=True):
    """
    Calculate pairwise genetic difference between many
    individuals/populations/species from allele count matrix (2D-array).
    Allele counts can be from individual haplotypes (0 or 1),
    genotypes (0,1,2) (for diploids) or populations (0,1,2,..,an).
    an and ac have input dimensions number of variants x number of populations.
    If normalised by the accessible genome size, the diagonal of the output corresponds to within-population
    nucleotide diversity pi (or, for genotypes, heterozygosity), the off-diagonal entries correspond
    to nucleotide divergence d_xy.

    :param ac: 2D-array non-reference (or derived) allele count in haplotype/individual/population.
               Each row corresponds to a variant (e.g., SNP) site and each column corresponds to
               corresponds to a different haplotype/individual/population.
    :param an: 2D array of allele number for each variant in each haplotype/individual/population.
              Same dimensions as ac. Corresponds to the number of samples per haplotype/individual/population.
              E.g., an should be 1 for haplotypes and 2 for genotypes unless the variant call is missing.
              If the call is missing, then it is 0. For populations of diploid individuals an of SNP i in
              population j corresponds to the number of 2x the number of individuals in population j for
              which genotypes could be called at SNP i.
    :return: Symmetric (n populations x n populations) matrix of absolute counts of pairwise genetic difference.
             Each "population" can also just be a single individual or a single haplotype.

    """
    ac = np.array(ac)
    an = np.array(an).astype(float)
    an[an == 0] = np.nan
    af = ac / an
    af1 = np.nan_to_num(af)
    af1c = np.nan_to_num(1 - af)
    # this corrects for sampling without replacement in self comparisons
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message="invalid value encountered in true_divide")
        warnings.filterwarnings("ignore", message="divide by zero encountered in true_divide")
        af1c_within = np.nan_to_num(1 - (ac - 1.) / (an - 1))
    pi = np.zeros((af.shape[1], af.shape[1]))
    np.fill_diagonal(pi, (2 * af1 * af1c_within).sum(axis=0))
    dxy = np.einsum('ij,ik->jk', af1, af1c) + np.einsum('ij,ik->jk', af1c, af1)
    np.fill_diagonal(dxy, 0)
    pwd = pi + dxy

    if correct_for_missing:
        # should we use an for all or only for variable sites to correct for this?
        n_pwc = n_pairwise_comparisons(an)
        pwd = pwd * len(an) / n_pwc

    return pwd

def n_pairwise_comparisons(an):
    has_data = np.array(an).astype(float)
    has_data[np.isnan(has_data)] = 0
    has_data[has_data > 0] = 1
    comparison_number = np.einsum('ij,ik->jk', has_data, has_data)
    return comparison_number

alimanfoo commented 2 years ago

Thanks @feilchenfeldt. It would be great if you could share these thoughts with the sgkit project somehow. Sgkit is basically scikit-allel version 2, and there are some great folks working on it (cc @jeromekelleher).

jeromekelleher commented 2 years ago

Hi @feilchenfeldt :wave: - yes, we're currently building up the first "public" release of sgkit at the moment, it would be great if you could try it out and see how it works for you.

feilchenfeldt commented 2 years ago

@jeromekelleher I will give it a try! Does sgkit.divergence account correctly for missing data and heterozygous genotypes if my "cohorts" are gt calls of diploid individuals?

Cheers, Hannes

jeromekelleher commented 2 years ago

Does sgkit.divergence account correctly for missing data and heterozygous genotypes if my "cohorts" are gt calls of diploid individuals?

Hmm, I'm not sure it does actually looking at the code but it certainly should. Any bug reports/code contributions along those lines would be much appreciated @feilchenfeldt!

feilchenfeldt commented 2 years ago

Any bug reports/code contributions along those lines would be much appreciated @feilchenfeldt!

I will test sgkit.divergence when I have time and file a report if I think something could be improved, @jeromekelleher!

cggh / scikit-allel

allel.pairwise_distance treats missing calls as litteral -1 #370