daleroberts / hdmedians

High-dimensional medians (medoid, geometric median, etc.). Fast implementations in Python.
Apache License 2.0
71 stars 15 forks source link
high-dimensional-data machine-learning median python statistics

Hdmedians

Did you know there is no unique way to mathematically extend the concept of a median to higher dimensions?

Various definitions for a high-dimensional median exist and this Python package provides a number of fast implementations of these definitions. Medians are extremely useful due to their high breakdown point (up to 50% contamination) and have a number of nice applications in machine learning, computer vision, and high-dimensional statistics.

This package currently has implementations of medoid and geometric median with support for missing data using NaN.

Installation

The latest version of the package is always available on pypi, so can be easily installed by typing:

pip3 install hdmedians

Medoid

Given a finite set of -dimensional observation vectors , the medoid of these observations is given by

The current implementation of medoid is in vectorized Python and can handle any data type supported by ndarray. If you would like the algorithm to take care of missing values encoded as nan then you can use the nanmedoid function.

Examples

Create an 6 x 10 array of random integer observations.

>>> import numpy as np
>>> X = np.random.randint(100, size=(6, 10))
array([[12,  9, 61, 76,  2, 17, 12, 11, 26,  0],
       [65, 72,  7, 64, 21, 92, 51, 48,  9, 65],
       [39,  7, 50, 56, 29, 79, 47, 45, 10, 52],
       [70, 12, 23, 97, 86, 14, 42, 90, 15, 16],
       [13,  7,  2, 47, 80, 53, 23, 59,  7, 15],
       [83,  2, 40, 12, 22, 75, 69, 61, 28, 53]])

Find the medoid, taking the last axis as the number of observations.

>>> import hdmedians as hd
>>> hd.medoid(X)
array([12, 51, 47, 42, 23, 69])

Take the first axis as the number of observations.

>>> hd.medoid(X, axis=0)
array([39,  7, 50, 56, 29, 79, 47, 45, 10, 52])

Since the medoid is one of the observations, the medoid function has the ability to only return the index if required.

>>> hd.medoid(X, indexonly=True)
6
>>> X[:,6]
array([12, 51, 47, 42, 23, 69])

Geometric Median

The geometric median is also known as the 1-median, spatial median, Euclidean minisum, or Torricelli point. Given a finite set of -dimensional observation vectors , the geometric median of these observations is given by

Note there is a subtle difference between the definition of the geometric median and the medoid: the search space for the solution differs and has the effect that the medoid returns one of the true observations whereas the geometric median can be described as a synthetic (not physically observed) observation.

The current implementation of geomedian uses Cython and can handle float64 or float32. If you would like the algorithm to take care of missing values encoded as nan then you can use the nangeomedian function.

Examples

Create an 6 x 10 array of random float64 observations.

>>> import numpy as np
>>> np.set_printoptions(precision=4, linewidth=200)
>>> X = np.random.normal(1, size=(6, 10))
array([[ 1.1079,  0.5763,  0.3072,  1.2205,  0.8596, -1.5082,  2.5955,  2.8251,  1.5908,  0.4575],
       [ 1.555 ,  1.7903,  1.213 ,  1.1285,  0.0461, -0.4929, -0.1158,  0.5879,  1.5807,  0.5828],
       [ 2.1583,  3.4429,  0.4166,  1.0192,  0.8308, -0.1468,  2.6329,  2.2239,  0.2168,  0.8783],
       [ 0.7382,  1.9453,  0.567 ,  0.6797,  1.1654, -0.1556,  0.9934,  0.1857,  1.369 ,  2.1855],
       [ 0.1727,  0.0835,  0.5416,  1.4416,  1.6921,  1.6636,  1.6421,  1.0687,  0.6075, -0.0301],
       [ 2.6654,  1.6741,  1.1568,  1.3092,  1.6944,  0.2574,  2.8604,  1.6102,  0.4301, -0.3876]])
>>> X.dtype
dtype('float64')

Find the geometric median, taking the last axis as the number of observations.

>>> import hdmedians as hd
>>> hd.geomedian(X)
array([ 1.0733,  0.8974,  1.1935,  0.9122,  0.9975,  1.3422])

Take the first axis as the number of observations.

>>> hd.geomedian(X, axis=0)
array([ 1.4581,  1.6377,  0.7147,  1.1257,  1.0493, -0.091 ,  1.7907,  1.4168,  0.9587,  0.6195])

Convert to float32 and compute the geometric median.

>>> X = X.astype(np.float32)
>>> m = hd.geomedian(X)

References