Did you know there is no unique way to mathematically extend the concept of a median to higher dimensions?
Various definitions for a high-dimensional median exist and this Python package provides a number of fast implementations of these definitions. Medians are extremely useful due to their high breakdown point (up to 50% contamination) and have a number of nice applications in machine learning, computer vision, and high-dimensional statistics.
This package currently has implementations of medoid and geometric
median with support for missing data using NaN
.
The latest version of the package is always available on pypi, so can be easily installed by typing:
pip3 install hdmedians
Given a finite set of -dimensional observation vectors , the medoid of these observations is given by
The current implementation of medoid
is in vectorized Python and can handle
any data type supported by
ndarray.
If you would like the algorithm to take care of missing values encoded as nan
then you can use the nanmedoid
function.
Create an 6 x 10 array of random integer observations.
>>> import numpy as np
>>> X = np.random.randint(100, size=(6, 10))
array([[12, 9, 61, 76, 2, 17, 12, 11, 26, 0],
[65, 72, 7, 64, 21, 92, 51, 48, 9, 65],
[39, 7, 50, 56, 29, 79, 47, 45, 10, 52],
[70, 12, 23, 97, 86, 14, 42, 90, 15, 16],
[13, 7, 2, 47, 80, 53, 23, 59, 7, 15],
[83, 2, 40, 12, 22, 75, 69, 61, 28, 53]])
Find the medoid, taking the last axis as the number of observations.
>>> import hdmedians as hd
>>> hd.medoid(X)
array([12, 51, 47, 42, 23, 69])
Take the first axis as the number of observations.
>>> hd.medoid(X, axis=0)
array([39, 7, 50, 56, 29, 79, 47, 45, 10, 52])
Since the medoid is one of the observations, the medoid
function has the ability to only return the index if required.
>>> hd.medoid(X, indexonly=True)
6
>>> X[:,6]
array([12, 51, 47, 42, 23, 69])
The geometric median is also known as the 1-median, spatial median, Euclidean minisum, or Torricelli point. Given a finite set of -dimensional observation vectors , the geometric median of these observations is given by
Note there is a subtle difference between the definition of the geometric median and the medoid: the search space for the solution differs and has the effect that the medoid returns one of the true observations whereas the geometric median can be described as a synthetic (not physically observed) observation.
The current implementation of geomedian
uses Cython and can handle float64
or float32
. If you would like the algorithm to take care of missing values
encoded as nan
then you can use the nangeomedian
function.
Create an 6 x 10 array of random float64
observations.
>>> import numpy as np
>>> np.set_printoptions(precision=4, linewidth=200)
>>> X = np.random.normal(1, size=(6, 10))
array([[ 1.1079, 0.5763, 0.3072, 1.2205, 0.8596, -1.5082, 2.5955, 2.8251, 1.5908, 0.4575],
[ 1.555 , 1.7903, 1.213 , 1.1285, 0.0461, -0.4929, -0.1158, 0.5879, 1.5807, 0.5828],
[ 2.1583, 3.4429, 0.4166, 1.0192, 0.8308, -0.1468, 2.6329, 2.2239, 0.2168, 0.8783],
[ 0.7382, 1.9453, 0.567 , 0.6797, 1.1654, -0.1556, 0.9934, 0.1857, 1.369 , 2.1855],
[ 0.1727, 0.0835, 0.5416, 1.4416, 1.6921, 1.6636, 1.6421, 1.0687, 0.6075, -0.0301],
[ 2.6654, 1.6741, 1.1568, 1.3092, 1.6944, 0.2574, 2.8604, 1.6102, 0.4301, -0.3876]])
>>> X.dtype
dtype('float64')
Find the geometric median, taking the last axis as the number of observations.
>>> import hdmedians as hd
>>> hd.geomedian(X)
array([ 1.0733, 0.8974, 1.1935, 0.9122, 0.9975, 1.3422])
Take the first axis as the number of observations.
>>> hd.geomedian(X, axis=0)
array([ 1.4581, 1.6377, 0.7147, 1.1257, 1.0493, -0.091 , 1.7907, 1.4168, 0.9587, 0.6195])
Convert to float32
and compute the geometric median.
>>> X = X.astype(np.float32)
>>> m = hd.geomedian(X)