Closed · gregcaporaso closed this issue 8 years ago
This sounds like an excellent idea.
I think we should use dataframes rather than dicts because it will work easily with the output of `_gibbs`, which is a dataframe having rows, columns = sinks, sources.
I think we should have:
```python
def absolute_distance(obs, exp):
    return np.abs(obs - exp)
```
What about `observed_composition` and `expected_composition` being `pd.Series` instead of `pd.DataFrame`? I see this method as operating on exactly one sample, and we provide its observed composition and its expected composition.
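For example, if the two compositions were `pd.Series` indexed by source environment, `absolute_distance` as sketched above would align them by index (the source names below are invented for illustration):

```python
import numpy as np
import pandas as pd


def absolute_distance(obs, exp):
    # Element-wise absolute difference; pandas aligns the two Series
    # on their shared index before subtracting.
    return np.abs(obs - exp)


# Invented example compositions for a single sink sample.
observed = pd.Series({'soil': 0.6, 'gut': 0.3, 'Unknown': 0.1})
expected = pd.Series({'soil': 0.5, 'gut': 0.4, 'Unknown': 0.1})
dist = absolute_distance(observed, expected)  # per-source absolute errors
```

Index alignment means the caller doesn't have to worry about the ordering of source environments in the two Series.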
That seems good to me - easy to move between the two and that logic makes a lot of sense.
@wdwvt1, do you have or know of any implementations of Jenson Shannon Distance in Python?
@gregcaporaso - I do not, though the pairwise implementation I wrote is simple. Note, this is the Jensen-Shannon divergence (not a distance metric). The distance metric version is just the square root of the Jensen-Shannon divergence.
This function doesn't check all the conditions that need to be checked for `pk` and `qk`; they must be probability distributions (i.e., sum to 1, be non-negative). If they do not sum to 1, `scipy.stats.entropy` will silently normalize them by dividing each by its sum.
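A minimal check of that normalization behavior (the input values here are arbitrary):

```python
import numpy as np
from scipy.stats import entropy

pk = np.array([2., 2.])  # sums to 4, not 1
# entropy divides pk by its sum internally, so this is treated as [0.5, 0.5]
h = entropy(pk, base=2)  # 1.0 bit, the entropy of a uniform two-state distribution
```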
```python
import numpy as np
from scipy.stats import entropy


def jensen_shannon_pairwise_divergence(pk, qk, base=2):
    '''Compute the Jensen-Shannon divergence between `pk` and `qk`.

    Parameters
    ----------
    pk : np.array
        Array of floats representing a probability distribution (sums to 1).
    qk : np.array
        Array of floats representing a probability distribution (sums to 1).
    base : real, optional
        Base of the logarithm to use.

    Returns
    -------
    float

    Raises
    ------
    ValueError
        If `pk` and `qk` are simultaneously 0 at any index.
    '''
    if np.any((pk == 0) & (qk == 0)):
        raise ValueError('Both `pk` and `qk` are simultaneously zero in at '
                         'least one index. This causes an undefined JSD.')
    m = (pk + qk) / 2.
    return .5 * (entropy(pk, m, base=base) + entropy(qk, m, base=base))
```
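A quick usage sketch; as noted above, the Jensen-Shannon *distance* (the metric version) is just the square root of the divergence:

```python
import numpy as np
from scipy.stats import entropy


def jensen_shannon_pairwise_divergence(pk, qk, base=2):
    # Same function as above, repeated so this snippet runs standalone.
    if np.any((pk == 0) & (qk == 0)):
        raise ValueError('Both `pk` and `qk` are simultaneously zero in at '
                         'least one index. This causes an undefined JSD.')
    m = (pk + qk) / 2.
    return .5 * (entropy(pk, m, base=base) + entropy(qk, m, base=base))


u = np.array([.5, .5])
jsd_zero = jensen_shannon_pairwise_divergence(u, u)  # identical -> 0.0

p = np.array([1., 0.])
q = np.array([0., 1.])
jsd_max = jensen_shannon_pairwise_divergence(p, q)   # disjoint -> 1.0 in base 2

js_distance = np.sqrt(jsd_max)  # the distance-metric version
```

With base 2 the divergence is conveniently bounded in [0, 1], which is why the function defaults to `base=2`.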
Tests
```python
from unittest import TestCase, main

import numpy as np
from scipy.stats import entropy


class TestJSD(TestCase):

    def test_jsd_small(self):
        # Test vs a calculation done with Kullback-Leibler divergence for 2
        # members. This formula for JSD is taken from Wikipedia:
        # https://en.wikipedia.org/wiki/Jensen-Shannon_divergence
        x = np.array([.4, .1, .2, .3])
        y = np.array([.2, .25, .3, .25])
        m = (x + y) / 2.
        exp = (entropy(x, m, base=2) + entropy(y, m, base=2)) / 2.
        obs = jensen_shannon_pairwise_divergence(x, y, base=2)
        self.assertEqual(obs, exp)

    def test_jsd_large(self):
        # Test on two random vectors.
        x = np.random.uniform(size=100)
        y = np.random.uniform(size=100)
        x = x / x.sum()
        y = y / y.sum()
        m = (x + y) / 2.
        exp = (entropy(x, m, base=2) + entropy(y, m, base=2)) / 2.
        obs = jensen_shannon_pairwise_divergence(x, y, base=2)
        self.assertEqual(obs, exp)

    def test_jsd_by_hand(self):
        # Test on a hand-calculated example.
        x = np.array([.3, .6, .05, .05])
        y = np.array([.1, .2, .6, .1])
        exp = .3172067795
        obs = jensen_shannon_pairwise_divergence(x, y, base=2)
        self.assertAlmostEqual(obs, exp)


if __name__ == '__main__':
    main()
```
We also want to add the non-pairwise version, which is also simple, but I haven't done it yet.
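One possible sketch of that generalized (non-pairwise) version, using the identity that for uniform weights the JSD of n distributions is the entropy of their mean minus the mean of their entropies; the function name here is an assumption, not settled API:

```python
import numpy as np
from scipy.stats import entropy


def jensen_shannon_divergence(distributions, base=2):
    # Generalized JSD with uniform weights:
    # JSD(P_1, ..., P_n) = H(mean of P_i) - mean of H(P_i).
    dists = np.asarray(distributions, dtype=float)
    m = dists.mean(axis=0)
    return entropy(m, base=base) - np.mean([entropy(d, base=base)
                                            for d in dists])


jsd2 = jensen_shannon_divergence([np.array([.4, .1, .2, .3]),
                                  np.array([.2, .25, .3, .25])])
```

For n = 2 this agrees with the pairwise implementation above, so the pairwise version could eventually be expressed as a special case of it.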
@wdwvt1, what if we merge the current PR (#64) and then add this method (and the non-pairwise, when it's been developed)?
The upcoming SourceTracker 2 comparisons are going to require a function (e.g., `compare_sourcetracker_results`) that takes an `observed_composition` and `expected_composition` of a sample, and returns a single value indicating their similarity. This function should also take a `metric` parameter which parameterizes how similarity is computed. This is similar in nature to the scikit-bio `beta_diversity` driver function.

Some of the metrics that we'd want to be able to use are:
I envision `observed_composition` and `expected_composition` being dicts mapping source environment type to proportion.

This function would create a pairwise similarity, so we'd also want to be able to summarize many pairwise similarities (e.g., for multiple different samples). We could start with `np.median` for this, and get more fancy when we need to.

@wdwvt1, you had some other ideas about metrics that would be useful for this. Could you add those here?
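A rough sketch of how such a driver could look; the function name comes from this thread, but the dict-based metric registry and the example compositions are assumptions for illustration only:

```python
import numpy as np
import pandas as pd


def compare_sourcetracker_results(observed_composition, expected_composition,
                                  metric='absolute_distance'):
    # Hypothetical driver: look up the requested metric and apply it to the
    # two compositions. Only one illustrative metric is registered here;
    # others (e.g., Jensen-Shannon) would be added to the dict.
    metrics = {
        'absolute_distance': lambda obs, exp: np.abs(obs - exp).sum(),
    }
    return metrics[metric](observed_composition, expected_composition)


# Invented compositions for a single sample.
obs = pd.Series({'soil': 0.7, 'Unknown': 0.3})
exp = pd.Series({'soil': 0.5, 'Unknown': 0.5})
score = compare_sourcetracker_results(obs, exp)
```

Summarizing across many samples would then just be `np.median` over the per-sample scores, as suggested above.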
I could add this pretty quickly if this sounds like the right plan. It's kind of holding me up on an internal biota project right now. We could start with this being a private internal API and explore making it public another time.
cc @johnchase @wdwvt1 @lkursell