caporaso-lab / sourcetracker2

SourceTracker2

add functions for comparing sourcetracker results #57

Closed gregcaporaso closed 8 years ago

gregcaporaso commented 8 years ago

The upcoming SourceTracker 2 comparisons are going to require a function (e.g., compare_sourcetracker_results) that takes the observed_composition and expected_composition of a sample and returns a single value indicating their similarity. This function should also take a metric parameter that specifies how similarity is computed. This is similar in nature to the scikit-bio beta_diversity driver function.

Some of the metrics that we'd want to be able to use are:

I envision observed_composition and expected_composition being dicts mapping source environment type to proportion.

This function would compute a single pairwise similarity, so we'd also want to be able to summarize many pairwise similarities (e.g., across multiple samples). We could start with np.median for this and get fancier when we need to.
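A rough sketch of what this could look like (the alignment step and the summarize_comparisons helper are illustrative only, not a settled API):

import numpy as np

def compare_sourcetracker_results(observed_composition, expected_composition,
                                  metric):
    '''Compare one sample's observed and expected source compositions.

    observed_composition, expected_composition : dict
        Map source environment type to mixing proportion.
    metric : callable
        Takes two aligned 1-D arrays of proportions and returns a float.
    '''
    # Align on the union of source environments, treating a missing source
    # as proportion 0.
    sources = sorted(set(observed_composition) | set(expected_composition))
    obs = np.array([observed_composition.get(s, 0.0) for s in sources])
    exp = np.array([expected_composition.get(s, 0.0) for s in sources])
    return metric(obs, exp)

def summarize_comparisons(similarities, summary=np.median):
    '''Summarize many per-sample similarities into a single value.'''
    return summary(similarities)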

@wdwvt1, you had some other ideas about metrics that would be useful for this. Could you add those here?

I could add this pretty quickly if this sounds like the right plan. It's kind of holding me up on an internal biota project right now. We could start with this being a private internal API and explore making it public another time.

cc @johnchase @wdwvt1 @lkursell

wdwvt1 commented 8 years ago

This sounds like an excellent idea.

I think we should use DataFrames rather than dicts because that will work directly with the output of _gibbs, which is a DataFrame with rows = sinks and columns = sources.

I think we should have:

import numpy as np

def absolute_distance(obs, exp):
    # Element-wise absolute difference between observed and expected
    # source proportions.
    return np.abs(obs - exp)
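As a rough usage sketch against the shape of the _gibbs output, using the absolute_distance above (the DataFrame values and index/column labels are made up):

import pandas as pd

# Made-up mixing proportions in the _gibbs shape (rows = sinks,
# columns = sources).
obs = pd.DataFrame([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]],
                   index=['sink1', 'sink2'],
                   columns=['soil', 'gut', 'Unknown'])
exp = pd.DataFrame([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
                   index=['sink1', 'sink2'],
                   columns=['soil', 'gut', 'Unknown'])

# Per-source absolute differences for a single sink (a pd.Series).
absolute_distance(obs.loc['sink1'], exp.loc['sink1'])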
gregcaporaso commented 8 years ago

What about observed_composition and expected_composition being pd.Series instead of pd.DataFrame? I see this method as operating on exactly one sample, and we provide its observed composition and its expected composition.

wdwvt1 commented 8 years ago

That seems good to me: it's easy to move between the two, and that logic makes a lot of sense.
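For the record, moving between the two really is a one-liner in pandas (the values below are made up; mixing_proportions is a stand-in name for the _gibbs output):

import pandas as pd

# A dict mapping source environment to proportion becomes a pd.Series...
observed_composition = pd.Series({'soil': 0.6, 'gut': 0.3, 'Unknown': 0.1})

# ...and selecting one sink's row from the _gibbs DataFrame (rows = sinks,
# columns = sources) already yields a pd.Series:
# expected_composition = mixing_proportions.loc['sink1']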

gregcaporaso commented 8 years ago

@wdwvt1, do you have or know of any implementations of the Jensen-Shannon distance in Python?

wdwvt1 commented 8 years ago

@gregcaporaso - I do not, though the pairwise implementation I wrote is simple. Note that this is the Jensen-Shannon divergence (not a distance metric); the distance-metric version is just the square root of the Jensen-Shannon divergence.

This function doesn't check all the conditions that `pk` and `qk` need to satisfy: they must be probability distributions (i.e., non-negative and summing to 1). If they do not sum to 1, scipy.stats.entropy will silently normalize them by dividing each by its sum, which may not be what the caller intended.
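A quick illustration of that normalization behavior (just a check, not part of the proposed API):

import numpy as np
from scipy.stats import entropy

p = np.array([0.2, 0.2, 0.2])  # sums to 0.6, not 1
q = np.array([0.3, 0.3, 0.3])  # sums to 0.9, not 1
# entropy() divides each input by its sum before computing, so this is the
# KL divergence of two identical normalized distributions: 0.0.
print(entropy(p, q, base=2))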

from scipy.stats import entropy
def jensen_shannon_pairwise_divergence(pk, qk, base=2):
    '''Compute the Jensen-Shannon divergence between `pk` and `qk`.

    Parameters
    ----------
    pk : np.array
        Array of floats representing a probability distribution (sums to 1).
    qk : np.array
        Array of floats representing a probability distribution (sums to 1).
    base : real, optional
        Base of the logarithm to use.

    Returns
    -------
    float

    Raises
    ------
    ValueError
        If `pk` and `qk` are both 0 at the same index.
    '''
    if any((pk == 0) & (qk == 0)):
        raise ValueError('Both `pk` and `qk` are simultaneously zero in at '
                         'least one index. This causes an undefined JSD.')
    m = (pk + qk) / 2.
    return .5 * (entropy(pk, m, base=base) + entropy(qk, m, base=base))
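And per the note above, a distance-metric version would just be a thin wrapper around this (the name below is illustrative):

import numpy as np

def jensen_shannon_pairwise_distance(pk, qk, base=2):
    # The Jensen-Shannon distance is the square root of the divergence; it is
    # the square root that satisfies the triangle inequality.
    return np.sqrt(jensen_shannon_pairwise_divergence(pk, qk, base=base))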

Tests

from unittest import TestCase, main
import numpy as np
from scipy.stats import entropy

class TestJSD(TestCase):
    def test_jsd_small(self):
        # Test against a direct calculation with the Kullback-Leibler
        # divergence formula for two distributions. This formula for the JSD
        # is taken from Wikipedia:
        # https://en.wikipedia.org/wiki/Jensen-Shannon_divergence
        x = np.array([.4, .1, .2, .3])
        y = np.array([.2, .25, .3, .25])
        m = (x + y) / 2.
        exp = (entropy(x, m, base=2) + entropy(y, m, base=2)) / 2.
        obs = jensen_shannon_pairwise_divergence(x, y, base=2)
        self.assertEqual(obs, exp)

    def test_jsd_large(self):
        # Test on two random vectors
        x = np.random.uniform(size=100)
        y = np.random.uniform(size=100)
        x = x / x.sum()
        y = y / y.sum()
        m = (x + y) / 2.
        exp = (entropy(x, m, base=2) + entropy(y, m, base=2)) / 2.
        obs = jensen_shannon_pairwise_divergence(x, y, base=2)
        self.assertEqual(obs, exp)

    def test_jsd_by_hand(self):
        # Test on hand calculated example.
        x = np.array([.3, .6, .05, .05])
        y = np.array([.1, .2, .6, .1])
        exp = .3172067795
        obs = jensen_shannon_pairwise_divergence(x, y, base=2)
        self.assertAlmostEqual(obs, exp)


if __name__ == '__main__':
    main()

We'll also want to add the non-pairwise version, which is also simple, but I haven't written it yet.
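For reference, one way the non-pairwise (multi-distribution) version could look, using the generalized form H(sum_i w_i p_i) - sum_i w_i H(p_i) with equal weights (the name and signature here are just a sketch):

import numpy as np
from scipy.stats import entropy

def jensen_shannon_divergence(distributions, base=2):
    '''Generalized JSD of several probability distributions, equally weighted.

    distributions : 2-D np.ndarray
        Each row is a probability distribution (non-negative, sums to 1).
    '''
    distributions = np.asarray(distributions, dtype=float)
    weights = np.ones(distributions.shape[0]) / distributions.shape[0]
    mixture = weights @ distributions
    # H(mixture) - sum_i w_i * H(p_i), with entropies in the given log base.
    return (entropy(mixture, base=base)
            - np.sum(weights * np.array([entropy(p, base=base)
                                         for p in distributions])))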

gregcaporaso commented 8 years ago

@wdwvt1, what if we merge the current PR (#64) and then add this method (and the non-pairwise version, once it's been developed)?