caporaso-lab / sourcetracker2

SourceTracker2
BSD 3-Clause "New" or "Revised" License

Alternative mixing proportions algorithm #83

Closed lkursell closed 6 years ago

lkursell commented 7 years ago

This PR introduces method5, which calculates the percent contributions of sequences to a sink using only the probability of seeing a feature in the source. This method does not create an Unknown source and does not use the Gibbs sampling approach. It is only exposed via a private API, and can be used for testing sourcetracker scripts without waiting for the longer gibbs call to run.

@gregcaporaso @wdwvt1 we need a more appropriate and descriptive name for this method, and I'm open to suggestions. I'll email Dan as well to get his naming input.
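To make the description above concrete, here is a minimal sketch of one plausible reading of it. It is not the code from this PR: the function name `feature_probability_proportions` is hypothetical, and the rule it uses (split each sink feature's counts across sources in proportion to p(feature | source), then sum and normalize) is an assumption about what "using only the probability of seeing a feature in the source" could mean.

```python
import numpy as np
import pandas as pd


def feature_probability_proportions(sources, sinks):
    """Hypothetical sketch, NOT the PR's method5.

    Splits each sink feature's counts across sources in proportion to
    p(feature | source), then normalizes the per-source totals.
    """
    # p(feature | source): each source row normalized to sum to 1.
    p = sources.div(sources.sum(axis=1), axis=0).to_numpy(dtype=float)
    # Per-feature share of each source; 0 where no source has the feature.
    col_tot = p.sum(axis=0)
    share = np.divide(p, col_tot, out=np.zeros_like(p), where=col_tot > 0)

    results = []
    for _, sink in sinks.iterrows():
        counts = sink.to_numpy(dtype=float)
        assigned = (share * counts).sum(axis=1)  # counts credited to each source
        results.append(assigned / assigned.sum())
    return pd.DataFrame(results, index=sinks.index, columns=sources.index)


# Small deterministic example (made-up counts, equal source depths).
sources = pd.DataFrame([[4, 0, 6], [0, 5, 5]])
balanced = feature_probability_proportions(sources, pd.DataFrame(sources.sum()).T)
skewed = feature_probability_proportions(
    sources, pd.DataFrame(sources.iloc[0] * 10 + sources.iloc[1]).T
)
print(balanced)
print(skewed)
```

Under this assumed rule there is no Unknown source and no sampling, so it runs instantly. Note that shared features are always split by the same fixed ratio regardless of the sink, so a sink dominated by one source is not fully credited to it.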

coveralls commented 7 years ago

Coverage increased (+0.03%) to 99.257% when pulling 4e512d6592892a50ab27a49d8232a7d2bd715eb0 on lkursell:method5 into ef3d7adf50ed9e92a914e9e3dcd55bfc1c46d859 on biota:master.

coveralls commented 7 years ago

Coverage increased (+0.03%) to 99.254% when pulling f164327e875b3f396c02a07df2a2b3d691ed4d28 on lkursell:method5 into ef3d7adf50ed9e92a914e9e3dcd55bfc1c46d859 on biota:master.

johnchase commented 6 years ago

@lkursell Can you elaborate on the math behind this function?

For example:

import numpy as np
import pandas as pd
sources = pd.DataFrame(np.random.randint(0, 10, (2, 10)))
sinks = pd.DataFrame(sources.sum()).T
method5(sources, sinks)

Gives the expected result:

          0         1
0  0.502564  0.497436

However, if I multiply one of the sources by 10 before combining it into the sink:

sinks = pd.DataFrame(sources.iloc[0] * 10 + sources.iloc[1]).T
method5(sources, sinks)

          0         1
0  0.587445  0.412555

I would expect the results to approximately reflect this weighting, but they don't seem to. gibbs does seem to recover approximately the correct mixing proportions:

mpm, mps, fas = gibbs(sources, sinks)
mpm

          0        1   Unknown
0  0.806011  0.08816  0.105829
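For reference, the proportions one would expect from a sink built this way can be computed directly. The sketch below assumes "correct mixing proportions" means the fraction of sink counts contributed by each source; the seeded generator and the `weights` array are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
sources = pd.DataFrame(rng.integers(0, 10, (2, 10)))
weights = np.array([10, 1])  # multiplier applied to each source row

# Sink built as in the example above: 10 * source 0 + 1 * source 1.
sink = pd.Series((sources.to_numpy() * weights[:, None]).sum(axis=0))

# Counts each source contributed to the sink, and the resulting fractions.
contributed = sources.sum(axis=1) * weights
expected = contributed / contributed.sum()
print(expected)
```

When the two sources have similar total counts, the expected split is close to 10:1, which is the yardstick for comparing the method5 and gibbs outputs above.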