Closed: lkursell closed this 6 years ago
@lkursell Can you elaborate on the math behind this function?
For example:
import numpy as np
import pandas as pd
sources = pd.DataFrame(np.random.randint(0, 10, (2, 10)))
sinks = pd.DataFrame(sources.sum()).T
method5(sources, sinks)
Gives the expected result:
0 1
0 0.502564 0.497436
However if I multiply one of the sources by 10 before combining it into the sink:
sinks = pd.DataFrame(sources.iloc[0] *10 + sources.iloc[1]).T
method5(sources, sinks)
0 1
0 0.587445 0.412555
I would expect the results to approximately reflect this, but they don't seem to. gibbs does seem to recover approximately the correct mixing proportions:
mpm, mps, fas = gibbs(sources, sinks)
mpm
0 1 Unknown
0 0.806011 0.08816 0.105829
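For reference, the proportions I would expect in the second example can be computed directly from the counts. This is only a sketch of that arithmetic: `expected_proportions` and its `weights` argument are hypothetical names for illustration and are not part of sourcetracker.

```python
import numpy as np
import pandas as pd


def expected_proportions(sources, weights):
    """Fraction of sink sequences contributed by each source.

    Hypothetical helper: assumes the sink was built as
    sum(weights[i] * sources.iloc[i]) and that at least one count is nonzero.
    """
    weights = np.asarray(weights, dtype=float)
    # Total sequences each source contributes to the sink after weighting.
    contributed = sources.sum(axis=1).to_numpy(dtype=float) * weights
    return pd.Series(contributed / contributed.sum(), index=sources.index)


# Small fixed example: source 0 has 6 sequences, source 1 has 15.
sources = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# Sink built as sources.iloc[0] * 10 + sources.iloc[1]:
# source 0 contributes 60 of 75 sequences, source 1 contributes 15 of 75.
print(expected_proportions(sources, [10, 1]))
```

With two random sources of similar depth, as in the example above, the same calculation puts source 0 at roughly 10/11 of the sink, which is much closer to what gibbs reports than to what method5 reports.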
This PR introduces method5, which calculates the percent contributions of sequences to a sink using only the probability of seeing a feature in the source. This method does not create an Unknown source and does not use the gibbs sampling approach. It is only exposed via a private API, and can be used for testing sourcetracker scripts without waiting on the longer gibbs call to run.
@gregcaporaso @wdwvt1 we need a more appropriate and descriptive name for this method, and I'm open to suggestions. I'll email Dan as well to get his naming input.
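To make the description concrete, here is one possible reading of "using only the probability of seeing a feature in the source". It is a minimal sketch and an assumption on my part, not necessarily what method5 actually computes. The function name `feature_probability_proportions` is hypothetical. It assumes every source has at least one count and that each sink sequence is split across sources in proportion to the per-source feature probabilities.

```python
import numpy as np
import pandas as pd


def feature_probability_proportions(sources, sinks):
    """Assumed reading of the approach; not the actual method5 code.

    sources: (n_sources, n_features) count table, each row with a nonzero sum.
    sinks:   (n_sinks, n_features) count table.
    """
    src = sources.to_numpy(dtype=float)
    # p[i, j]: probability of seeing feature j in source i.
    p = src / src.sum(axis=1, keepdims=True)
    col_totals = p.sum(axis=0)
    # share[i, j]: fraction of a feature-j sink sequence assigned to source i.
    # Features absent from every source get a share of zero.
    share = np.divide(p, col_totals, out=np.zeros_like(p), where=col_totals > 0)
    # assigned[s, i]: number of sequences in sink s attributed to source i.
    assigned = sinks.to_numpy(dtype=float) @ share.T
    proportions = assigned / assigned.sum(axis=1, keepdims=True)
    return pd.DataFrame(proportions, index=sinks.index, columns=sources.index)


# Sources with disjoint features: the split follows the sink counts exactly.
sources = pd.DataFrame([[5, 0], [0, 5]])
sinks = pd.DataFrame([[30, 10]])
print(feature_probability_proportions(sources, sinks))
```

Under this reading there is no Unknown source and no sampling, so it runs in a single pass. One consequence of the sketch is that sources with similar feature distributions get nearly equal shares regardless of how much each really contributed to the sink.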