caporaso-lab / sourcetracker2

SourceTracker2
BSD 3-Clause "New" or "Revised" License

Output interpretation #127

Closed by fra-cand 3 years ago

fra-cand commented 4 years ago

Hello, I don't know if this is the right place to ask, but I would like a little help interpreting the output. My data set has two source environments and 6 sink samples. Does the percentage for each environment correspond to the sequences found only in that environment? And what about Unknown? Does it contain the reads found in neither source environment plus those found in both, or only those found in neither?

Thank you for your help.

NeginValizadegan commented 3 years ago

Hi @fra-cand

Based on my understanding, the percentage you see for each source in each sample is the proportion of features in that sink sample that are estimated to come from that source. Unknown is basically everything else, that is, whatever was not estimated to come from either of the two sources you provided. What this means is that other sources not listed in your data are contributing to the sink.
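To make that concrete, here is a small Python sketch of how one might read such a table. The sample names, source names, and numbers are all made up for illustration and are not taken from any real SourceTracker2 run. The only point is that each sink's proportions add up to 1, with Unknown being the remainder after the listed sources.

```python
# Hypothetical mixing proportions for two sink samples (made-up numbers).
proportions = {
    "sink_1": {"source_A": 0.55, "source_B": 0.30, "Unknown": 0.15},
    "sink_2": {"source_A": 0.10, "source_B": 0.20, "Unknown": 0.70},
}


def unknown_share(row):
    """Return the fraction of a sink not attributed to any listed source."""
    known = sum(value for name, value in row.items() if name != "Unknown")
    return 1.0 - known


for sink, row in proportions.items():
    total = sum(row.values())
    print(f"{sink}: total={total:.2f}, unknown={unknown_share(row):.2f}")
```

In this toy table, most of sink_2 would be attributed to something other than the two listed sources.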

johnchase commented 3 years ago

Hi, @NeginValizadegan's explanation is correct. Ultimately, SourceTracker is attempting to un-mix a sink sample. In other words, if a sink sample represents a mixture of the source samples as well as other unknown sources, what proportion is estimated to come from each of those? The actual values do not reflect the true number of sequences; they are purely estimates. For a more detailed explanation, have a look at this tutorial: https://github.com/biota/sourcetracker2/blob/master/ipynb/Sourcetracking%20using%20a%20Gibbs%20Sampler.ipynb
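As a rough illustration of what "un-mixing" means, here is a toy Python sketch. It is not SourceTracker2's method: it swaps in a simple expectation-maximization loop instead of the Gibbs sampler described in the tutorial, and it assumes a fixed, uniform feature profile for the unknown source purely to keep the example short. All profiles, counts, and function names are hypothetical.

```python
def estimate_mixing(sink_counts, components, n_iter=2000):
    """Estimate mixing proportions with a simple EM loop (toy stand-in).

    sink_counts: observed count per feature in the sink.
    components: one feature-probability list per candidate source.
    """
    k = len(components)
    weights = [1.0 / k] * k  # start from equal proportions
    for _ in range(n_iter):
        assigned = [0.0] * k
        for f, count in enumerate(sink_counts):
            # Probability that a sequence of feature f came from each source.
            joint = [weights[j] * components[j][f] for j in range(k)]
            norm = sum(joint)
            if count == 0 or norm == 0:
                continue
            for j in range(k):
                assigned[j] += count * joint[j] / norm
        total = sum(assigned)
        weights = [a / total for a in assigned]
    return weights


# Hypothetical feature profiles over four features.
source_a = [0.7, 0.2, 0.1, 0.0]
source_b = [0.0, 0.1, 0.2, 0.7]
unknown = [0.25, 0.25, 0.25, 0.25]  # assumption: fixed uniform profile

# Sink built as 50% source_a, 30% source_b, 20% unknown (1000 sequences).
sink_counts = [400, 180, 160, 260]

weights = estimate_mixing(sink_counts, [source_a, source_b, unknown])
for name, w in zip(["source_a", "source_b", "Unknown"], weights):
    print(f"{name}: {w:.3f}")
```

Note that the sequences are never labeled individually as "from A" or "from B"; features shared between sources are split probabilistically, which is why the output is an estimate of proportions and not a count of sequences.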

I am closing this as there is no action item; however, please feel free to respond to the issue if needed.