caporaso-lab / sourcetracker2

SourceTracker2

per_sink_feature_assignments output #95

Open lakarstens opened 6 years ago

lakarstens commented 6 years ago

Hi, I do not fully understand the output from running sourcetracker with the --per_sink_feature_assignments flag and am wondering if anyone can explain the results of the feature_tables that are output with this flag.

Based on the description, I anticipated the results for each sample would be a table with the number (or proportion) of sequences from each feature estimated to be from each source. However, looking at the results in the sample_feature_table.txt files, the sum of the counts in each feature column is greater than in the original data, and the values do not appear to be proportions. It seems like the total number of reads per sample in the feature tables is scaled to 10,000; but even after scaling the original data to 10,000 reads/sample (the test data has between 1,900 and 2,080 reads per sample), the number of sequences per feature does not add up, though the values are closer.

Can you please provide information about what the feature tables contain? Are these results normalized in some way?

I ran sourcetracker2 with the tiny-test data provided in this repository (data/tiny-test/) using the following command:

sourcetracker2 gibbs -i otu_table.txt -m map.txt --per_sink_feature_assignments -o example1b/

I installed sourcetracker2 as indicated in Issue #85.

I have attached a file to show what I am referring to for two samples (s0 and s1). This contains the original and normalized OTU tables, along with the feature table output from sourcetracker2 for each of these samples (I can provide the results for all the test samples if needed). Let me know if you need any additional information.

Thanks!

sourcetracker_results_s0_s1.txt

wdwvt1 commented 6 years ago

Hi @lakarstens - sorry for the slow reply! I have been very busy with other projects.

> the sum of the counts in each feature column is greater than in the original data, and the values do not appear to be proportions.

The sum of a sink in the output will be draws_per_restart * restarts * input_sink_sum. So, if your sink had 1,000 counts going in, and you ran 10 draws_per_restart and 10 restarts, your output sink would have 100,000 counts. This is because internally ST2 just adds up the result of each draw to get the final count for each feature from each sink.
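The arithmetic in this reply can be restated as a short Python sketch. The helper name `expected_output_sum` is hypothetical and not part of sourcetracker2; it only multiplies the three quantities named above.

```python
def expected_output_sum(input_sink_sum, draws_per_restart, restarts):
    """Total count expected in one sink's per-sink feature table.

    Hypothetical helper, not part of sourcetracker2: every draw assigns each
    sequence in the sink to a source, and the draws are summed, so the total
    is the product of the three quantities.
    """
    return input_sink_sum * draws_per_restart * restarts


# Example from the reply: 1,000 counts, 10 draws per restart, 10 restarts.
print(expected_output_sum(1000, 10, 10))  # 100000
```

If the sink is rarefied first, the rarefaction depth takes the place of the input sum. For instance, a depth of 1,000 with 10 draws in total would give the 10,000 reads per sample reported in the question; those two values are assumed here for illustration and are not confirmed in this thread.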

> but even after scaling the original data to 10,000 reads/sample (the test data has between 1,900 and 2,080 reads per sample), the number of sequences per feature does not add up, though the values are closer.

This is confusing, I agree. The data are being rarefied, which prevents the feature counts from scaling exactly with the input (e.g. with an input of 10 for o1 and 5 draws, your example output should contain a total count of 50 for o1, and similarly 100 for o2, etc.). I ran without rarefaction on the test data (the same data you used) and it performs as expected.

sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example1/ --source_rarefaction_depth 0 --sink_rarefaction_depth 0 --per_sink_feature_assignments

The output for s0 is attached. Let me know if this helps or if I can clarify more (I'll be quick to respond after I've knocked out this seminar I'm giving tomorrow).

s0.feature_table.txt
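For readers wondering why rarefaction breaks the exact per-feature scaling, here is a rough Python sketch. It treats rarefaction as random subsampling without replacement, which is an assumption about the general technique and not a copy of ST2's code. The feature counts, the depth of 1,000, and the 10 total draws are made up for illustration.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Subsample `depth` sequences without replacement from a count vector.

    Illustrative only: generic rarefaction, not sourcetracker2's own code.
    """
    rng = random.Random(seed)
    # Expand {feature: count} into one label per sequence, then subsample.
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(pool, depth))


# Made-up sink with 2,000 reads across three features.
sink = {"o1": 10, "o2": 20, "o3": 1970}
total_draws = 10  # restarts * draws_per_restart, assumed for illustration

rarefied = rarefy(sink, depth=1000)
for feature in sink:
    unrarefied_total = sink[feature] * total_draws
    rarefied_total = rarefied[feature] * total_draws
    print(feature, unrarefied_total, rarefied_total)

# Per-feature totals no longer track the input, but the overall total is
# always depth * total_draws.
print(sum(rarefied.values()) * total_draws)  # 10000
```

Because the subsample keeps only a random half of the reads in this example, each feature's output count depends on which reads were kept, so rescaling the original table to the same total will get close to the output without matching it.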

lakarstens commented 6 years ago

That makes sense, thank you for the thorough explanation! ~Lisa