caporaso-lab / sourcetracker2

Different results from different installs #114

Open erimjr opened 5 years ago

erimjr commented 5 years ago

Hi, I've never left a comment on GitHub before so I hope I'm doing this properly.

I'm running SourceTracker2 on a MacBook. I previously installed SourceTracker2 into a dedicated virtual environment from the Bioconda channel ($conda install sourcetracker), but that version doesn't have the --per_sink_feature_assignments flag, which I want. I'll call this install ST2.

I installed the current master version of SourceTracker2 into a different virtual environment following the instructions in issue #85 (using $pip install https://github.com/biota/sourcetracker2/archive/master.zip); I'll call this install ST2.1.

As an aside, when I install ST2.1 the script runs fine on my MacBook (Mojave, 10.14.1), but on Ubuntu (16.04.5 LTS, xenial) I get the applymap error mentioned in issue #111.

When I run the same data set with the same commands in ST2 and ST2.1, I get very different results (well outside of deviation). Most notably, two of my sink samples are actually copies of my source samples: ST2.1 finds only 67% and 81% contribution from the corresponding source, respectively, whereas ST2 finds >99.9% contribution for both (which is the more accurate result).

I already know there are problems with my training data set (I need a greater diversity of sinks for more reliable results and I need more replicates), but I'm trying to establish a workflow that can be followed by other users before putting the rest of the data in.

I've attached the output mixing_proportions.txt files from each of my runs. If you want me to upload my .biom table and map I'm happy to, just tell me.

I'm running each install with no rarefaction of sources or sinks (sourcetracker2 gibbs -i ./otu_table_second_sourcetracker.biom -m ./20181030_map_second_sourcetracker.txt -o ./st2_small_rerun --source_rarefaction_depth 0 --sink_rarefaction_depth 0), adding the --per_sink_feature_assignments flag in ST2.1 (it isn't available in ST2).

Finally, the results from ST2 line up well with results from the original R script.

I think there's a question buried in there somewhere. It's something like, "I've clearly done something wrong, can you help me figure it out?"

Thanks for your help!

mixing_proportions_ST2.1.txt
mixing_proportions_stds_ST2.1.txt
mixing_proportions_stds_ST2.txt
mixing_proportions_ST2.txt

johnchase commented 5 years ago

@erimjr If you are able to upload your table and mapping file, I would be interested in playing around with it. From what you have described, I can't see anything immediately wrong with what you are doing.

erimjr commented 5 years ago

@johnchase No worries. Here you go! GitHub tells me it won't support my .biom file type, so I'll drop a text version of the table here and you can convert it with biom convert. Also, just so you can see them, I'll put up the results from SourceTracker 0.91 (beta), which I ran in RStudio because I'm R-illiterate. I couldn't get the R script to run with the rarefaction depth set to 0, so I set it at 100k, which seemed to work.

non_rarefied_results_first_r.txt
otu_table_second_sourcetracker.txt
map_second_sourcetracker.txt

johnchase commented 5 years ago

@lkursell pointed out that the alpha2 parameter default has changed between versions (the parameter that accounts for the counts of the "unknown" source). Explicitly setting it to the same value in the two versions should produce the same results.

I am going to close this for now, as my testing suggests this accounts for the difference; feel free to comment if it does not give you the same outcome.
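
For example, a minimal sketch via the Python API (assuming gibbs exposes alpha2 as a keyword argument, mirroring the CLI's --alpha2 flag, and using the files shared in this thread):

import pandas as pd
from sourcetracker import gibbs

samples = pd.read_csv('map_second_sourcetracker.txt', sep='\t', index_col=0)
data = pd.read_csv('otu_table_second_sourcetracker.txt', sep='\t',
                   index_col=0, skiprows=1).T
sinks = data.loc[samples[samples.SourceSink == 'sink'].index]
sources = data.loc[samples[samples.SourceSink == 'source'].index]

# Pin alpha2 explicitly so both versions put the same prior mass on the
# "unknown" source; only the default differs between releases.
mp_old, mp_std_old, _ = gibbs(sources, sinks, alpha2=0.001)  # old default
mp_new, mp_std_new, _ = gibbs(sources, sinks, alpha2=0.1)    # new default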

erimjr commented 5 years ago

@johnchase and @lkursell Thanks so much for your time. So you're saying that if I set the alpha2 parameter to .001, I should get the same results from the new version. I'll try it out today. Thanks again! Are there any updates on when a version that doesn't have the applymap error on Linux will be merged into master?

johnchase commented 5 years ago

Yes, that should be the case. I am unsure about the timeline for #112

erimjr commented 5 years ago

@johnchase Sorry to @ you again, I know you guys are really busy. I've rerun my small data set with the alpha2 parameter set to .001 (which is the default in the original R code as well as in the previous version of SourceTracker2), and the results are basically identical to the run with alpha2 set at the new default (.1). I've attached the .pdf output from each (just because they're the easiest to look at), with the filename indicating the date I ran SourceTracker2. My input command was:

$sourcetracker2 gibbs -i /otu_table_second_sourcetracker.biom -m /map_second_sourcetracker.txt -o /20181205_st2.1_first --source_rarefaction_depth 0 --sink_rarefaction_depth 0 --per_sink_feature_assignments --alpha2 .001

Thanks again for your help and advice!

20181129_mixing_proportions.pdf
20181205_mixing_proportions.pdf

johnchase commented 5 years ago

I can try to take a deeper look into this over the weekend; I have not been able to reproduce the results that you are seeing with the development version of sourcetracker.

erimjr commented 5 years ago

Hi @johnchase, have you been able to find my problem? I've tried rerunning this several times with slightly different parameters, but I always end up with the same results: nearly identical outputs from the original R script (which takes about 4 days to run on my machine) and the SourceTracker2 install from Bioconda, and wildly different outputs from the SourceTracker2 install from GitHub (which still errors out on my Ubuntu machine, but runs on my MacBook). I've tried removing and reinstalling the scripts as well, with the same result.

It's a bit frustrating, since the code runs so much faster in Python than in R and my whole dataset is now ready for analysis. My current plan is to get the mixing proportions from my Bioconda install, then set the R script running over the Christmas/New Year holiday and hope it finishes. In case it helps you with troubleshooting, this .biom table is an incidence table from a microarray; the "OTUs" in the table are really just probe sequences. Thanks again for your help!

johnchase commented 5 years ago

Hi @erimjr, thank you for being patient! It took me a minute to track down what is going on here. Previously I was doing all of my testing with the sourcetracker API rather than the command line interface. That was a mistake on my part, as the differences you are seeing are a direct result of running sourcetracker through the command line interface.

When you run sourcetracker through the command line interface, a few things happen to your data; the one we are interested in here is that all of the sources from the same environment are collapsed into a single source sample. In previous versions of sourcetracker the collapsing was done by summing all of the feature (OTU) counts; in the development version it is done by taking the mean of the counts. I was able to dig up some discussion about this from a while back here: https://github.com/biota/sourcetracker2/pull/51#discussion_r73055970

@lkursell @wdwvt1 Would you want to comment on the validity of summing vs. averaging sample counts? IIRC I tended to prefer summing counts rather than taking the mean, although ultimately I believe it comes down to the assumptions one makes about the data. I wonder if a solution would be to add this as a command line parameter and let the user decide?

@erimjr In the meantime I don't have a fantastic solution for quickly reproducing results between versions, although one possibility would be to collapse your data (i.e. combine your source samples) prior to running sourcetracker.

The following will illustrate the differences that @erimjr is seeing:

import pandas as pd
from sourcetracker import gibbs

# Load the mapping file and the OTU table (samples as rows after transposing).
samples = pd.read_csv('map_second_sourcetracker.txt', sep='\t', index_col=0)
data = pd.read_csv('otu_table_second_sourcetracker.txt', sep='\t',
                   index_col=0, skiprows=1).T

sinks = data.loc[samples[samples.SourceSink == 'sink'].index].copy()
sources = data.loc[samples[samples.SourceSink == 'source'].index].copy()

# Collapse the source samples by taking the sum (previous CLI behavior).
sources = pd.DataFrame(sources.sum()).T
sources.index = ['Sewer']

mp, mp_std, feats = gibbs(sources, sinks)
print(mp)

# Collapse the source samples by taking the mean (development CLI behavior).
sources = data.loc[samples[samples.SourceSink == 'source'].index].copy()
sources = pd.DataFrame(sources.mean()).T
sources.index = ['Sewer']

mp, mp_std, feats = gibbs(sources, sinks)
print(mp)

erimjr commented 5 years ago

Thanks @johnchase and @lkursell! So if I understand correctly, there is a difference in the way the source data is collapsed in the API and CLI versions: the API takes a mean of the counts in replicate sources, but the CLI version sums those counts. You think that if I collapse my source samples before putting the table into sourcetracker, that might fix the problem I'm seeing.

I can try it! My only question: since my data is incidence data, I'd guess that using the mean number of counts in replicate sources would give a better representation of the source data. However, I think I remember reading somewhere in the documentation that the counts need to be integers, but maybe I'm wrong, since I can't seem to find that advice again. Thanks again for your patient help :)
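
In case integer counts do turn out to be required, maybe something like this would work (just a sketch, building on the mean-collapse snippet above):

# Continuing from the mean-collapsed sources above: round the averaged
# source profile to the nearest integer so it is still count-like.
sources = pd.DataFrame(sources.mean()).T.round().astype(int)
sources.index = ['Sewer']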

johnchase commented 5 years ago

So if I understand correctly, there is a difference in the way the source data is collapsed in the API and CLI versions.

Not exactly. When I ran sourcetracker through the API, no formatting of the data was done by sourcetracker; it was all done manually. When I collapsed the data manually, I made the mistaken assumption that the command line sourcetracker call was collapsing samples by summing them. That, it turns out, is the difference in the results between the current and previous versions of sourcetracker.

since my data is incidence data, I'd guess that using the mean number of counts in replicate sources would give a better representation of the source data

If this is the case and you wish to use the mean, then you don't need to do anything; the latest version of sourcetracker on GitHub will do this automatically. If you wish to collapse by summing the features of each source, you can do that prior to running sourcetracker: when there is only one sample per source, no collapsing will take place.
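
A minimal sketch of pre-collapsing by summing (assuming pandas, the files shared above, and an 'Env' column in the mapping file naming each source environment):

import pandas as pd

samples = pd.read_csv('map_second_sourcetracker.txt', sep='\t', index_col=0)
data = pd.read_csv('otu_table_second_sourcetracker.txt', sep='\t',
                   index_col=0, skiprows=1).T

# Sum the feature counts of the source samples within each environment so the
# table handed to the CLI has a single sample per source; sinks are unchanged.
source_ids = samples[samples.SourceSink == 'source'].index
collapsed = data.loc[source_ids].groupby(samples.loc[source_ids, 'Env']).sum()

# Write the pre-collapsed table (features as rows, as in the original file).
# The mapping file then needs one source row per environment, with the
# environment name as the sample ID; convert back with biom convert.
pd.concat([collapsed, data.drop(source_ids)]).T.to_csv(
    'otu_table_collapsed.txt', sep='\t')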

erimjr commented 5 years ago

@johnchase Ok, I think I get it now. So I guess the R version of sourcetracker and the version on Bioconda are summing the counts in replicate sources, which seems to give a more "correct" answer in my case? That seems strange to me: considering it's an incidence table, I would guess that simply adding all the 1's together would make source samples with more replicates look very different from the sink samples. Clearly I'm missing something, though, since the result seems to come out better this way. Thanks again for your help. I probably won't get to this before the new year, but I'll let you know what the output looks like once I do!

johnchase commented 5 years ago

Great, I will be interested to hear what you find!

erimjr commented 5 years ago

@johnchase @lkursell Hi! I've run my data through the CLI SourceTracker2 again in a few different iterations. The long and short of it is that collapsing the source data by sum and by average produces wildly different results. In my case, I got a "correct" result from the summation; I interpret this result as correct because it ascribes ~100% of each sink back to itself when I enter the sink data as source data as well. You can see in the attached .pdf that there is little effect from changing the alpha2 parameter from 0.1 to .001 in my case.

Thanks so much for all your time and help! I like the idea of making the source collapse method an option that users can change with a flag at the command line in a future release (just to save me time doing it myself!). Thanks again, you guys do great work and most of us can only marvel at what you achieve.

Matt

source_collapse_sum_and_average.pdf

erimjr commented 5 years ago

Now I just need to figure out how to interpret the per sink feature assignments 👍
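
If I understand the API output correctly (my assumption: feats is a list of per-sink tables, one row per source plus "Unknown", one column per feature), something like this would show which features drove each assignment:

# From an API run: mp, mp_std, feats = gibbs(sources, sinks)
# Assumed layout: feats[i] corresponds to sinks.index[i].
for sink_id, table in zip(sinks.index, feats):
    # Top 10 features assigned to the 'Sewer' source for this sink.
    top = table.loc['Sewer'].sort_values(ascending=False).head(10)
    print(sink_id)
    print(top)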

johnchase commented 5 years ago

Thanks for the update. This is what I had seen with your data as well. I'm fascinated by this, as we had tended not to see large differences as a result of how sources were collapsed. I think I still have much to understand about how 16S data is interpreted by sourcetracker!