caporaso-lab / sourcetracker2

SourceTracker2
BSD 3-Clause "New" or "Revised" License

unbalanced data #116

Open sachasuca opened 5 years ago

sachasuca commented 5 years ago

Is it appropriate to use SourceTracker with unbalanced or missing data? I have 5 sources for my 1 sink. I collected n=52 for each sample type (52*6=312); however, some samples were discarded because they had <1000 sequences after filtering (the minimum we need, based on rarefaction plots, to be confident we have adequately sampled the community). Consequently, I have a total of n=289 samples (n=52 for the sink and n=42-50 for each source) and a total of n=42 "complete sets" (i.e., SampleIDs for which data are available for all 5 sources and the 1 sink).

I ran sourcetracker2 on all n=289 samples. I am now wondering whether the algorithm is sensitive to this unbalanced and missing data. Would you recommend running it only on the "complete sets"?

lkursell commented 5 years ago

Hi @sachasuca - the general QIIME2 forum might be a good place to ask that question, since it is an applied question rather than strictly code-based.

I would run SourceTracker (ST) a few ways:

1) Use each individual source sample as its own source, then sum the results on a per-source basis.
2) Use all data available from the sources after your QC filtering.
3) Randomly subsample each source down to the smallest per-source sample count; in this case, grab a random set of 42 samples from each source.
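
For option 3, here is a minimal sketch of how the balanced subsampling could be done before running ST. It is not part of sourcetracker2; the function name `subsample_sources` and the toy sample IDs are hypothetical, and it assumes you already have a mapping from each source SampleID to its source type (e.g. built from your metadata file).

```python
import random
from collections import defaultdict


def subsample_sources(sample_to_source, n_per_source=None, seed=0):
    """Pick the same number of samples at random from every source.

    sample_to_source: dict mapping source sample ID -> source type
                      (hypothetical input; sink samples are left out).
    n_per_source:     samples to keep per source; defaults to the size
                      of the smallest source.
    seed:             fixed seed so the selection can be reproduced.
    """
    # Group sample IDs by their source type.
    by_source = defaultdict(list)
    for sample_id, source in sample_to_source.items():
        by_source[source].append(sample_id)

    if n_per_source is None:
        n_per_source = min(len(ids) for ids in by_source.values())

    rng = random.Random(seed)
    kept = []
    for source in sorted(by_source):
        # Sort first so the draw does not depend on dict ordering.
        kept.extend(rng.sample(sorted(by_source[source]), n_per_source))
    return sorted(kept)


# Toy example: source "A" has 3 samples, source "B" has 2.
toy = {"A1": "A", "A2": "A", "A3": "A", "B1": "B", "B2": "B"}
balanced = subsample_sources(toy, seed=42)
print(len(balanced))  # 2 samples per source, 4 in total
```

You would then filter your metadata and feature table down to the returned source SampleIDs plus all sink samples. Repeating the draw with a few different seeds and comparing the resulting ST proportions should give a rough sense of how sensitive the estimates are to which samples are dropped.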