brianstock / MixSIAR

A framework for Bayesian mixing models in R:
http://brianstock.github.io/MixSIAR/

Sample size and including or excluding sources #379

Open Mint204 opened 5 months ago

Mint204 commented 5 months ago

I had a question about sample size and when to include a source.

I have a model I'm building with hair from about 40 animals, cut into 4 sections (seasons). So there's a random effect for individual and one fixed effect with 4 levels. I have 8 possible source groups for this one carnivore.

I'm worried about overparameterizing the model, but from what I can see, it doesn't make sense to combine any of the sources any further. I know Bayesian models can handle smaller sample sizes than frequentist methods, but how do I know if I've asked too much of the model? Is there a good rule of thumb?

Also, at what point is it acceptable to leave a possible source out? I know the models expect to have "every" source, but what are the limits? For example, is it acceptable to leave a source out if it is only found a few odd times during necropsies, or if it is only found 1% of the time?

[Attached: isotope biplot of the sources and consumers]
naalipalo commented 5 months ago

I am by no means an expert, but I've been told you shouldn't have more than 5 prey items. Also, prey items that are "rare", or as you indicated found <1% of the time, wouldn't be helpful. At one point the MixSIAR instructions actually say something about not showing prey items that make up <1% of the diet. I have been told a conservative approach is to not bother with anything <5% of the diet.

You also have to consider all the error associated with every step you've taken to get to the analysis stage. Were your C and N values measured from prey items collected in the same year and same location as your consumer? Is the machine calibrated correctly to read the values? Is there contamination at some point in your sampling? And, very importantly, are your TDF values absolutely correct? This process/analysis is not precise. Again, I'm not an expert, but I have been advised along these lines. Whether you use the 5% or 1% cutoff is up to you, but mixing models cannot handle more than 5 prey items.

There are some formal-ish ways to assess your mixing space. Look at Smith et al. 2013, "To fit or not to fit: Evaluating stable isotope mixing models using simulated mixing polygons". It defines your mixing space, and you then censor out the consumers that do not fit. Or you can run a PERMANOVA between each pair of sources to confirm they are statistically distinct, or a KNN that can form groups for you.

From your graph it doesn't look like sources 8, 3, and 4 are having a huge impact on your consumers. You also have a lot of overlap among several of your prey species, and you will have a really hard time distinguishing between those prey items. A KNN would help you, and would likely tell you to group prey 1, 2, and 7, or something like that. You should also look at some of the papers on best practices for setting up mixing models; there are at least two good ones out there.
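The pairwise-separation idea mentioned above can be sketched in a few lines of base R. This is a hypothetical example (simulated d13C/d15N values, made-up means and spreads), using a two-sample MANOVA as a stand-in for the PERMANOVA the comment describes (a true PERMANOVA, e.g. `vegan::adonis2`, would relax the multivariate-normality assumption):

```r
# Hypothetical data: two sources with heavily overlapping isotope values
set.seed(1)
src1 <- data.frame(d13C = rnorm(15, -20.0, 0.5), d15N = rnorm(15, 8.0, 0.5))
src2 <- data.frame(d13C = rnorm(15, -19.8, 0.5), d15N = rnorm(15, 8.1, 0.5))

dat     <- rbind(src1, src2)
dat$grp <- factor(rep(c("src1", "src2"), each = 15))

# Test whether the two sources occupy distinct positions in isotope space
fit <- manova(cbind(d13C, d15N) ~ grp, data = dat)
summary(fit)  # a large p-value suggests this pair is hard to distinguish
```

Note that, as the reply below points out, such tests are better treated as a descriptive check of source overlap than as an automatic rule for aggregating sources.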

AndrewLJackson commented 5 months ago

There are a lot of questions here but I will try to answer / direct you to answers. Most of these are addressed in our paper Best practices for use of stable isotope mixing models in food-web studies and I would strongly encourage you to read this carefully and follow the references for more information where required.

Omitting a source: this may have almost zero effect or it might have a large effect even if it's <5% of the diet. See Point 6 in that "Best practices" paper.

Combining sources: there is no settled advice here, and even among the authors of MixSIAR we sometimes disagree. But in general, a priori aggregation of samples beyond what is sensible to the user is to be avoided in favour of a posteriori aggregation. See point 7. To add some context: we of course often choose to aggregate individual source samples into groups by species, but one could just as easily split a species into two groups by sex, location, or season. Similarly, one could aggregate species into functional groups that make sense, e.g. "green algae" or "zooplankton". The choice is yours. Personally, I would not recommend using clustering models or (PER)MANOVAs to guide a priori aggregation. Instead, make the decision based on biological / environmental reasoning, and then perform a posteriori grouping that is in line with your hypotheses - see point 1 of "Best practices", which often gets overlooked, much to my frustration.
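A posteriori aggregation amounts to adding up the posterior draws of the sources you want to combine, draw by draw, which preserves the posterior correlation structure. A minimal sketch, using a hypothetical posterior matrix in place of real MixSIAR output (in practice the draws would come from the fitted model object):

```r
# Hypothetical posterior: 1000 MCMC draws of diet proportions for 8 sources.
# (Simulated here via normalised gamma draws so each row sums to 1.)
set.seed(42)
raw  <- matrix(rgamma(1000 * 8, shape = 1), ncol = 8)
post <- raw / rowSums(raw)
colnames(post) <- paste0("source", 1:8)

# Combine, say, sources 1, 2 and 7 into one functional group a posteriori:
# sum the proportions within each draw, then summarise the combined draws.
combined <- rowSums(post[, c("source1", "source2", "source7")])
quantile(combined, c(0.025, 0.5, 0.975))  # posterior summary for the group
```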

How many sources: You can fit as many sources as you like, but whether you will be able to make sense of the output will depend on your question, the geometry of the sources, and which sources, if any, you choose to combine a posteriori. See also Statistical basis and outputs of stable isotope mixing models: Comment on Fry (2013).

How many source samples: see point 3 in "Best practices"

AndrewLJackson commented 5 months ago

> I had a question about sample size and when to include a source.
>
> I have a model I'm building with hair from about 40 animals, cut into 4 sections (seasons). So there's a random effect for individual and one fixed effect with 4 levels. I have 8 possible source groups for this one carnivore.
>
> I'm worried about overparameterizing the model, but from what I can see, it doesn't make sense to combine any of the sources any further. I know Bayesian models can handle smaller sample sizes than frequentist methods, but how do I know if I've asked too much of the model? Is there a good rule of thumb?
>
> Also, at what point is it acceptable to leave a possible source out? I know the models expect to have "every" source, but what are the limits? For example, is it acceptable to leave a source out if it is only found a few odd times during necropsies, or if it is only found 1% of the time?
>
> [Attached: isotope biplot of the sources and consumers]

In direct reply to your original question: it sounds to me like you are being very sensible, and I would just keep going! Rather than omit a source that is unlikely to be a major component of the diet, you could use an informative prior instead of the usual vague prior. That way, the prior knowledge you have that the source is likely rare would be reflected in the model fitting process. Chiaradia et al. illustrate this approach.

Specifying priors in MixSIAR is done at run time via something like `run_model(..., alpha.prior = c(1, 3, 3, 3, 3))`, which, for 5 sources, down-weights the first source relative to the other four. There is a nice animation on the Dirichlet Distribution wiki page to help you see how you might pick alpha values.
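To build intuition for what a given `alpha.prior` implies, you can simulate from the Dirichlet distribution in base R (a Dirichlet draw is a vector of gamma draws divided by their sum), without needing MixSIAR itself. A small sketch for the `c(1, 3, 3, 3, 3)` example above:

```r
alpha <- c(1, 3, 3, 3, 3)

# Prior mean diet proportions are simply alpha / sum(alpha):
# source 1 gets 1/13, each of the other four gets 3/13.
alpha / sum(alpha)

# Simulate Dirichlet draws to see the same thing empirically
set.seed(7)
draws <- t(replicate(5000, {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}))
round(colMeans(draws), 3)  # simulated means approximate alpha / sum(alpha)
```

Plotting the columns of `draws` (e.g. with `hist()`) also shows how much prior mass sits near zero for the down-weighted source, which can guide how strongly to set the alphas.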

Mint204 commented 5 months ago

Thank you! These comments are all very helpful. I will have to look further into those papers and reread the best practices paper.