brianstock / MixSIAR

A framework for Bayesian mixing models in R:
http://brianstock.github.io/MixSIAR/
87 stars 73 forks source link

How does MixSIAR fit the source data? What is the difference between "raw" and "means + SD + n"? #102

Open brianstock opened 7 years ago

brianstock commented 7 years ago

Dear Brian,

I have a really basic question about how MixSIAR deals with variation in the sources, and have not been able to find the answer in the documentation. I’ve been advising a whole lab of students to use MixSIAR but am worried about non-normal distributions in source data. With 2 sources, my posterior distributions look substantially different if I feed MixSIAR the full source dataset vs. reduce it to means, sds, and ns. But in the code, it looks like what MixSIAR is doing is just taking means and sds from the full source data anyway. Can you tell me (a) why the full source data vs. summarized source data produce different answers, and (b) if you know of any extensions that would use full source distributions (removing problems due to non-normal source distributions)?

Thanks very much,

brianstock commented 7 years ago

Hi Rebecca,

Please see my explanation of how MixSIAR uses/fits the source data in both cases here. MixSIAR does not just take the means and sds of the source data (although these are always calculated to make the isospace plots).

a) You should get slightly different answers with "raw" vs "means + SDs" because they're fitting different models, one with covariance and one without.

b) Not sure what you mean by "full source distributions"... The "raw" data model is "fully Bayesian", but it still assumes normality. How non-normal are your source data?

-Brian

alex-koiter commented 7 years ago

Hi Brian, Thanks for posting your explanation of the differences. I too didn't realize the model was fit differently using raw or summarized data. As a follow up question can you point me in the right direction to find examples of how to chose which method is more appropriate? I suspect in many cases people have access to the raw data and I am wondering why you would ever choose to summarize oppose to using the raw data.

On a related note, it seems like the raw data option only works with 2 tracers/isotopes are there plans to implement the raw data option for cases where there is more than 2 tracers/isotopes?

As for non-normal data I have been using transformations (e.g., log10) to make my data normal. Are there reasons I should not be using transformed data in mixing models?

Thanks, Alex

brianstock commented 7 years ago

Hi Alex,

Off the top of my head, I'm not sure why you would ever summarize either - C and N isotopes in particular are usually correlated. Models with covariance take longer to run. I vaguely remember suggesting someone try using "means + SD + n", but I don't remember the reason - perhaps the model with covariance wasn't fitting well (?). Last thought: it's easier to fix the source mean/SD (i.e. turn off source fitting) using the "means + SD + n" option by setting n = 1000 (or other large number). To do that with raw data you'd have to duplicate the source data many times, which would take the model a long time to fit.

Why does it seem like the raw source data option only works with 2 tracers? It should work for any number...

I don't think there's a reason to not transform a tracer (long as you do it for source + mix + discr!). I'll ask around though.

alex-koiter commented 7 years ago

Hi Brian, Thanks for the responses, this is very helpful. In terms of the raw data and more than 2 tracers, I get the following error: "Cannot invert matrix: not positive definite", I don't think I am the only one running into this issue (see issue #85). I am not certain how to interpret the error so I am not sure on how to begin to trouble shoot it. I have tried the raw data approach with more than one data set and ran in to the same issue each time. Thanks, Alex

rjbest commented 7 years ago

Thanks very much Brian - Alex has summarized the original reason for this question - it would be great to be able to use raw data for >2 sources but I have always got the "cannot invert matrix" error as well. Maybe that has to do with what's happening with the covariance matrix? Great to know that this is the difference between raw and summarized data - thanks for that explanation.

brianstock commented 7 years ago

Hi Alex and Rebecca,

I'm sorry about the "cannot invert matrix" error. My guess is that's because we construct the source covariance matrix using individual rho and sigmas, as opposed to giving it a multivariate prior that is guaranteed to be positive definite (like the inverse-Wishart that we use for the mixture covariance prior in the "residual error only" case). Will work on this.

Re: non-normality of sources, the source data would have to be very skewed to be a problem if you have more than a few samples. The mixing models assume that the mixture (weighted average of the source means) is normally distributed, which should be by the Central Limit Theorem.

(paraphrasing Eric Ward): I'm not sure that transforming the data doesn't have an effect on the estimation of proportions, and it may be worth doing a simulation to test this (see mixdist package). Ex: with a log transform, it's not true that the mixture of logged variables is the log of the consumer signatures.

rjbest commented 7 years ago

Hi Brian - thanks very much for working on this, and for that very helpful comment about transforming variables. Makes sense.

Rebecca

ericward-noaa commented 7 years ago

Agreed with what Brian said about the matrix not being positive definite.

There's some tricks around this - but the easiest solution may be the prior.

alintern commented 5 years ago

Hello,

I have recently come across the error that says: Cannot invert matrix: not positive definite.

I am using 16 tracers and hoping to use raw data for the sources, and I'm solving for each individual mixture (much like the cladocera example) so there is a fixed effect by mixture ID.

Is there a work around to allow me to run the JAGS model without getting the "Cannot invert matrix: not positive definite" error when using more than two tracers and raw source data?

Thank you.

MaartenWynants commented 5 years ago

I am wondering if any of you have already found a way to run the model with raw source data and more than two tracers?

alintern commented 5 years ago

Hi Maarten,

Thank you for your reply!

I made a work around - but I am not sure if that was a good idea....

Do you have any suggestions?

Thank you, Anna

On Mon, 1 Jul 2019 at 23:54, Maarten Wynants notifications@github.com wrote:

I am wondering if any of you have already found a way to run the model with raw source data and more than two tracers?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/brianstock/MixSIAR/issues/102?email_source=notifications&email_token=ALKMLAWOUFRAZ3FSBGPP3OTP5IEBVA5CNFSM4DKSXSCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODY6GK5Y#issuecomment-507274615, or mute the thread https://github.com/notifications/unsubscribe-auth/ALKMLAWZYWODF6NH3ROKAZTP5IEBVANCNFSM4DKSXSCA .

-- DR ANNA LINTERN Lecturer

Department of Civil Engineering Monash University G23, 23 College Walk (Building 60) Clayton, Vic, 3800 Australia

T: +61 3 9905 4676 M: +61 422 234 413 E: anna.lintern@monash.edu w http://monash.eduww.monash.edu/engineering/annalintern Skype: anna.lintern Twitter: @AnnaLintern

Pronouns: She/hers

MaartenWynants commented 5 years ago

Hi Anna,

I managed to get it running on an older version of MixSIAR. With this version you can only include residual error (which is fine for me because I am using geochemical data from riverine/lake sediment anyway). Nonetheless, I am not sure how this older version implements the error structure and the covariance, so I'm not a 100% convinced if I can trust the results. In any case, I would prefer to use raw data, because of the high covariance between individual geochemical tracers from my riverine sources. I have also been looking through the model build code to see if I can locate the source of the error and if an easy fix would be possible, but I'm not advanced enough in R language for this I'm afraid.

Regards, Maarten

AndrewLJackson commented 5 years ago

Hi all

Brian - eigen value decomposition might be a good solution for the covariant matrix especially for more than 2 tracers. It’s pretty efficient and less restrictive than the inverse wishart. I have some code I can share if you want.

Not all that easy to just swap it out in the code though! If I get some time (I’m doing lots of coding these days) I might just go ahead and try, but will probably break a lot more than I fix.

Best wishes Andrew

––––––––––––

Dr Andrew Jackson, PhD, FTCD Associate Professor School of Natural Sciences, Department of Zoology Trinity College Dublin, the University of Dublin Dublin 2, Ireland.

+353 1 896 2728 | a.jackson@tcd.iemailto:a.jackson@tcd.ie

Twitter: @yodacomplexhttps://twitter.com/yodacomplex http://www.tcd.ie/Zoology/research/research/theoretical/AndrewJackson.php

Trinity College Dublin, the University of Dublin is ranked 1st in Ireland and in the top 100 world universities by the QS World University Rankings.

On 4 Jul 2019, at 14:59, Maarten Wynants notifications@github.com<mailto:notifications@github.com> wrote:

Hi Anna,

I managed to get it running on an older version of MixSIAR. With this version you can only include residual error (which is fine for me because I am using geochemical data from riverine/lake sediment anyway). Nonetheless, I am not sure how this older version implements the error structure and the covariance, so I'm not a 100% convinced if I can trust the results. In any case, I would prefer to use raw data, because of the high covariance between individual geochemical tracers from my riverine sources. I have also been looking through the model build code to see if I can locate the source of the error and if an easy fix would be possible, but I'm not advanced enough in R language for this I'm afraid.

Regards, Maarten

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/brianstock/MixSIAR/issues/102?email_source=notifications&email_token=AAZLLMGEYQDQNGRZUAEENG3P5X62VA5CNFSM4DKSXSCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZHPVFA#issuecomment-508492436, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AAZLLMEALGJYHM2SHCWXCYDP5X62VANCNFSM4DKSXSCA.

alex-koiter commented 5 years ago

This is great that this issue is getting some traction again. Having had no luck with raw data input, I have always just defaulted to the summarized input. I am curious/worried about the implications of using correlated tracer data when using summarized data. However, if you dig in the R code it also says that MixSIAR does fit mixture data with covariance in cases with residual only error and > 1 tracer (which is the case for me). I am not sure what this means exactly. I wonder if MixSIAR can handle correlation (redundancy) using raw and summarized but can handle it best with raw data.

AndrewLJackson commented 5 years ago

Source data are assumed to come from a multivariate normal distribution. In fitting it, there is a full covariance matrix to be estimated. Pretty straight forward for two tracers but more tricky for more. I suspect it currently only works with two tracers as it estimates two variances and a correlation coefficient from which the covariance matrix can be calculated. 11 tracers means A LOT of off diagonals to be calculated... and ultimately might perform very slowly (it will be very slow in any case regardless of method chosen).

When summary statistics are provided the tracers are assumed to come from a set univariate normal distributions so no correlation. This approach is computationally much more efficient owing to there only being k parameters per source rather than k+(k^2-k)/2

I can see how correlation structures might help identify between sources that are close together (assuming their correlation is markedly dofferent) but otherwise if they are distinct or share correlation structure I can’t see it helping any.

––––––––––––

Dr Andrew Jackson, PhD, FTCD Associate Professor School of Natural Sciences, Department of Zoology Trinity College Dublin, the University of Dublin Dublin 2, Ireland.

+353 1 896 2728 | a.jackson@tcd.iemailto:a.jackson@tcd.ie

Twitter: @yodacomplexhttps://twitter.com/yodacomplex http://www.tcd.ie/Zoology/research/research/theoretical/AndrewJackson.php

Trinity College Dublin, the University of Dublin is ranked 1st in Ireland and in the top 100 world universities by the QS World University Rankings.

On 4 Jul 2019, at 17:26, Alex Koiter notifications@github.com<mailto:notifications@github.com> wrote:

This is great that this issue is getting some traction again. Having had no luck with raw data input, I have always just defaulted to the summarized input. I am curious/worried about the implications of using correlated tracer data when using summarized data. However, if you dig in the R code it also says that MixSIAR does fit mixture data with covariance in cases with residual only error and > 1 tracer (which is the case for me). I am not sure what this means exactly. I wonder if MixSIAR can handle correlation (redundancy) using raw and summarized but can handle it best with raw data.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/brianstock/MixSIAR/issues/102?email_source=notifications&email_token=AAZLLMG72ZHEWPHX7QJT3YDP5YQE3A5CNFSM4DKSXSCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZHZUQI#issuecomment-508533313, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AAZLLMBVRBG6CQ4PPO7XDDDP5YQE3ANCNFSM4DKSXSCA.

patrickpickett commented 2 years ago

Hi all,

Just wondering if there have been any recent updates or work arounds found for this issue?

RE: Cannot invert matrix: not positive definite

Regards,

Pat

ncttrinh commented 1 year ago

Hi everyone,

Looks like this topic has been abandoned for almost two years now.... so I guess the only way to over come this issue is to use the Mean+/-SD instead of raw data.....

Hello from Australia here :) Hope someone can contribute..........