achauffou / how-random

MSc project on the predictability of ecological interactions carried out at ETH Zürich in 2021.

Simulating the pollination logit model #7

Open achauffou opened 3 years ago

achauffou commented 3 years ago

Unsuccessful simulation of the first pollination logit model: Simulations of the first model for pollination interactions yield unexpected results. The errors between the posterior parameter distributions and their true values are very large. Below are plots of the error distributions of the parameters versus the amount of data available for the given site, plant or pollinator:

param_errors_alpha.pdf param_errors_beta_pla.pdf param_errors_beta_pol.pdf param_errors_lambda.pdf param_errors_nu.pdf

A first guess is that there might be high multi-collinearity between some variables in the model.

achauffou commented 3 years ago

My first two attempts were to simplify the model (both the data generation and the Stan model) to include fewer parameters:

Removing degree and bioclimatic suitability: When removing degree and bioclimatic suitability (leaving site and species intercepts in the model), the errors remain very large.

Removing all parameters except the site-specific intercept: When removing all variables except the site-specific intercept (so the model is now very simple), the error distributions of the alpha parameter are closer to what I would expect from the model: param_errors_alpha.pdf Confidence intervals are very large for sites with few data, and they get smaller for sites with more data. The posterior means are also much closer to the true parameter values for sites with a lot of data. I am still somewhat surprised by the amount of data required to obtain reasonable estimates, but that might be because the model has become quite simple. Still, with 100 interactions the error confidence interval is much narrower than would be expected at random, suggesting that the model is (at least partly) successful.
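For concreteness, here is a minimal R sketch of this simplified setup (all sizes and values are illustrative assumptions, not the project's actual code): data are generated from a Bernoulli logit model with only a site-specific intercept.

```r
# Minimal sketch: site-intercept-only logit model (illustrative sizes/values).
set.seed(1)
n_sites <- 15
n_obs <- 100                                 # observations per site
alpha <- rnorm(n_sites, 0, 1)                # true site-specific intercepts
site_id <- rep(seq_len(n_sites), each = n_obs)
p <- plogis(alpha[site_id])                  # inverse-logit link
y <- rbinom(length(site_id), 1, p)           # simulated interaction outcomes
```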

Conclusion: I should try to make a model without species-specific intercepts but including degree and suitability. If the results are similar to my previous attempt (with site intercept only), it could mean that the problem comes from the species-specific intercepts.

achauffou commented 3 years ago

Removing species-specific intercepts: When removing species-specific intercepts from the model (leaving the site intercept, degree and suitability), the outcome is unsatisfactory. Looking at the error graphs for lambda and nu, I have a feeling that their errors might be correlated: param_errors_lambda.pdf param_errors_nu.pdf

achauffou commented 3 years ago

Including only species-specific intercepts: When including only species-specific intercepts in the model, the results are unsatisfactory: param_errors_beta_pla.pdf param_errors_beta_pol.pdf

achauffou commented 3 years ago

Extreme collinearity between pool mean and level-specific parameters: I have found the likely origin of the very large confidence intervals (at least for the beta parameters, but likely for all the others as well). It comes from the multilevel nature of the parameters. Indeed, there are extreme correlations between the pool mean (the parameter for the average value across all species/sites) and the species-specific intercepts. For example, in the model with species intercepts only, the correlation between beta_pol_bar and beta[x] is always higher than 0.95! I believe this is why the confidence intervals of beta_pla and beta_pla_bar are very large, as illustrated in the figures below:

beta_pla (the black dot is the mean of the posterior distribution; the red (resp. black) bar is the 80% (resp. 95%) confidence interval of the posterior distribution; the blue dot is the parameter's true value)

beta_pla_bar_and_sigma (the black dot is the mean of the posterior distribution; the red (resp. black) bar is the 80% (resp. 95%) confidence interval of the posterior distribution; the true beta_pla_bar and sigma_l are 0 and 1 respectively)
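The correlation itself can be checked directly on the posterior draws. A hedged sketch in R, assuming a stanfit object `fit` for the species-intercepts-only model and that the parameters follow the naming used above:

```r
# Posterior correlation between the pool mean and one species intercept.
# 'fit' and the parameter names are assumptions about the fitted model.
library(rstan)
draws <- as.matrix(fit)
cor(draws[, "beta_pol_bar"], draws[, "beta_pol[1]"])  # > 0.95 in this case
```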

achauffou commented 3 years ago

I am not sure whether this extreme correlation between the pool mean and the parameter values is normal or not. It seems logical that the higher the mean, the higher the specific values. I will try to investigate it and see where the problem might be. In any case, it is definitely indicative of an error that the pool mean (which should be 0) has such a large confidence interval.

bernibra commented 3 years ago

Try fixing the mean of the prior distributions to zero, so that beta_pla_bar is 0 by design.

achauffou commented 3 years ago

Thanks for the advice, I will try it and update you with the results. Also, I just realised that the correlation between beta_pol_bar and beta_pla_bar is -0.97, so I guess they are at the source of the collinearity problem...

bernibra commented 3 years ago

Oh I see. It seems that the way we are generating the networks might be producing weird networks... I can't tell exactly where the problem comes from, but you are definitely on the right track. Keep digging into the way the data is generated. It could also be that the model is highly overparametrized: if our simulated data produces very small networks, we might have very few datapoints relative to the number of parameters. I would count parameters and datapoints, and make sure that the latter is greater than the former.

achauffou commented 3 years ago

Thanks a lot for the help, I think we are on the right track. I just ran a quick simulation with only species-specific intercepts with the means fixed to zero as you suggested, and it is already much better (although not perfect, but that might be because I generated little data to save time): param_errors_beta_pla.pdf param_errors_beta_pol.pdf

The distributions of the beta parameters are still quite spread, but much less than before. My understanding is that when using two (or more) pool means, those means are collinear because they can compensate one another:

```
p = alpha[site_id] + beta_pla[pla_id] + beta_pol[pol_id]
  = alpha_bar + (alpha[site_id] - alpha_bar)
  + beta_pla_bar + (beta_pla[pla_id] - beta_pla_bar)
  + beta_pol_bar + (beta_pol[pol_id] - beta_pol_bar)
```

(I am not sure this equation makes a lot of sense to others, but weirdly it does to me...) From what I understand, at most one pool mean can be left unfixed (alpha would likely make sense) if we want to avoid this issue.
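A tiny R demonstration of this compensation (purely illustrative values): shifting a constant from one pool of intercepts to another leaves the linear predictor, and hence the likelihood, unchanged, so the data cannot identify the individual pool means.

```r
# Shifting mass between pools leaves the linear predictor unchanged.
set.seed(1)
alpha <- rnorm(5); beta_pla <- rnorm(10); beta_pol <- rnorm(20)
site <- sample(5, 100, replace = TRUE)
pla  <- sample(10, 100, replace = TRUE)
pol  <- sample(20, 100, replace = TRUE)
p1 <- alpha[site] + beta_pla[pla] + beta_pol[pol]
delta <- 3  # an arbitrary shift between the two pool means
p2 <- alpha[site] + (beta_pla[pla] + delta) + (beta_pol[pol] - delta)
all.equal(p1, p2)  # TRUE: both parameterizations are observationally identical
```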

I will try to gain a better understanding of the problem and fix the model, and I will keep you updated if there is any other issue (I guess one issue can hide another) or if I end up with a simulation that seems to yield strong results.

bernibra commented 3 years ago

That's 100% right, spot on! Since all the data is supposed to be standardized, I would just set all "*_bar" parameters to zero. Then, check how the model behaves again with a bit more data. I think you've found the problem, and it is certainly due to a bad design of the model.

achauffou commented 3 years ago

Awesome, I will keep you updated then. By the way, I just found another error I made when generating data: I multiplied lambda by the suitability and nu by the degree, whereas I did the opposite in the Stan model, which means I compared the wrong posterior and true values... With these two issues fixed, I am confident the model will already behave better.

bernibra commented 3 years ago

Fantastic! Looking forward to seeing the new results for the simulated data.

achauffou commented 3 years ago

22e1792add989546ad5a1c8db6017e32612aab79

bernibra commented 3 years ago

I was just thinking about this. The "_bar" parameters for lambda and nu (suitability and degree) should not be set to zero. In any case, I think it's good that we try the different formulations of the models, and multiple simulation exercises are yet to come. But having the "_bar" parameters, when possible (obviously not in the case of beta), is useful because it allows us to understand the mean effects of each covariate. If you do set them to zero though, it is very important that the suitability values and relative degrees are standardized.

achauffou commented 3 years ago

I was just about to write you an update on how things are going with the simulations. I agree that setting the pool mean for slopes to zero is probably wrong (I would actually expect these slopes to be almost, if not always, positive with non-standardized data). I have just run a simulation of such a model. I will post the results once I have had a look at them.

achauffou commented 3 years ago

Edit: I fixed an error in the plots for gamma_pla and gamma_pol. Here are the results on the simulations I have made since yesterday:

Model with intercept pool means equal to 0: I have simulated a model with the pool means of all parameters equal to 0, as shown here with the non-centered parametrization (note that I have changed the parameter letters slightly):

```
p[site_id, pla_id, pol_id] = zbeta[site_id] * sigma_beta
  + gamma_pla[pla_id] * sigma_gamma_pla + gamma_pol[pol_id] * sigma_gamma_pol
  + zlambda[site_id] * sigma_lambda * S[site_id, pla_id] * S[site_id, pol_id]
  + znu[site_id] * sigma_nu * D[pla_id] * D[pol_id]
```

It yields the following error plots, which are much better than before: param_errors_beta.pdf param_errors_gamma_pla.pdf param_errors_gamma_pol.pdf param_errors_lambda.pdf param_errors_nu.pdf In my opinion, the model seems to be specified correctly, although clusters with too few datapoints can yield inaccurate posterior distributions. I am still surprised that, although accuracy depends on the amount of data, the size of the confidence intervals does not seem to depend strongly on the amount of data. I have no good explanation for that yet.

Adding an overall intercept: When generating data for the previous model, I ensured that all pool means are equal to 0, which means that the model believes the probability of interaction is 50% when we know neither the partners nor the site. However, with real data this is not necessarily true (network connectance is usually <0.5). So I accounted for this by adding an overall intercept (and generated data for which the overall intercept is not necessarily 0):

```
p[site_id, pla_id, pol_id] = alpha + zbeta[site_id] * sigma_beta
  + gamma_pla[pla_id] * sigma_gamma_pla + gamma_pol[pol_id] * sigma_gamma_pol
  + zlambda[site_id] * sigma_lambda * S[site_id, pla_id] * S[site_id, pol_id]
  + znu[site_id] * sigma_nu * D[pla_id] * D[pol_id]
```

Here are the results: param_errors_alpha.pdf param_errors_beta.pdf param_errors_gamma_pla.pdf param_errors_gamma_pol.pdf param_errors_lambda.pdf param_errors_nu.pdf I think they look similar to the previous ones, and I would thus favor this model over the previous one.
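For reference, a data-generation sketch in R for this formulation. I split S into separate plant and pollinator suitability matrices here, and all sizes, hyperparameter values and the value of alpha are assumptions for illustration:

```r
# Sketch: simulate data under the non-centered model with an overall intercept.
set.seed(1)
n_site <- 20; n_pla <- 18; n_pol <- 34            # illustrative sizes
alpha <- -1                                       # overall intercept (connectance < 0.5)
sigma_beta <- 1; sigma_gamma_pla <- 1; sigma_gamma_pol <- 1
sigma_lambda <- 1; sigma_nu <- 1
zbeta <- rnorm(n_site); zlambda <- rnorm(n_site); znu <- rnorm(n_site)
gamma_pla <- rnorm(n_pla); gamma_pol <- rnorm(n_pol)
S_pla <- matrix(rnorm(n_site * n_pla), n_site, n_pla)  # standardized suitabilities
S_pol <- matrix(rnorm(n_site * n_pol), n_site, n_pol)
D_pla <- rnorm(n_pla); D_pol <- rnorm(n_pol)           # standardized degrees
lp <- array(NA_real_, c(n_site, n_pla, n_pol))         # linear predictor
for (s in 1:n_site) for (i in 1:n_pla) for (j in 1:n_pol) {
  lp[s, i, j] <- alpha + zbeta[s] * sigma_beta +
    gamma_pla[i] * sigma_gamma_pla + gamma_pol[j] * sigma_gamma_pol +
    zlambda[s] * sigma_lambda * S_pla[s, i] * S_pol[s, j] +
    znu[s] * sigma_nu * D_pla[i] * D_pol[j]
}
y <- array(rbinom(length(lp), 1, plogis(lp)), dim(lp))  # simulated interactions
```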

With pool means not necessarily zero for slope parameters: As you mentioned, I have also run a simulation with lambda_bar and nu_bar not necessarily zero (although I still generate data whose means are supposed to be zero): param_errors_alpha.pdf param_errors_beta.pdf param_errors_gamma_pla.pdf param_errors_gamma_pol.pdf param_errors_lambda.pdf param_errors_nu.pdf It does not look very bad, but not very good either. I am not sure whether the inaccuracy/imprecision comes from an issue or from the fact that more data is needed for such a model with many parameters to work. I have not yet checked how the posterior distributions are correlated and so on; I will keep you updated.

bernibra commented 3 years ago

It still feels like there is something off. Keep messing around with the model and see how it behaves. I'll also try to check the code this Sunday, and we can brainstorm some more on Monday. Also, if you've got time, try understanding the predictive power of the model (Bayesian R**2 or AUC). But don't stress if you don't have time.
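One possible way (not necessarily the one used in the project) to compute an AUC from the fitted model, assuming the posterior draws of the linear predictor (`lp_draws`, a draws x observations matrix) and the simulated 0/1 outcomes (`y_obs`) have already been extracted:

```r
# Sketch: AUC from posterior-mean interaction probabilities.
library(pROC)                          # assumed available; any AUC routine works
p_hat <- colMeans(plogis(lp_draws))    # posterior-mean probability per observation
auc(roc(y_obs, p_hat))
```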

achauffou commented 3 years ago

I uploaded the data, fit objects and a few plots I have made to a polybox folder. I think you should have access to the folder. The models are named as follows:

By the way, some of the previous plots were not ordered properly (for plant/pollinator effect). I have corrected that in the plots on polybox. I will keep trying to understand what issues there might be with the models tomorrow and Monday morning.

achauffou commented 3 years ago

Intercept parameters are biased towards zero: I have performed two simulations that illustrate well in my opinion why there is a bias in intercept parameters.

The first model has 43 plants, 87 pollinators and 15 sites and includes only beta and gamma parameters (with all possible interactions included): param_errors_beta.pdf param_errors_gamma_pla.pdf param_errors_gamma_pol.pdf param_posttrue beta.pdf param_posttrue gamma_pla.pdf param_posttrue gamma_pol.pdf

In another model with fewer data (only 17 plants, 34 pollinators and 6 sites), I have included a gamma intercept only for plants (pollinators have no impact on the result): param_errors_beta.pdf param_errors_gamma.pdf param_posttrue beta.pdf param_posttrue gamma.pdf

In both models, there are clear biases in the posterior distributions (in the first one between the two gammas, and in the second between beta and gamma). When comparing, for all parameters, the posterior means (across all individuals or sites) to the true parameter averages, I found that this bias systematically centers the posterior 'pool mean' around zero (as we forced the model to); that is, the posterior pool means are closer to zero than the true parameters. So I believe the bias arises because the true parameters (sampled at random) do not average to exactly zero just by chance (we would need many sites, pollinators and plants for the average to tend to zero). And when one parameter's true average is positive while another's is negative, the model will sample distributions closer to zero for both.
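A two-line illustration of this finite-sample effect (the size is borrowed from the second model above; the values are otherwise arbitrary):

```r
# The empirical mean of a finite draw of true intercepts is not exactly zero,
# so forcing the posterior pool mean to zero shifts all estimates by that offset.
set.seed(1)
gamma_true <- rnorm(17)   # e.g. 17 plant intercepts sampled from N(0, 1)
mean(gamma_true)          # not exactly 0; this is the offset the model absorbs
```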

In conclusion, I am not sure whether this bias toward zero is very problematic, since (I will check this) the more sites, plants and pollinators there are, the closer the averages of their true parameters will be to zero, and thus the smaller the bias. Additionally, the model's predictive power and the relative influence of different parameters on predictability should not be affected by this bias. Anyway, I am not sure how we could fix it. I will try to run the model again with more sites/plants/pollinators and also with varying standard deviations for the sampled parameters. I will keep you updated.

bernibra commented 3 years ago

I see what you mean, and it makes sense. I don't think that it's problematic, but I like the breakdown you made as we now understand what's happening. The key here is to think in terms of the comparisons that we are making, because this is only a problem for comparing simulated data. First, imagine that you had all three intercepts (beta for sites, and gamma for plants and pollinators) generated with a non-centered normal distribution. All the 'centers' would be picked up by the "alpha" parameter (the general intercept), so that alpha=gamma_bar_pollinator+gamma_bar_plant+beta_bar. The model could still do a great job predicting the data, but if we were to compare the posterior distributions for gamma and beta to the true non-centered values, we would see big differences. The reason for those differences is not that our model is not doing well. Instead, the reason for the differences is that we are not really comparing the same thing. The betas from the model are not the same as the betas that we used to simulate data (as we are missing their centre!). We will not be able to compare the posteriors for parameters such as beta to their true non-centered values, because there is an unidentifiability problem (we don't know how to break alpha down into the different bars). The good news is that this is not a problem.

For the sake of comparing our simulated vs estimated parameters, you just need to centre the simulated parameters before comparing them to the model posteriors. That is, before you compare the model posteriors to the simulated betas, you need to do something like 'scale(betas, center=T, scale=F)'. That would be the right comparison to make (I think!). For the sake of using our model for the empirical data, we do not need to worry about this. So the gammas will be useful to compare differences between species, and it doesn't matter whether or not these are centred (same for the betas).
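In R, that comparison might look like the following sketch (the stanfit object `fit`, the vector of simulated values `betas_true`, and the parameter name "beta" are all assumptions):

```r
# Center the simulated parameters before comparing them to the posteriors.
betas_true_centred <- scale(betas_true, center = TRUE, scale = FALSE)[, 1]
beta_post_mean <- colMeans(as.matrix(fit, pars = "beta"))   # rstan stanfit draws
plot(betas_true_centred, beta_post_mean); abline(0, 1)      # should hug the 1:1 line
```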

The good thing is that we now have a good understanding of the model's behaviour. Also, we know we need an alpha that captures all the "centres" for the gammas and betas, we know that lambda and nu need a centre, and we know that SS and DD need to be standardised. It would be good to understand how good the predictions are using AUC or R**2. Then, we can try to add an unknown predictor "Z" and see how the AUC or R2 change as you increase the "slope" parameter controlling for this "Z" (we can discuss this further).

achauffou commented 3 years ago

Awesome, thanks for the explanation, it makes sense. I am glad we got to the bottom of this. I like your suggestion of centering the true parameters, and I will make sure it is explained in the manuscript when I write about that. I will let you know as soon as I have news regarding the Bayesian R² and AUC.

bernibra commented 3 years ago

Nice! Awesome job Alain. Also, maybe we shouldn't be using graphs such as "param_errors_gamma.pdf" to compare things, and should focus instead on graphs like "param_posttrue gamma.pdf" (I think they are way easier to read).

achauffou commented 3 years ago

2202dc50d96e7155fc483457cebefaf38305ff3f 315e79f554842a5964419ce8c5b10d4915a655ca ad2eada1ee1fdd8a49731522361c092a3dd3418c f418b677300abcd5a182cf27459dfeb7825cd391 11dc3cfb0a37ba8239dff9e5d630146332c3ba4e 88fc547f8389d0abe1c88ca59d55a64a48379a3a

achauffou commented 3 years ago

As promised, here are the results of the model simulations.

Model with intercepts only: Only alpha, beta, gamma_pla, gamma_pol with 20 sites, 18 plants, 34 pollinators (all interactions included): param_posttrue alpha.pdf param_posttrue beta.pdf param_posttrue gamma_pla.pdf param_posttrue gamma_pol.pdf param_posttrue sigma_beta.pdf param_posttrue sigma_gamma_pla.pdf param_posttrue sigma_gamma_pol.pdf

It seems that the model is doing well considering the amount of data included.

Model with intercepts and slopes, all interactions: With all parameters (I haven't removed degree yet) and including 20 sites, 18 plants and 17 pollinators (all interactions included): param_posttrue beta.pdf param_posttrue gamma_pla.pdf param_posttrue gamma_pol.pdf param_posttrue lambda.pdf param_posttrue lambda_bar.pdf param_posttrue nu.pdf param_posttrue nu_bar.pdf param_posttrue sigma_beta.pdf param_posttrue sigma_gamma_pla.pdf param_posttrue sigma_gamma_pol.pdf param_posttrue sigma_lambda.pdf param_posttrue sigma_nu.pdf

Again, I don't see any issue with the model. The posterior estimates don't look biased, and the confidence intervals are reasonable (I think they reflect well the uncertainty that comes from the amount of data).

Model with intercepts and slopes, varying numbers of plants/pollinators included in each network: I am not sure I have solved the sampling issue that we talked about on Monday; I have trouble conceptualizing the issue we discussed. But I am currently trying to generate data with a limited number of interactions per site (to end up with numbers of sites, plants, pollinators and interactions similar to the real dataset). I am not showing the results now because I have just realised that I made a horrible mistake at some point...

bernibra commented 3 years ago

These posterior distributions look very healthy; I am pretty confident that the model is doing what it's supposed to be doing.

achauffou commented 3 years ago

That's awesome! I will post the ones with the sampling tomorrow morning.

bernibra commented 3 years ago

Are there any divergent transitions or high Rhat values popping up? A good tool to check the health of your chains is "launch_shinystan(stanfit)" from the shinystan R package.
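Besides shinystan, a couple of quick programmatic checks (assuming `fit` is the stanfit object):

```r
# Standard rstan diagnostics for a fitted model.
library(rstan)
check_hmc_diagnostics(fit)                          # divergences, tree depth, E-BFMI
max(summary(fit)$summary[, "Rhat"], na.rm = TRUE)   # should be very close to 1
```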

achauffou commented 3 years ago

I forgot to mention it, but there are no divergent transitions and all Rhat values are very close to 1.

bernibra commented 3 years ago

Sweet