First results for pol_binom_02 and pol_binom_03

achauffou commented 3 years ago

Results of the first pollination models: I am currently running the following two models on the real dataset:

pol_binom_03: the one with intercepts and a slope for the standardized product of bioclimatic suitabilities. At this point I have included all plants/pollinators with enough bioclimatic occurrences to yield bioclimatic suitability values with acceptable errors. After this filtering, 19667 interactions, 72 sites, 475 plants and 342 pollinators are left (this model has already finished sampling).
pol_binom_02: the one without the bioclimatic suitability predictor. It comprises only overall and per site/plant/pollinator intercepts. Most interactions could be included in this model (since there is no need for bioclimatic variables): 862325 interactions, 148 sites, 2512 plants and 4559 pollinators (that is quite a lot, so the model will take 1-2 days to sample).

I will post here some results and thoughts on interpretation as well as any issue I encounter.

achauffou commented 3 years ago

First results for pol_binom_03: I have not made any plot worth showing here yet, but I have explored the Stan output a little. I have uploaded the rstan-fit object to polybox. There are a few things that I noticed:

Good news first: the diagnostics do not yield any warning. There are no divergent transitions, max treedepth is never reached and all Rhat values are satisfactory.
Looking at the posterior distributions of the parameters, nothing unexpected seems to have happened.
For example, the 95% credible interval of the alpha intercept is between -3.2 and -2.52, which is very low but seems reasonable, since it is much more frequent that two partners have not been seen interacting than the opposite.
Another example is lambda_bar, which has a 95% credible interval between -0.07 and 0.29, suggesting that bioclimatic suitability likely has a positive influence on the probability of interaction. I have not yet thought about what this magnitude means.
But the posteriors of most parameters have quite wide credible intervals. At first sight, the parameters with the most data points have the narrowest intervals, which in my opinion indicates that it is probably not possible to make better estimates when there are only a few data points for a site/plant/pollinator. Maybe I should set a minimum number of data points required to include a site/plant/pollinator in the analysis.
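To make the scale of the alpha intercept concrete, here is a minimal sketch that converts the interval bounds quoted above into probabilities. It assumes a logit link (suggested by the binomial model names, but not stated in this thread) and sets every other term of the linear predictor to zero; the function name `inv_logit` is just illustrative.

```python
import math


def inv_logit(x: float) -> float:
    """Map a value on the log-odds scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Bounds of the 95% interval for alpha quoted above (log-odds scale).
alpha_low, alpha_high = -3.2, -2.52

# Assumption: logit link, all other terms of the linear predictor at zero.
p_low = inv_logit(alpha_low)
p_high = inv_logit(alpha_high)
print(f"baseline interaction probability: {p_low:.3f} to {p_high:.3f}")
```

Under these assumptions the baseline probability of observing an interaction is roughly 4% to 7%, which matches the remark that non-interactions are far more frequent.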

achauffou commented 3 years ago

Regarding predictability: My initial idea for analysing predictability was to compute the ROC/AUC of the entire model, and also to compute the ROC/AUC separately for each site/plant/pollinator (including only its data points). Then, in my mind, it would have been possible to compare the AUCs/ROCs of the different sites/plants/pollinators to see whether some are more or less predictable...

When I compute the overall AUC of the model, it falls around 0.84 (which is extremely close to the value obtained with the simulated data). I find that somewhat weird. Even weirder, when computing the AUC of some sites individually (not all yet), they end up with very similar AUCs (about 0.77). Most likely I have made a mistake at some point in the code or in my reasoning, because it is unlikely that we get a predictability as high as with simulated data... I will go over it once again and try to think of what might be the problem.
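As a point of comparison for debugging, here is a minimal sketch of the pooled and per-group AUC described above. It is not the code used for these results: it assumes one 0/1 outcome and one predicted probability per data point, and the names `auc`, `auc_by_group` and the toy data are hypothetical.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC via the rank-sum formula: the probability that a random positive
    is scored above a random negative, with ties counted as 1/2."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied scores and give them the average rank.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # AUC is undefined without both classes
    rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auc_by_group(groups, labels, scores):
    """Compute the AUC separately for each group (e.g. each site)."""
    buckets = defaultdict(lambda: ([], []))
    for g, y, s in zip(groups, labels, scores):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in buckets.items()}


# Toy data, purely illustrative.
sites = ["A", "A", "A", "A", "B", "B", "B", "B"]
observed = [1, 0, 1, 0, 1, 0, 0, 1]
predicted = [0.9, 0.2, 0.6, 0.4, 0.3, 0.5, 0.1, 0.8]

overall = auc(observed, predicted)
per_site = auc_by_group(sites, observed, predicted)
```

One thing worth keeping in mind when comparing the two: the pooled AUC also credits the model for ranking data points across sites (for instance through site intercepts), so per-site values do not have to match the pooled one.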

achauffou commented 3 years ago

Some plots for the results of pol_binom_03: param_post_alpha.pdf param_post_beta.pdf param_post_gamma_pla.pdf param_post_gamma_pol.pdf param_post_lambda.pdf param_post_lambda_bar.pdf param_post_sigma_beta.pdf param_post_sigma_gamma_pla.pdf param_post_sigma_gamma_pol.pdf param_post_sigma_lambda.pdf auc.pdf roc.pdf

achauffou commented 3 years ago

Looking at these plots, I noticed that the lambdas all have wide credible intervals and that there is not much difference between sites. The easy way out would be to say that the effect of bioclimatic suitability is universal and does not depend on location, but I think these results more likely come from a lack of signal (maybe this effect is difficult to estimate in the presence of the other parameters). I have a few thoughts on this point:

Maybe I should use a stricter threshold for the minimum amount of bioclimatic occurrences required, which would keep only plants/pollinators with very precise estimations of bioclimatic suitability
Maybe I should use two terms to see what happens (lambda_pol S_pol + lambda_pla S_pla), but I fear it might bring back some collinearity issues
It could be interesting to see how the model with bioclimatic suitability performs
If the location really does not influence the slope of bioclimatic suitability that much, maybe I would be better off using only one overall lambda parameter that is independent of site (I should not do that until I am sure that the site indeed has little to no effect on the lambda)

bernibra commented 3 years ago

I think we should meet for this.

Before the meeting though, can you do a couple of things:

Remove the random effect on the lambdas, so that all sites have the same, and run the model.
Try the same models without gammas.
Use WAIC to compare all the models (full model, model with the same lambda for all sites, model with no gammas but lambdas per site, model with no gammas and same lambda for all sites).
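For reference, a minimal sketch of how WAIC can be computed from a matrix of pointwise log-likelihood values (one row per posterior draw, one column per observation). The function names and toy matrices are hypothetical, and the result is reported on the deviance scale (lower is better); in practice an established package would likely be used instead.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def sample_variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def waic(log_lik):
    """WAIC from log_lik[s][i], the log-likelihood of observation i under
    posterior draw s. Returns -2 * (lppd - p_waic)."""
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters
    for i in range(n_obs):
        column = [draw[i] for draw in log_lik]
        lppd += log_mean_exp(column)
        p_waic += sample_variance(column)
    return -2.0 * (lppd - p_waic)


# Toy matrices (3 draws x 3 observations), purely illustrative.
log_lik_full = [[-0.5, -1.0, -0.7], [-0.6, -0.9, -0.8], [-0.4, -1.1, -0.6]]
log_lik_simple = [[-0.9, -1.4, -1.0], [-1.0, -1.3, -1.1], [-0.8, -1.5, -0.9]]

waic_full = waic(log_lik_full)
waic_simple = waic(log_lik_simple)
```

Comparing the four models then amounts to computing this quantity for each of them on the same set of observations.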

achauffou commented 3 years ago

Thanks for the feedback and suggestion, I will do that first thing tomorrow (I have emailed you to set a time for a meeting)

achauffou commented 3 years ago

Below are some plots for the models you suggested. Sorry for the ugly and not very useful plots for gamma_pla and gamma_pol, I should probably sample only a few illustrative parameters. I will keep working and compare their WAIC this afternoon.

pol_binom_02: The one with most datapoints, no lambda slope. param_post_alpha.pdf param_post_beta.pdf param_post_gamma_pla.pdf param_post_gamma_pol.pdf param_post_sigma_beta.pdf param_post_sigma_gamma_pla.pdf param_post_sigma_gamma_pol.pdf

pol_binom_04: With a single lambda for all sites. param_post_alpha.pdf param_post_beta.pdf param_post_gamma_pla.pdf param_post_gamma_pol.pdf param_post_lambda.pdf param_post_sigma_beta.pdf param_post_sigma_gamma_pla.pdf param_post_sigma_gamma_pol.pdf

pol_binom_05: No gammas, site-specific lambdas. param_post_alpha.pdf param_post_beta.pdf param_post_lambda.pdf param_post_lambda_bar.pdf param_post_sigma_beta.pdf param_post_sigma_lambda.pdf

pol_binom_06: No gammas, single lambda for all sites. param_post_alpha.pdf param_post_beta.pdf param_post_lambda.pdf param_post_sigma_beta.pdf

bernibra commented 3 years ago

Cool, there is a lot that we can learn from this. All these models tell part of the story and, on Friday, we will need to define what the next steps are. I'll try to give some thought on these steps before the meeting, but we will most likely need to brainstorm together about them (be prepared to think big!).

A few thoughts:

In terms of the WAIC plots, I think you will have to add a "generated quantities" block to the Stan model to calculate and store the log-likelihood values. I can help you with that on Friday.
I like the simpler models (pol_binom_06), and I think we need to add all interactions and make lambdas interaction-specific.
We should also think about the model "pol_binom_04". This is the most reasonable model in my view, but I think we need to consider the gammas as "gamma_pol x gamma_pla". To do so, you would need to set sigma_gamma_pol and sigma_gamma_pla to 1, and consider the gammas as "sigma_gamma x gamma_pol x gamma_pla", where sigma_gamma is an exponentially distributed parameter (I think... we can talk about it further). Again, we should find a way to include all interaction types in this model (probably via an interaction for lambda and sigma_gamma).
I am super interested in adding information regarding what species are "invasive". That is something well worth starting to consider. The questions that we could address with this information are reaaaaally nice. How do they get this information?
Finally, we have many species but not thaat many. Therefore, for considering trait information, it is feasible that we could consider the gammas as multivariate normal distributions, where the covariance is defined by the traits (see Gaussian process; I can show you how to code this in Stan). For this, we would need trait information (while maybe not for your master's thesis, this could be really nice to explore).

achauffou commented 3 years ago

Awesome, your ideas are super interesting and I am looking forward to discussing them more tomorrow. Until then, here is where I stand on each of your points:

That makes sense, I think I can manage to record the pointwise log-likelihood in the generated quantities block. I am considering saving the pointwise link (i.e. the probability of interaction) as well, which is needed for some posterior predictive checks. But I am concerned about two points:
- Storing the pointwise log-likelihood (and possibly the link) takes a lot of space and will certainly make the stanfit object much bigger. But I guess it is manageable for models with about 20,000 datapoints (and it seems that the log-likelihood/link are definitely needed for many analyses).
- Doing so will cause the model to take longer to run, since I could not implement it in a parallel function. For now I manage to get a very low computation time thanks to the 48 parallel threads. But at the same time, doing so in the generated quantities block will require fewer computations than in the model block, so maybe I won't have to worry too much about it. I will try it and see how it goes (hopefully I can do it before tomorrow).
Sounds good, I would need to improve the model before including various interaction types, but that should not be too much of an issue. I will start working on that next week, once I am sure that the results for pollination only make some sense.
I agree, and the good part is that merging the gammas together as a product will help a lot when working with several interaction types. Then, in my view we could have different interactions implemented as multilevel sigma_gamma and lambda.
If I can manage to get information on invasion status of the species present in the model, it could yield very nice insight. I will look at how they do it in Fricke and Svenning.
This is also one of the extensions I have in mind. Since it is likely time-consuming and I am starting to get time-constrained for the master's thesis, I will keep it in mind but won't look at it for now.

I will keep you updated until tomorrow.

bernibra commented 3 years ago

A quick comment regarding 1. I was calculating the log-likelihood values for models with 200000 points, and it worked fine (just very heavy files). I don't think it will be that much of a problem in your case. Also, the generated quantities block only runs once per iteration, after the draw has been generated and without any gradient computation, so you do not need to parallelise that (it should not take that much longer).
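A rough back-of-the-envelope estimate of the file size concern, as a sketch. It assumes 8000 saved draws (the total mentioned later in this thread for the divergent transitions; the draw count behind the 200000-point case is not stated) and 8 bytes per stored value, and it ignores any compression or object overhead. The function name is hypothetical.

```python
def log_lik_size_gb(n_draws: int, n_obs: int, bytes_per_value: int = 8) -> float:
    """Approximate in-memory size of a draws x observations matrix, in GB."""
    return n_draws * n_obs * bytes_per_value / 1e9


# Assumption: 8000 saved draws in both cases, double-precision values.
size_pol_binom_03 = log_lik_size_gb(n_draws=8000, n_obs=19667)
size_200k_points = log_lik_size_gb(n_draws=8000, n_obs=200000)

print(f"19667 observations:  {size_pol_binom_03:.2f} GB")
print(f"200000 observations: {size_200k_points:.2f} GB")
```

Under these assumptions the pointwise log-likelihood for about 20,000 datapoints stays a little above 1 GB, and storing the link as well would roughly double that.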

achauffou commented 3 years ago

Just a quick update about my latest struggles regarding the pollination models...

After including duplicates as replicates of a binomial distribution as you suggested in your email, I have performed all the pollination analyses once again. However, this time the diagnostics were not as nice as last time (although diagnostics of simulations are fine):

The two models that include pooled lambda parameters (with resp. w/o gammas) end up having 4 (resp. 8) divergent transitions out of 8000. I don't think that is a big deal since there are very few (probably just bad luck when redoing the analyses). Maybe I should slightly increase adapt_delta to see if it gets rid of these divergent transitions? Anyway, I don't think this is alarming regarding the quality of those models and results (unless you disagree).
The models that include the species-specific effects as a product (sigma_gamma gamma_pla gamma_pol) have fewer than 0.001 effective draws per transition and large R_hat for all gamma_pla and gamma_pol. I am very confident that this indicates something wrong with the model, but I am not quite sure what (although I suspect that the issue of multiplying two negative values giving a positive one might play a role). But since these models should not bring anything new to the results/discussion and there is a limited amount of time left for the master's project, I think it is wiser to put that aside for now and focus on getting comparisons between interactions as well as writing the manuscript.
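For readers unfamiliar with the binomial-replicates change mentioned at the top of this comment, here is a minimal sketch of one way duplicates could be collapsed into (successes, trials) pairs. This is a guess at the general idea, not the code used for the analyses; the record layout and the function name are hypothetical.

```python
from collections import Counter


def aggregate_replicates(records):
    """Collapse duplicate (site, plant, pollinator) rows into binomial counts.

    records: iterable of (site, plant, pollinator, interacted) tuples,
    with interacted equal to 0 or 1.
    Returns a dict mapping each key to (n_successes, n_trials).
    """
    trials = Counter()
    successes = Counter()
    for site, plant, pollinator, interacted in records:
        key = (site, plant, pollinator)
        trials[key] += 1
        successes[key] += interacted
    return {key: (successes[key], trials[key]) for key in trials}


# Toy records, purely illustrative.
records = [
    ("site1", "plantA", "polX", 1),
    ("site1", "plantA", "polX", 0),
    ("site1", "plantA", "polX", 1),
    ("site1", "plantB", "polX", 0),
]
counts = aggregate_replicates(records)
```

Each key then contributes one binomial observation (successes out of trials) instead of several separate rows.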

bernibra commented 3 years ago

Having 4 divergent transitions is not a big deal, do not worry about it (no need to increase adapt_delta).
For the model including the product of parameters: if those parameters are not constrained to be positive, you'll have Rhat problems, because different parameter values (positive and negative) can produce the same results. Therefore, the posterior distributions differ across chains. If you want to run that model, you would need to constrain the gammas to be strictly positive. That said, I totally agree with you that we should drop this model and focus on the other ones.
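The sign problem described above can be illustrated with a minimal sketch. It assumes a logit link and a linear predictor reduced to alpha + sigma_gamma * gamma_pla * gamma_pol (the other terms are left out for brevity); the parameter values are made up.

```python
import math


def inv_logit(x: float) -> float:
    """Map a value on the log-odds scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def interaction_prob(alpha, sigma_gamma, gamma_pla, gamma_pol):
    """Interaction probability under the simplified linear predictor
    alpha + sigma_gamma * gamma_pla * gamma_pol (an assumption)."""
    return inv_logit(alpha + sigma_gamma * gamma_pla * gamma_pol)


# Made-up values; flipping the sign of both gammas leaves the product unchanged.
p_original = interaction_prob(-2.8, 0.5, 1.2, -0.7)
p_flipped = interaction_prob(-2.8, 0.5, -1.2, 0.7)

print(math.isclose(p_original, p_flipped))
```

Because flipping the sign of every gamma_pla and every gamma_pol together gives exactly the same probabilities, different chains can settle on mirrored solutions, which shows up as large Rhat; constraining the gammas to be positive removes the mirrored solution.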

achauffou commented 3 years ago

Great, thanks for the advice.

achauffou / how-random

First results for pol_binom_02 and pol_binom_03 #8