achauffou / how-random

MSc project on the predictability of ecological interactions carried out at ETH Zürich in 2021.
0 stars 0 forks source link

First results for pol_binom_02 and pol_binom_03 #8

Open achauffou opened 3 years ago

achauffou commented 3 years ago

Results of the first pollination models: I am currently running on the real dataset the two following models:

I will post here some results and thoughts on interpretation as well as any issue I encounter.

achauffou commented 3 years ago

First results for pol_binom_03: I have not made any plot worth showing here yet, but I have explored a little bit the Stan outcome. II have uploaded the rstan-fit object to polybox. There are a few things that I noticed:

achauffou commented 3 years ago

Regarding predictability: My initial idea on how to analyse preditability was to compute the ROC/AUC of the entire model, but also to compute separately AUC/ROC separately for each site/plant/pollinator (including only its data points). Then, in my mind, it would have been possible to compare AUCs/ROCs of the different sites/plants/pollinators to see if there are some that are more or less predictable...

When I compute the overall AUC of the model, it falls around 0.84 (which is extremely close to the simulation). I find that somehow weird. Even weirder, when computing the AUC of some sites individually (not all yet), they end up with very similar AUC (of about 0.77). Most likely, I have maybe done a mistake at some point in the code or my reasoning because it is unlikely that we get a predictability as high as with simulated data... I will go over it once again and try to think of what might be the problem.

achauffou commented 3 years ago

Some plots for the results of pol_binom_03: param_post_alpha.pdf param_post_beta.pdf param_post_gamma_pla.pdf param_post_gamma_pol.pdf param_post_lambda.pdf param_post_lambda_bar.pdf param_post_sigma_beta.pdf param_post_sigma_gamma_pla.pdf param_post_sigma_gamma_pol.pdf param_post_sigma_lambda.pdf auc.pdf roc.pdf

achauffou commented 3 years ago

Looking at these plots, I noticed that lambdas all have wide confidence intervals and that there is not much difference between sites. The easy way out would be to say that the effect of bioclimatic suitability is universal and does not depend on location but I think it these results come more likely from a lack of signal (maybe this effect is difficult to estimate in the presence of other parameters). I have a few reflections on this point:

bernibra commented 3 years ago

I think we should meet for this.

Before the meeting though, can you do a couple of things:

  1. Remove the random effect on the lambdas, so that all sites have the same, and run the model.
  2. Try the same models without gammas.
  3. Use WAIC to compare all the models (full model, model with the same lambda for all sites, model with no gammas but lambdas per site, model with no gammas and same lambda for all sites).
achauffou commented 3 years ago

Thanks for the feedback and suggestion, I will do that first thing tomorrow (I have emailed you to set a time for a meeting)

achauffou commented 3 years ago

Below are some plots for the models you suggested. Sorry for the ugly and not very useful plots for gamma_pla and gamma_pol, I should probably sample only a few illustrative parameters. I will keep working and compare their WAIC this afternoon.

pol_binon_02: The one with most datapoints, no lambda slope. param_post_alpha.pdf param_post_beta.pdf param_post_gamma_pla.pdf param_post_gamma_pol.pdf param_post_sigma_beta.pdf param_post_sigma_gamma_pla.pdf param_post_sigma_gamma_pol.pdf

pol_binom_04: With a single lambda for all sites. param_post_alpha.pdf param_post_beta.pdf param_post_gamma_pla.pdf param_post_gamma_pol.pdf param_post_lambda.pdf param_post_sigma_beta.pdf param_post_sigma_gamma_pla.pdf param_post_sigma_gamma_pol.pdf

pol_binom_05: No gammas, site-specific lambdas. param_post_alpha.pdf param_post_beta.pdf param_post_lambda.pdf param_post_lambda_bar.pdf param_post_sigma_beta.pdf param_post_sigma_lambda.pdf

pol_binom_06: No gammas, single lambda for all sites. param_post_alpha.pdf param_post_beta.pdf param_post_lambda.pdf param_post_sigma_beta.pdf

bernibra commented 3 years ago

Cool, there is a lot that we can learn from this. All these models tell part of the story and, on Friday, we will need to define what the next steps are. I'll try to give some thought on these steps before the meeting, but we will most likely need to brainstorm together about them (be prepared to think big!).

A few thoughts:

  1. In terms of the WAIC plots, I think you will have to add a "generated_quantities" block to the stan model to calculate and store the log-likelihood values. I can help you with that on Friday.
  2. I like the simpler models (pol_binom_06), and I think we need to add all interactions and make lambdas interaction-specific.
  3. We should also think about the model "pol_binom_04". This is the most reasonable model in my view, but I think we need to consider the gammas as "gamma_pol x gamma_pla". To do so, you would need to set sigma_gamma_pol and sigma_gamma_pla to 1, and consider the gammas as "sigma_gamma x gamma_pol x gamma_pla", where sigma_gamma is an exponentially distributed parameter (I think... we can talk about it further). Again, we should find a way to include all interaction types in this model (probably via an interaction for lambda and sigma_gamma).
  4. I am super interested in adding information regarding what species are "invasive". That is something that could be very worth to start considering. The questions that we could address with this information are reaaaaally nice. How do they get this information?
  5. Finally, we have many species but not thaat many. Therefore, for considering trait information, it is feasible that we could consider the gammas as multivariate normal distributions, where the covariance is defined by the traits (see Gaussian process; I can show you how to code this in Stan). For this, we would need trait information (while maybe not for your master's thesis, this could be really nice to explore).
achauffou commented 3 years ago

Awesome, your ideas are super interesting and I am looking forward to discuss them more tomorrow. Until then I put here where I am at now regarding your thoughts:

  1. That makes sense, I think I can manage to record the pointwise log-likelihood in the generated quantities block. I am considering to save the pointwise link (i.e. the probability of interaction) as well which is needed for some posterior predicitve checks. But I am concerned about two points:
    • Storing the pointwise log-likelihood (and possibly link) takes a lot of space and it will certainly make the Stanfit object much bigger. But I guess it is manageable for models with about 20,000 datapoints (and it seems that log-likelihood/link are definitely needed for many analyses).
    • Doing so will cause the model to take longer to run, since I could not implement it in a parallel function. For now I am really I manage to get a very low computation time thanks to the 48 parallel threads. But at the same time doing so in the generated quantities will require less computations than in the model and maybe I won't have to worry too much about it. I will try it and see how it goes (hopefully I can do it before tomorrow).
  2. Sounds good, I would need to make improve the model before including various interaction types, but that should not be too much of an issue. I will start working on that next week once I am sure that results for pollination only make some sense.
  3. I agree, and the good part is that merging the gammas together as a product will help a lot when working with several interaction types. Then, in my view we could have different interactions implemented as multilevel sigma_gamma and lambda.
  4. If I can manage to get information on invasion status of the species present in the model, it could yield very nice insight. I will look at how they do it in Fricke and Svenning.
  5. Also one of the ideas of extensions I have in mind. Since it is likely time-consuming and I am starting to get time-constrained for the master thesis, I keep it in mind but I won't look at it for now.

I will keep you updated until tomorrow.

bernibra commented 3 years ago

A quick comment regarding 1. I was calculating the log-likelihood values for models with 200000 points, and it worked fine (just very heavy files). I don't think it will be that much of a problem in your case. Also, the generated quantities block only runs at the end of the sampling, so you do not need to parallelise that (it should not take that much longer).

achauffou commented 3 years ago

Just a quick update about my latest struggles regarding the pollination models...

After including duplicates as replicates of a binomial distribution as you suggested in your email, I have performed all the pollination analyses once again. However, this time the diagnostics were not as nice as last time (although diagnostics of simulations are fine):

bernibra commented 3 years ago
achauffou commented 3 years ago

Great thanks for the advice