arviz-devs / arviz

Exploratory analysis of Bayesian models with Python
https://python.arviz.org
Apache License 2.0

plot_ppc doesn't work with missing data #982

Open nathanbraun opened 4 years ago

nathanbraun commented 4 years ago

Hi there, per one of the exercises in Osvaldo's book, I went back and played around with the Coal Mining disaster problem referenced here: https://docs.pymc.io/notebooks/getting_started.html#Case-study-2:-Coal-mining-disasters.

Everything went fine, except when I tried to sample the posterior predictive distribution and plot the results using plot_ppc. I kept getting the error:

ValueError: x and y must have same first dimension, but have shapes (1,) and (0,)

Once I removed the two missing values from disaster_data, it worked as expected.
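For reference, a minimal sketch of that workaround in plain Python. The values below are a hypothetical excerpt, not the real dataset, and `drop_missing` is an illustrative helper name. It assumes missing years are marked with NaN, and it filters the year labels together with the counts so the two stay aligned.

```python
import math

# Hypothetical excerpt: yearly disaster counts, with math.nan marking a missing year.
years = [1851, 1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860]
disaster_data = [4, 5, 4, 0, 1, math.nan, 3, 2, math.nan, 1]


def drop_missing(labels, values):
    """Return (labels, values) with every NaN value and its label removed."""
    kept = [(lab, val) for lab, val in zip(labels, values) if not math.isnan(val)]
    kept_labels = [lab for lab, _ in kept]
    kept_values = [val for _, val in kept]
    return kept_labels, kept_values


observed_years, observed_only = drop_missing(years, disaster_data)
n_missing = len(disaster_data) - len(observed_only)
print(f"kept {len(observed_only)} values, dropped {n_missing} missing")
```

Dropping the rows only sidesteps the plotting error: the model no longer sees the years with missing counts at all.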

amukh18 commented 4 years ago

I would like to work on this issue!

amukh18 commented 4 years ago

@nathanbraun May I please have a look at your code? It would help me reproduce the error and analyse it better.

amukh18 commented 4 years ago

@OriolAbril I am not able to introduce the posterior_predictive group to the InferenceData object required for plot_ppc. Could you please give me some suggestions?

OriolAbril commented 4 years ago

The cookbook example will probably help you. Note that to obtain posterior predictive samples, pm.sample_posterior_predictive must be called.

amukh18 commented 4 years ago

@OriolAbril I was able to reproduce the error with the help of the resource you linked me to. Thank you! I will try to analyse the error.

amukh18 commented 4 years ago

@OriolAbril Would imputing the missing data in some way minimally affect the data and solve the problem or would it greatly affect the data and worsen the problem?

OriolAbril commented 4 years ago

Sorry, I don't understand the question

ahartikainen commented 4 years ago

Imputing data is a big no-no :)

https://mc-stan.org/docs/2_21/stan-users-guide/missing-data.html

amukh18 commented 4 years ago

@OriolAbril I meant to ask if imputing the data would be of any help. @ahartikainen Could you please tell me what "model is vectorised" means in the link you posted? I am also confused about how creating two different distribution variables for existing values and missing values helps here.
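To make the "two different variables" idea concrete, here is a rough sketch in plain Python. It is not code from ArviZ, PyMC3, or Stan. It only shows the bookkeeping: the data vector is split into observed values and the positions of the missing ones, so that a model could treat the missing entries as unknown quantities to be estimated rather than as fixed, filled-in numbers. The names `y_obs`, `mis_idx`, and `reassemble`, and the stand-in draw for the missing entries, are all hypothetical.

```python
import math

# Hypothetical data vector; math.nan marks entries that were never observed.
y = [4.0, math.nan, 2.0, 3.0, math.nan]

# Positions of observed and missing entries.
obs_idx = [i for i, v in enumerate(y) if not math.isnan(v)]
mis_idx = [i for i, v in enumerate(y) if math.isnan(v)]

# Observed values are data; missing values would be unknowns in the model.
y_obs = [y[i] for i in obs_idx]


def reassemble(obs_values, mis_values, obs_positions, mis_positions):
    """Rebuild a full-length vector from observed data and one draw of the unknowns."""
    full = [None] * (len(obs_positions) + len(mis_positions))
    for i, v in zip(obs_positions, obs_values):
        full[i] = v
    for i, v in zip(mis_positions, mis_values):
        full[i] = v
    return full


# Stand-in for a single posterior draw of the two missing entries (made-up numbers).
y_mis_draw = [1.0, 5.0]
completed = reassemble(y_obs, y_mis_draw, obs_idx, mis_idx)
print(completed)
```

Each posterior draw would give a different `completed` vector, which is how this differs from imputing a single fixed value before fitting.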