arviz-devs / arviz

Exploratory analysis of Bayesian models with Python
https://python.arviz.org
Apache License 2.0

create Inferencedata for multivariate data #2275

Closed Qiustander closed 10 months ago

Qiustander commented 10 months ago

Short Description

Hi all, I am using ArviZ for posterior analysis together with TensorFlow Probability. I simulated data of shape (60, 10), where 60 is the number of samples and 10 is the feature dimension. I then drew samples from the prior distribution for a prior predictive check, with shape (200, 10).

Code Example or link

prior_observations = gen_prior_samples(200)

prior_trace = az.from_dict(
    observed_data={"observations": observations}, #(60,10)
    prior_predictive={"observations": prior_observations[tf.newaxis, ...]}, #(chain, 200, 10)
    coords={"observation": np.arange(signal_dim)}, # signal_dim =10
    dims={"observations": ["observation"]},
)
az.plot_ppc(prior_trace, group="prior", num_pp_samples=100)
plt.show()

Then the error happens when calling from_dict:

ValueError: conflicting sizes for dimension 'observation': length 60 on the data but length 10 on coordinate 'observation'

I also tried "observation": np.arange(observations.shape[0]), but it does not work. I did not find any reference on plot_ppc for multivariate data, so where is the problem? Thanks

Arviz version: 0.16.1

ahartikainen commented 10 months ago

This fails when creating the inference data; it has nothing to do with ppc specifically.

Your observations have the wrong specification. Your code implies that there are 10 observations, but your data has 60 as the first dimension. So I would assume you need to define the first dimension too.

Qiustander commented 10 months ago

@ahartikainen Thanks for your reply. I had already tried adding the first dimension:

prior_trace = az.from_dict(
    observed_data={"observations": observations.numpy()},
    prior_predictive={"observations": prior_observations[tf.newaxis, ...]},
    coords={"feature": np.arange(signal_dim), "samples":np.arange(observations.shape[0])},
    dims={"observations": ["samples", "feature"]},
)

but I get ValueError: different number of dimensions on data and dims: 3 vs 4. It seems that observations is required to be 1-dimensional?

OriolAbril commented 10 months ago

The observations variable is present in both observed_data and prior_predictive, but their shapes don't match. The variable in the prior_predictive group should have the same shape as the one in observed_data, excluding the sample dimensions (chain and draw in this case).
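The shape rule described here can be checked on plain arrays before calling from_dict. This is a minimal sketch, assuming NumPy arrays with the shapes reported in this thread; the helper name matches_observed is hypothetical and not part of ArviZ.

```python
import numpy as np


def matches_observed(predictive, observed, n_sample_dims=2):
    """Return True if `predictive` has the same shape as `observed`
    once its leading sample dimensions (chain, draw) are dropped.

    Hypothetical helper for illustration; not an ArviZ function.
    """
    return predictive.shape[n_sample_dims:] == observed.shape


observed = np.zeros((60, 10))          # observed data from the original post
attempt = np.zeros((1, 200, 10))       # (chain, draw, feature), as passed in the post
expected = np.zeros((1, 200, 60, 10))  # (chain, draw) followed by the observed shape

print(matches_observed(attempt, observed))   # False: (10,) != (60, 10)
print(matches_observed(expected, observed))  # True
```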

Qiustander commented 10 months ago

@OriolAbril Hi, thanks for your reply. I am still a bit confused. I have 60 draws of observations, each with 10 dimensions. The prior_predictive is similar, but with 200 draws and 1 chain.

OK, I tried using only one observation, say observed_data={"observations": observations[0]}, and it did work, but the observed data plotted by plot_ppc looks weird since there is only one observation. Could I use multiple observations?

ahartikainen commented 10 months ago

The draws we talk about here are an MCMC concept; you don't have draws in observed_data (in MCMC).

So for pp you need dims in (chain, draws, odraws, obs)

Qiustander commented 10 months ago

If I understand you correctly, I need to expand observed_data to 4 dimensions, that is, (chain, draws, odraws, obs), and I need to define the dimension names. However, I failed to do it correctly:

prior_trace = az.from_dict(
    observed_data={"observations": observations.numpy()[tf.newaxis,tf.newaxis, ...]},
    prior_predictive={"observations": prior_observations[tf.newaxis, ...]},
    coords={
        "chain": np.arange(signal_dim),
        "odraw": np.arange(observations.shape[0]),
        "feature": np.arange(signal_dim),
        "draw": np.arange(1),
    },
    dims={"observations": ["chain", "draw", "odraw", "feature"]},
)

Could you provide a minimal reproducible example? Thanks

ahartikainen commented 10 months ago

Observed data does not need to have chain and draw dimensions. It is the prior_predictive that needs those dimensions (chain, draw, and then the same shape as your observed data).

ahartikainen commented 10 months ago

So let's just clarify some things.

Observed data is something you gathered, and it is considered static.

Posterior / prior predictive values are your "simulated" data points. Each data point (and element) in observed data has corresponding data in the predictive dataset (but instead of 1 value it has nchain*ndraw values).
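For reference, the layout described in this thread could look roughly like the following. It is only a sketch, assuming the keyword-based az.from_dict API of the ArviZ 0.16 series used in the original post, and random NumPy data standing in for the TensorFlow Probability samples. The dimension names "obs_id" and "feature" are made up for illustration.

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(0)
n_obs, n_features, n_draws = 60, 10, 200

# Observed data: static, no chain/draw dimensions -> (60, 10)
observations = rng.normal(size=(n_obs, n_features))

# Prior predictive: one full simulated dataset per draw
# -> (chain, draw, 60, 10); random stand-in for real prior samples
prior_observations = rng.normal(size=(1, n_draws, n_obs, n_features))

prior_trace = az.from_dict(
    observed_data={"observations": observations},
    prior_predictive={"observations": prior_observations},
    # dims name only the non-sample dimensions; chain and draw
    # are prepended for prior_predictive by from_dict
    coords={"obs_id": np.arange(n_obs), "feature": np.arange(n_features)},
    dims={"observations": ["obs_id", "feature"]},
)

print(prior_trace.observed_data["observations"].dims)
print(prior_trace.prior_predictive["observations"].dims)
```

With the groups laid out this way, the plotting call from the original post, az.plot_ppc(prior_trace, group="prior", num_pp_samples=100), should have matching shapes to work with.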

Qiustander commented 10 months ago

Got it now! I misunderstood the prior/posterior predictive plot. Thanks